Analyzing gender bias in movie dialogues

18 minute read

In this post, I analyze gender bias in Hollywood movies using character dialogues & some movie metadata. The premise: if we can predict the gender of a movie character from his / her dialogues alone, those dialogues must carry gender-specific patterns. The dataset is the Cornell Movie-Dialogs Corpus, released by Cornell University. The data pre-processing, EDA & modeling are all done in Python 3 in a Jupyter notebook environment, rendered into Markdown for this blog.

Import necessary libraries

import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier,BaggingClassifier
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

from imblearn.under_sampling import RandomUnderSampler
import eli5

import IPython
from IPython.display import display
import graphviz
from sklearn.tree import export_graphviz


warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 100)

Reading the dataset

lines_df = pd.read_csv('../input/movie_lines.tsv', sep='\t', error_bad_lines=False,
                       warn_bad_lines=False, header=None)
characters_df = pd.read_csv('../input/movie_characters_metadata.tsv', sep='\t', 
                            warn_bad_lines=False, error_bad_lines=False, header=None)

characters_df.head()
0 1 2 3 4 5
0 u0 BIANCA m0 10 things i hate about you f 4
1 u1 BRUCE m0 10 things i hate about you ? ?
2 u2 CAMERON m0 10 things i hate about you m 3
3 u3 CHASTITY m0 10 things i hate about you ? ?
4 u4 JOEY m0 10 things i hate about you m 6

Adding column names to characters dataframe

characters_df.columns=['chId','chName','mId','mName','gender','posCredits']
characters_df.head()
chId chName mId mName gender posCredits
0 u0 BIANCA m0 10 things i hate about you f 4
1 u1 BRUCE m0 10 things i hate about you ? ?
2 u2 CAMERON m0 10 things i hate about you m 3
3 u3 CHASTITY m0 10 things i hate about you ? ?
4 u4 JOEY m0 10 things i hate about you m 6
characters_df.shape
(9034, 6)

Checking the distribution of gender in the characters dataset

characters_df.gender.value_counts()
?    6008
m    1899
f     921
M     145
F      44
Name: gender, dtype: int64

We need to clean this column: the same labels appear in both lower & upper case (m/M, f/F). Let’s also remove the characters whose gender is unknown (‘?’). We’ll assign a label of 0 to male characters & 1 to female characters.

characters_df = characters_df[characters_df.gender != '?']
characters_df.gender = characters_df.gender.apply(lambda g: 0 if g in ['m', 'M'] else 1)  ## Label encoding

characters_df.shape
(3026, 6)
characters_df.gender.value_counts()
0    2044
1     982
Name: gender, dtype: int64

Let’s also take a look at the position of the character in the movie’s credits

characters_df.posCredits.value_counts()
1       497
2       443
3       352
?       330
4       268
5       211
6       169
7       125
8       100
9        79
10       54
11       40
1000     38
13       33
12       32
16       26
14       24
18       24
17       19
19       18
15       14
21       13
22        9
20        8
29        7
27        6
24        5
25        5
26        5
45        4
23        4
31        4
35        4
38        3
43        3
33        3
34        3
36        2
59        2
39        2
30        2
42        2
28        2
32        2
51        1
82        1
44        1
70        1
46        1
41        1
63        1
37        1
50        1
49        1
47        1
62        1
71        1
Name: posCredits, dtype: int64

The position of a character in the credits seems like a useful feature for classification; we can use it as a categorical variable later. But first, let’s bucket everything beyond position 9, including the unknown ‘?’ entries, into a single ‘10+’ category.

characters_df.posCredits = characters_df.posCredits.apply(lambda p: '10+' if not p in ['1', '2', '3', '4', '5', '6', '7', '8', '9'] else p)  ## Bucket positions 10 & above (and '?') together
characters_df.posCredits.value_counts()
10+    782
1      497
2      443
3      352
4      268
5      211
6      169
7      125
8      100
9       79
Name: posCredits, dtype: int64

Let’s clean the lines dataframe now!

lines_df.columns = ['lineId','chId','mId','chName','dialogue']
lines_df.head()
lineId chId mId chName dialogue
0 L1045 u0 m0 BIANCA They do not!
1 L1044 u2 m0 CAMERON They do to!
2 L985 u0 m0 BIANCA I hope so.
3 L984 u2 m0 CAMERON She okay?
4 L925 u0 m0 BIANCA Let's go.

Let’s join lines_df and characters_df together.

df = pd.merge(lines_df, characters_df, how='inner', on=['chId','mId', 'chName'],
         left_index=False, right_index=False, sort=True,
         copy=False, indicator=False)
df.head()
lineId chId mId chName dialogue mName gender posCredits
0 L1045 u0 m0 BIANCA They do not! 10 things i hate about you 1 4
1 L985 u0 m0 BIANCA I hope so. 10 things i hate about you 1 4
2 L925 u0 m0 BIANCA Let's go. 10 things i hate about you 1 4
3 L872 u0 m0 BIANCA Okay -- you're gonna need to learn how to lie. 10 things i hate about you 1 4
4 L869 u0 m0 BIANCA Like my fear of wearing pastels? 10 things i hate about you 1 4
df.shape
(229309, 8)

Remove empty dialogues from the dataset

df = df[df['dialogue'].notnull()]
df.shape
(229106, 8)

Let’s check what kind of movie metadata we can add to our dataset.

movies = pd.read_csv("../input/movie_titles_metadata.tsv", sep='\t', error_bad_lines=False,
                       warn_bad_lines=False, header=None)
movies.columns = ['mId','mName','releaseYear','rating','votes','genres']

movies.head()
mId mName releaseYear rating votes genres
0 m0 10 things i hate about you 1999 6.9 62847.0 ['comedy' 'romance']
1 m1 1492: conquest of paradise 1992 6.2 10421.0 ['adventure' 'biography' 'drama' 'history']
2 m2 15 minutes 2001 6.1 25854.0 ['action' 'crime' 'drama' 'thriller']
3 m3 2001: a space odyssey 1968 8.4 163227.0 ['adventure' 'mystery' 'sci-fi']
4 m4 48 hrs. 1982 6.9 22289.0 ['action' 'comedy' 'crime' 'drama' 'thriller']
movie_yr = movies[['mId', 'releaseYear']]
## Some release years carry extra characters after the year, so keep only the first four
movie_yr.releaseYear = pd.to_numeric(movie_yr.releaseYear.apply(lambda y: str(y)[0:4]), errors='coerce')
movie_yr = movie_yr.dropna()

We will just add the year of movie release to our dataset.

df = pd.merge(df, movie_yr, how='inner', on=['mId'],
         left_index=False, right_index=False, sort=True,
         copy=False, indicator=False)
df.head()
lineId chId mId chName dialogue mName gender posCredits releaseYear
0 L1045 u0 m0 BIANCA They do not! 10 things i hate about you 1 4 1999.0
1 L985 u0 m0 BIANCA I hope so. 10 things i hate about you 1 4 1999.0
2 L925 u0 m0 BIANCA Let's go. 10 things i hate about you 1 4 1999.0
3 L872 u0 m0 BIANCA Okay -- you're gonna need to learn how to lie. 10 things i hate about you 1 4 1999.0
4 L869 u0 m0 BIANCA Like my fear of wearing pastels? 10 things i hate about you 1 4 1999.0

Feature Engineering

  • Length of lines
  • Count of lines
  • TF-IDF vectors for dialogue tokens (built later in the pipeline)
df['lineLength'] = df.dialogue.str.len()             ## Length of each line in characters
df['wordCountLine'] = df.dialogue.str.count(' ') + 1 ## Approximate length of each line in words
df.head()
lineId chId mId chName dialogue mName gender posCredits releaseYear lineLength wordCountLine
0 L1045 u0 m0 BIANCA They do not! 10 things i hate about you 1 4 1999.0 12 3
1 L985 u0 m0 BIANCA I hope so. 10 things i hate about you 1 4 1999.0 10 3
2 L925 u0 m0 BIANCA Let's go. 10 things i hate about you 1 4 1999.0 9 2
3 L872 u0 m0 BIANCA Okay -- you're gonna need to learn how to lie. 10 things i hate about you 1 4 1999.0 46 10
4 L869 u0 m0 BIANCA Like my fear of wearing pastels? 10 things i hate about you 1 4 1999.0 32 6

Next, let’s convert the dialogues into clean tokens

  1. Remove stopwords: words such as is, am, are & the occur very often but carry little meaning on their own.
  2. Convert all words to lowercase: I, i -> i
  3. Lemmatize: reduce words to their root form. For eg., walk, walks -> walk or geographical, geographic -> geographic
wordnet_lemmatizer = WordNetLemmatizer()
def clean_dialogue( dialogue ):
    # Function to convert a raw dialogue to a string of clean tokens
    # The input is a single string (a raw dialogue), and
    # the output is a single string (a preprocessed dialogue)
    # Source : https://www.kaggle.com/akerlyn/wordcloud-based-on-character
    #
    # 1. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", dialogue) 
    #
    # 2. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    # 3. In Python, searching a set is much faster than searching
    #    a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))   
    
    # 4. Lemmatize and remove stop words
    meaningful_words = [wordnet_lemmatizer.lemmatize(w) for w in words if not w in stops]   
    #
    # 5. Join the words back into one string separated by spaces,
    #    and return the result
    return( " ".join( meaningful_words ))

df['cleaned_dialogue'] = df['dialogue'].apply(clean_dialogue)
df[['dialogue','cleaned_dialogue']].sample(5)
dialogue cleaned_dialogue
14684 Thank you. thank
58592 Hi tough guy. I guess it worked huh? hi tough guy guess worked huh
209279 I've decided not to open a practice here I wa... decided open practice want set research clinic...
95420 Am I suppose to be this sore? suppose sore
50378 You could still always give Becker an itch. 'C... could still always give becker itch course mig...

Create training dataset

Now, we can aggregate all the data for a particular movie character into one record. We will combine their dialogue tokens from the entire movie, compute their median dialogue length in characters & in words, and count the total number of lines they speak in the movie.

train = df.groupby(['chId', 'mId', 'chName', 'gender', 'posCredits','releaseYear']). \
            agg({'lineLength' : ['median'], 
                 'wordCountLine' : ['median'],
                 'chId' : ['count'],
                 'cleaned_dialogue' : [lambda x : ' '.join(x)]
                })

## Renaming columns by aggregate functions
train.columns = ["_".join(x) for x in train.columns.ravel()]

train.reset_index(inplace=True)
train
chId mId chName gender posCredits releaseYear lineLength_median wordCountLine_median chId_count cleaned_dialogue_<lambda>
0 u0 m0 BIANCA 1 4 1999.0 34.0 7.0 94 hope let go okay gonna need learn lie like fe...
1 u100 m6 AMY 1 7 1999.0 23.0 4.0 31 died sleep three day ago paper tom dead calli...
2 u1003 m65 RICHARD 0 3 1996.0 24.5 5.0 70 asked would said room room serious foolin arou...
3 u1005 m65 SETH 0 2 1996.0 37.0 8.0 163 let follow said new jesus christ carlos brothe...
4 u1008 m66 C.O. 0 10+ 1997.0 48.0 9.0 33 course uh v p security arrangement generally t...
... ... ... ... ... ... ... ... ... ... ...
2946 u980 m63 VICTOR 0 3 1931.0 32.0 6.0 126 never said name remembers kill draw line take...
2947 u983 m64 ALICE 1 10+ 2009.0 30.0 6.0 51 maybe wait mr christy killer still bill bill b...
2948 u985 m64 BILL 0 10+ 2009.0 20.0 4.0 39 twenty mile crossroad steve back hour thing st...
2949 u997 m65 JACOB 0 1 1996.0 36.0 6.0 90 meant son daughter oh daughter bathroom vacati...
2950 u998 m65 KATE 1 4 1996.0 20.0 4.5 44 everybody go home going em swear god father ch...

2951 rows × 10 columns

Let’s check some feature distributions by gender

sns.boxplot(data = train, x = 'gender', y = 'chId_count', hue = 'gender')

png

The chId_count here refers to the number of lines given to the character in the movie. While the median value is roughly similar for both males & females, the upper range extends noticeably higher for males.

sns.boxplot(data = train, x = 'gender', y = 'wordCountLine_median', hue = 'gender')

png

The number of words per line is higher for male characters than for female characters!

sns.boxplot(data = train, x = 'gender', y = 'lineLength_median', hue = 'gender')

png

The median length of a dialogue also seems to be higher for males.

sns.scatterplot(data = train, x = 'wordCountLine_median', y = 'chId_count', hue = 'gender', alpha = 0.5) 

png

Again, in the scatter plot, the female characters (yellow points) generally sit closer to the origin, since they have shorter lines & fewer lines per movie, while the male characters (blue points) spread further out from the origin.
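
To quantify what the boxplots suggest, we can compare per-gender summary statistics directly. A quick check on the same train dataframe (not part of the original notebook run):

## Median & 90th percentile of each engineered feature by gender (0 = male, 1 = female)
feats = ['chId_count', 'wordCountLine_median', 'lineLength_median']
print(train.groupby('gender')[feats].median())
print(train.groupby('gender')[feats].quantile(0.9))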

Train test split

Now, we can split our data into a training set & a validation set.

## Separating labels from features
y = train['gender']
X = train.copy()
X.drop('gender', axis=1, inplace=True)

## Removing unnecessary columns
X.drop('chId', axis=1, inplace=True)
X.drop('mId', axis=1, inplace=True)
X.drop('chName', axis=1, inplace=True)
X.head()
posCredits releaseYear lineLength_median wordCountLine_median chId_count cleaned_dialogue_<lambda>
0 4 1999.0 34.0 7.0 94 hope let go okay gonna need learn lie like fe...
1 7 1999.0 23.0 4.0 31 died sleep three day ago paper tom dead calli...
2 3 1996.0 24.5 5.0 70 asked would said room room serious foolin arou...
3 2 1996.0 37.0 8.0 163 let follow said new jesus christ carlos brothe...
4 10+ 1997.0 48.0 9.0 33 course uh v p security arrangement generally t...

We will keep an equal number of records for male & female characters, so that the classifier isn’t biased simply by class imbalance.

undersample = RandomUnderSampler(sampling_strategy='majority')
X_under, y_under = undersample.fit_resample(X, y)
y_under.value_counts()
1    948
0    948
Name: gender, dtype: int64

We’ll also keep an equal number of male & female records in the train & validation datasets by stratifying the split

X_train, X_val, y_train, y_val = train_test_split(X_under, y_under, test_size=0.2, random_state = 10, stratify=y_under)

y_val.value_counts()
1    190
0    190
Name: gender, dtype: int64
X_val.head()
posCredits releaseYear lineLength_median wordCountLine_median chId_count cleaned_dialogue_<lambda>
1236 2 2001.0 33.0 6.0 60 latitude degree maybe satellite said three ye...
924 10+ 1974.0 23.0 5.0 23 sure okay okay got boat plus owe know oh gee m...
868 4 2001.0 34.0 6.0 34 going coast alan idea alive headed need stick...
363 1 1999.0 33.0 7.0 146 poor woman carole wound could hope pacify evas...
989 10+ 2000.0 42.0 8.0 15 okay took trouble come got principle selling o...

Pipeline for classifiers

Since our dataset includes both numerical features & NLP tokens, we’ll use a small converter class in our pipeline: it flattens the single text column into the 1-D array that TfidfVectorizer expects.

class Converter(BaseEstimator, TransformerMixin):
    ## Source : https://www.kaggle.com/tylersullivan/classifying-phishing-urls-three-models
    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        return data_frame.values.ravel()

Pipeline for numeric features

numeric_features = ['lineLength_median', 'wordCountLine_median', 'chId_count', 'releaseYear']

numeric_transformer = Pipeline(steps=[('scaler', MinMaxScaler())])

Pipelines for the categorical feature & for tokens derived from dialogues

categorical_features = ['posCredits']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
vectorizer_features = ['cleaned_dialogue_<lambda>']
vectorizer_transformer = Pipeline(steps=[
    ('con', Converter()),
    ('tf', TfidfVectorizer())])

Now, we can combine the preprocessing pipelines with the classifiers. We will try 4 basic models:

  • Linear Support Vector Classifier
  • Logistic Regression Classifier
  • Naive Bayes Classifier
  • Random Forest Classifier
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('vec', vectorizer_transformer, vectorizer_features)
    ])

svc_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', CalibratedClassifierCV(LinearSVC()))])  ## LinearSVC has no predict_proba method

log_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

nb_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', MultinomialNB())])

rf_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(n_estimators=120, min_samples_leaf=10, 
                                                            max_features=0.7, n_jobs=-1, oob_score=True))])

Fitting the preprocessing & classifier pipelines on training data

svc_clf.fit(X_train, y_train)
log_clf.fit(X_train, y_train)
nb_clf.fit(X_train, y_train)
rf_clf.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   MinMaxScaler())]),
                                                  ['lineLength_median',
                                                   'wordCountLine_median',
                                                   'chId_count',
                                                   'releaseYear']),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['posCredits']),
                                                 ('vec',
                                                  Pipeline(steps=[('con',
                                                                   Converter()),
                                                                  ('tf',
                                                                   TfidfVectorizer())]),
                                                  ['cleaned_dialogue_<lambda>'])])),
                ('classifier',
                 RandomForestClassifier(max_features=0.7, min_samples_leaf=10,
                                        n_estimators=120, n_jobs=-1,
                                        oob_score=True))])

Check results on the validation set

def results(name: str, model: BaseEstimator) -> None:
    '''
    Custom function to check model performance on validation set
    '''
    preds = model.predict(X_val)

    print(name + " score: %.3f" % model.score(X_val, y_val))
    print(classification_report(y_val, preds))
    labels = ['Male', 'Female']

    conf_matrix = confusion_matrix(y_val, preds)
    plt.figure(figsize= (10,6))
    sns.heatmap(conf_matrix, xticklabels=labels, yticklabels=labels, annot=True, fmt="d", cmap='Blues')
    plt.title("Confusion Matrix for " + name)
    plt.ylabel('True Class')
    plt.xlabel('Predicted Class')
results("SVC" , svc_clf)
results("Logistic Regression" , log_clf)
results("Naive Bayes" , nb_clf)
results("Random Forest" , rf_clf)
SVC score: 0.787
              precision    recall  f1-score   support

           0       0.77      0.82      0.79       190
           1       0.80      0.76      0.78       190

    accuracy                           0.79       380
   macro avg       0.79      0.79      0.79       380
weighted avg       0.79      0.79      0.79       380

Logistic Regression score: 0.768
              precision    recall  f1-score   support

           0       0.77      0.77      0.77       190
           1       0.77      0.76      0.77       190

    accuracy                           0.77       380
   macro avg       0.77      0.77      0.77       380
weighted avg       0.77      0.77      0.77       380

Naive Bayes score: 0.761
              precision    recall  f1-score   support

           0       0.74      0.80      0.77       190
           1       0.78      0.72      0.75       190

    accuracy                           0.76       380
   macro avg       0.76      0.76      0.76       380
weighted avg       0.76      0.76      0.76       380

Random Forest score: 0.721
              precision    recall  f1-score   support

           0       0.72      0.72      0.72       190
           1       0.72      0.72      0.72       190

    accuracy                           0.72       380
   macro avg       0.72      0.72      0.72       380
weighted avg       0.72      0.72      0.72       380

png

png

png

png

We see that the Linear SVC performs the best classification, with an accuracy & F1 score of ~0.79!

From the confusion matrix, we can see that out of the 190 male characters in the validation set, the SVC model classified 155 correctly as male & the remaining 35 incorrectly as female. Similarly, out of the 190 female characters, 144 were classified correctly & 46 incorrectly.
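
As a quick arithmetic check, we can read the per-class recall straight off the confusion matrix (a sketch re-using the fitted svc_clf):

## Per-class recall = diagonal / row sums; expect roughly [0.82, 0.76]
cm = confusion_matrix(y_val, svc_clf.predict(X_val))
print(cm.diagonal() / cm.sum(axis=1))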

Logistic Regression & Naive Bayes are close behind at 77% & 76% accuracy respectively. These results are not close to state of the art, but they are still pretty good.

Let’s now explore which features contribute the most to our classifiers’ performance, using some model explainability techniques.

Feature importance

Creating a list of all features including numeric, categorical & vectorised features.

vect_columns = list(svc_clf.named_steps['preprocessor'].named_transformers_['vec'].named_steps['tf'].get_feature_names())
onehot_columns = list(svc_clf.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot'].get_feature_names(input_features=categorical_features))
numeric_features_list = list(numeric_features)
numeric_features_list.extend(onehot_columns)
numeric_features_list.extend(vect_columns)

Feature importance for Logistic Regression

lr_weights = eli5.explain_weights_df(log_clf.named_steps['classifier'], top=30, feature_names=numeric_features_list)
lr_weights.head(15)
target feature weight
0 1 oh 2.914220
1 1 really 1.598428
2 1 love 1.582584
3 1 hi 1.173297
4 1 said 1.161967
5 1 want 1.053116
6 1 like 1.020965
7 1 darling 0.992410
8 1 never 0.987203
9 1 child 0.975300
10 1 please 0.970647
11 1 god 0.941234
12 1 know 0.913415
13 1 honey 0.903731
14 1 school 0.897518
lr_weights.tail(14)
target feature weight
16 1 son -0.876330
17 1 good -0.897884
18 1 right -0.899643
19 1 chId_count -0.916359
20 1 fuck -0.996575
21 1 fuckin -1.049543
22 1 yeah -1.091158
23 1 hell -1.137874
24 1 got -1.162604
25 1 shit -1.162634
26 1 sir -1.201654
27 1 gotta -1.246364
28 1 hey -1.549047
29 1 man -2.188670

We see that dialogue keywords like oh, love, like, darling, want & honey are strong indicators that a character is female, while keywords like son, sir, man, hell, gotta, yeah & most cuss words are far more common in the dialogues of male characters in these Hollywood movies!

Let’s also try to visualize a single decision tree

We can train a single decision tree by fitting a Random Forest with just one estimator.

m = RandomForestClassifier(n_estimators=1, min_samples_leaf=5, max_depth = 3, 
                           oob_score=True, random_state=123)  ## np.random.seed(123) returns None, so pass the seed directly
dt_clf = Pipeline(steps=[('preprocessor', preprocessor),
                         ('classifier', m)])

dt_clf.fit(X_train, y_train)
results("Decision Tree Classifier", dt_clf)
Decision Tree Classifier score: 0.526
              precision    recall  f1-score   support

           0       0.54      0.37      0.44       190
           1       0.52      0.68      0.59       190

    accuracy                           0.53       380
   macro avg       0.53      0.53      0.51       380
weighted avg       0.53      0.53      0.51       380

png

While a single decision tree is a poor classifier, with accuracy barely above 50%, bagging enough of these weak learners into a Random Forest improves performance drastically!
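
The already-imported BaggingClassifier expresses this idea directly. A sketch, assuming the same preprocessor as above (not fitted in the original notebook):

from sklearn.tree import DecisionTreeClassifier

## Bag 120 shallow trees, each as weak as the single tree visualized below
bag_clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', BaggingClassifier(
                              base_estimator=DecisionTreeClassifier(max_depth=3, min_samples_leaf=5),
                              n_estimators=120))])
## bag_clf.fit(X_train, y_train); results("Bagged Trees", bag_clf)

Now, let’s look at how the splits are made for a single decision tree.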

'''
def draw_tree(t, df, size=10, ratio=0.6, precision=0):
    """ 
    Draws a representation of a decision tree in IPython. Source : fastai v0.7
    Have commented the function definition here due to a Jekyll build error related to Liquid objects.
    """
    s=export_graphviz(t, out_file=None, feature_names=numeric_features_list, filled=True,
                      special_characters=True, rotate=True, precision=precision, 
                      proportion=True, class_names = ["male", "female"], impurity = False)
    IPython.display.display(graphviz.Source(re.sub('Tree {',
       f'Tree size={size}; ratio={ratio}', s)))
'''

draw_tree(m.estimators_[0], X_train, precision=2)

svg

Here, the blue coloured nodes have a majority of female samples, while the orange coloured nodes have a majority of male samples. The tree starts with a mixed sample at the root, but its leaves are biased towards one class or the other. Most splits happen on dialogue tokens. For eg., in the above tree, if the tf-idf value of the keyword think is > 0.1 & that of kid is > 0.03, the samples are classified as female.

Feature importance for the Random Forest model

eli5.explain_weights_df(rf_clf.named_steps['classifier'], top=30, feature_names=numeric_features_list)
feature weight std
0 oh 0.077328 0.027570
1 man 0.040615 0.030255
2 love 0.022664 0.026011
3 shit 0.019557 0.027031
4 said 0.017135 0.018359
5 lineLength_median 0.015757 0.017705
6 got 0.013712 0.018381
7 really 0.013227 0.018366
8 hey 0.012435 0.019035
9 good 0.012253 0.017000
10 look 0.011174 0.016425
11 right 0.009775 0.012608
12 sir 0.009673 0.018090
13 think 0.009661 0.014078
14 know 0.009660 0.012629
15 em 0.008794 0.019125
16 like 0.008730 0.011206
17 understand 0.008263 0.013995
18 want 0.008112 0.012310
19 yeah 0.007471 0.014095
20 get 0.007464 0.010990
21 would 0.007399 0.010665
22 chId_count 0.007074 0.010570
23 come 0.006845 0.010507
24 god 0.006782 0.011838
25 releaseYear 0.006504 0.009230
26 hi 0.006305 0.016955
27 one 0.006265 0.009646
28 gotta 0.006075 0.014320
29 child 0.005858 0.013265

We see that the median length of a dialogue, the total number of lines (chId_count) & the movie release year are important features for the Random Forest model, alongside the tokens extracted from the characters’ dialogues!
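
Impurity-based importances can over-weight high-cardinality features, so permutation importance on the validation set makes a useful cross-check. A sketch (not part of the original run):

from sklearn.inspection import permutation_importance

## Shuffle each raw input column of X_val in turn & measure the drop in accuracy
perm = permutation_importance(rf_clf, X_val, y_val, n_repeats=5, random_state=10)
for idx in perm.importances_mean.argsort()[::-1]:
    print(X_val.columns[idx], perm.importances_mean[idx])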

Next Steps

Some possible ways to further improve the classifier’s performance could be:

  • using bi-grams or tri-grams for the dialogue tokens (see the sketch after this list)
  • adding features based on sentiments extracted from the dialogues
  • adding a feature that measures the level of objectivity or subjectivity of a dialogue
  • hyper-parameter tuning of our models
  • trying out XGBoost or neural network models
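
The n-gram idea, for instance, is a one-line change to the TfidfVectorizer step. A sketch, assuming the same Converter & pipeline structure as above:

vectorizer_transformer_ngram = Pipeline(steps=[
    ('con', Converter()),
    ('tf', TfidfVectorizer(ngram_range=(1, 2), min_df=2))])  ## unigrams + bi-grams; min_df trims rare n-grams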

Still, our current best model (Linear SVC) correctly classifies roughly 4 out of 5 movie characters (79% accuracy) using the dialogues they speak and some movie metadata like the release year & the character’s position in the credits. We can safely say that our model is able to capture gender-specific bias in the characters of these Hollywood movies.

If you would like to play around with the code, the complete Jupyter notebook is available here on Kaggle.