Analyzing gender bias in movie dialogues

18 minute read

In this post, I analyze gender bias in Hollywood movies using character dialogues & some movie metadata. The premise: if we can predict the gender of a movie character from his / her dialogues alone, those dialogues must carry gender-specific patterns. The dataset is the Cornell Movie-Dialogs Corpus, released by Cornell University. The data pre-processing, EDA & modeling are all done in Python 3 in a Jupyter notebook environment, rendered into Markdown for this blog.

Import necessary libraries

import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier,BaggingClassifier
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

from imblearn.under_sampling import RandomUnderSampler
import eli5

import IPython
from IPython.display import display
import graphviz
from sklearn.tree import export_graphviz


warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 100)

Reading the dataset

lines_df = pd.read_csv('../input/movie_lines.tsv', sep='\t', error_bad_lines=False,
                       warn_bad_lines=False, header=None)
characters_df = pd.read_csv('../input/movie_characters_metadata.tsv', sep='\t', 
                            warn_bad_lines=False, error_bad_lines=False, header=None)

characters_df.head()
0 1 2 3 4 5
0 u0 BIANCA m0 10 things i hate about you f 4
1 u1 BRUCE m0 10 things i hate about you ? ?
2 u2 CAMERON m0 10 things i hate about you m 3
3 u3 CHASTITY m0 10 things i hate about you ? ?
4 u4 JOEY m0 10 things i hate about you m 6

Adding column names to characters dataframe

characters_df.columns=['chId','chName','mId','mName','gender','posCredits']
characters_df.head()
chId chName mId mName gender posCredits
0 u0 BIANCA m0 10 things i hate about you f 4
1 u1 BRUCE m0 10 things i hate about you ? ?
2 u2 CAMERON m0 10 things i hate about you m 3
3 u3 CHASTITY m0 10 things i hate about you ? ?
4 u4 JOEY m0 10 things i hate about you m 6
characters_df.shape
(9034, 6)

Checking the distribution of gender in the characters dataset

characters_df.gender.value_counts()
?    6008
m    1899
f     921
M     145
F      44
Name: gender, dtype: int64

We need to clean this column: the same labels appear in both lower & upper case (m/M, f/F). Let’s also remove the characters whose gender is unknown (‘?’). We’ll assign a label of 0 to male characters & 1 to female characters.

characters_df = characters_df[characters_df.gender != '?']
characters_df.gender = characters_df.gender.apply(lambda g: 0 if g in ['m', 'M'] else 1)  ## Label encoding

characters_df.shape
(3026, 6)
characters_df.gender.value_counts()
0    2044
1     982
Name: gender, dtype: int64

Let’s also take a look at the position of the character in the movie’s credits

characters_df.posCredits.value_counts()
1       497
2       443
3       352
?       330
4       268
5       211
6       169
7       125
8       100
9        79
10       54
11       40
1000     38
13       33
12       32
16       26
14       24
18       24
17       19
19       18
15       14
21       13
22        9
20        8
29        7
27        6
24        5
25        5
26        5
45        4
23        4
31        4
35        4
38        3
43        3
33        3
34        3
36        2
59        2
39        2
30        2
42        2
28        2
32        2
51        1
82        1
44        1
70        1
46        1
41        1
63        1
37        1
50        1
49        1
47        1
62        1
71        1
Name: posCredits, dtype: int64

The position of a character in the credits seems like a useful feature for classification; we can use it as a categorical variable later. But first, let’s bucket everything beyond position 9, including the unknown ‘?’ entries, into a single ‘10+’ category.

characters_df.posCredits = characters_df.posCredits.apply(lambda p: '10+' if not p in ['1', '2', '3', '4', '5', '6', '7', '8', '9'] else p)  ## Bucket positions 10 & above (and '?') together
characters_df.posCredits.value_counts()
10+    782
1      497
2      443
3      352
4      268
5      211
6      169
7      125
8      100
9       79
Name: posCredits, dtype: int64

Let’s clean the lines dataframe now!

lines_df.columns = ['lineId','chId','mId','chName','dialogue']
lines_df.head()
lineId chId mId chName dialogue
0 L1045 u0 m0 BIANCA They do not!
1 L1044 u2 m0 CAMERON They do to!
2 L985 u0 m0 BIANCA I hope so.
3 L984 u2 m0 CAMERON She okay?
4 L925 u0 m0 BIANCA Let's go.

Let’s join lines_df and characters_df together.

df = pd.merge(lines_df, characters_df, how='inner', on=['chId','mId', 'chName'],
         left_index=False, right_index=False, sort=True,
         copy=False, indicator=False)
df.head()
lineId chId mId chName dialogue mName gender posCredits
0 L1045 u0 m0 BIANCA They do not! 10 things i hate about you 1 4
1 L985 u0 m0 BIANCA I hope so. 10 things i hate about you 1 4
2 L925 u0 m0 BIANCA Let's go. 10 things i hate about you 1 4
3 L872 u0 m0 BIANCA Okay -- you're gonna need to learn how to lie. 10 things i hate about you 1 4
4 L869 u0 m0 BIANCA Like my fear of wearing pastels? 10 things i hate about you 1 4
df.shape
(229309, 8)

Remove empty dialogues from the dataset

df = df[df['dialogue'].notnull()]
df.shape
(229106, 8)

Let’s check what kind of movie metadata we can add to our dataset.

movies = pd.read_csv("../input/movie_titles_metadata.tsv", sep='\t', error_bad_lines=False,
                       warn_bad_lines=False, header=None)
movies.columns = ['mId','mName','releaseYear','rating','votes','genres']

movies.head()
mId mName releaseYear rating votes genres
0 m0 10 things i hate about you 1999 6.9 62847.0 ['comedy' 'romance']
1 m1 1492: conquest of paradise 1992 6.2 10421.0 ['adventure' 'biography' 'drama' 'history']
2 m2 15 minutes 2001 6.1 25854.0 ['action' 'crime' 'drama' 'thriller']
3 m3 2001: a space odyssey 1968 8.4 163227.0 ['adventure' 'mystery' 'sci-fi']
4 m4 48 hrs. 1982 6.9 22289.0 ['action' 'comedy' 'crime' 'drama' 'thriller']
movie_yr = movies[['mId', 'releaseYear']]
## Some release years carry extra characters after the year, so keep only the first four
movie_yr.releaseYear = pd.to_numeric(movie_yr.releaseYear.apply(lambda y: str(y)[0:4]), errors='coerce')
movie_yr = movie_yr.dropna()

We will just add the year of movie release to our dataset.

df = pd.merge(df, movie_yr, how='inner', on=['mId'],
         left_index=False, right_index=False, sort=True,
         copy=False, indicator=False)
df.head()
lineId chId mId chName dialogue mName gender posCredits releaseYear
0 L1045 u0 m0 BIANCA They do not! 10 things i hate about you 1 4 1999.0
1 L985 u0 m0 BIANCA I hope so. 10 things i hate about you 1 4 1999.0
2 L925 u0 m0 BIANCA Let's go. 10 things i hate about you 1 4 1999.0
3 L872 u0 m0 BIANCA Okay -- you're gonna need to learn how to lie. 10 things i hate about you 1 4 1999.0
4 L869 u0 m0 BIANCA Like my fear of wearing pastels? 10 things i hate about you 1 4 1999.0

Feature Engineering

  • Length of lines
  • Count of lines
  • TF-IDF vectors for dialogue tokens (built later in the pipeline)
df['lineLength'] = df.dialogue.str.len()             ## Length of each line in characters
df['wordCountLine'] = df.dialogue.str.count(' ') + 1 ## Approximate length of each line in words
df.head()
lineId chId mId chName dialogue mName gender posCredits releaseYear lineLength wordCountLine
0 L1045 u0 m0 BIANCA They do not! 10 things i hate about you 1 4 1999.0 12 3
1 L985 u0 m0 BIANCA I hope so. 10 things i hate about you 1 4 1999.0 10 3
2 L925 u0 m0 BIANCA Let's go. 10 things i hate about you 1 4 1999.0 9 2
3 L872 u0 m0 BIANCA Okay -- you're gonna need to learn how to lie. 10 things i hate about you 1 4 1999.0 46 10
4 L869 u0 m0 BIANCA Like my fear of wearing pastels? 10 things i hate about you 1 4 1999.0 32 6

Next, let’s convert the dialogues into clean tokens

  1. Remove stopwords: words such as is, am, are & the occur very often but carry little meaning on their own.
  2. Convert all words to lowercase: I, i -> i
  3. Lemmatize: reduce words to their root form. For eg., walk, walks -> walk or geographical, geographic -> geographic
wordnet_lemmatizer = WordNetLemmatizer()
def clean_dialogue( dialogue ):
    # Function to convert a raw dialogue to a string of clean tokens
    # The input is a single string (a raw dialogue), and
    # the output is a single string (a preprocessed dialogue)
    # Source : https://www.kaggle.com/akerlyn/wordcloud-based-on-character
    #
    # 1. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", dialogue) 
    #
    # 2. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    # 3. In Python, searching a set is much faster than searching
    #    a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))   
    
    # 4. Lemmatize and remove stop words
    meaningful_words = [wordnet_lemmatizer.lemmatize(w) for w in words if not w in stops]   
    #
    # 5. Join the words back into one string separated by spaces,
    #    and return the result
    return( " ".join( meaningful_words ))

df['cleaned_dialogue'] = df['dialogue'].apply(clean_dialogue)
df[['dialogue','cleaned_dialogue']].sample(5)
dialogue cleaned_dialogue
14684 Thank you. thank
58592 Hi tough guy. I guess it worked huh? hi tough guy guess worked huh
209279 I've decided not to open a practice here I wa... decided open practice want set research clinic...
95420 Am I suppose to be this sore? suppose sore
50378 You could still always give Becker an itch. 'C... could still always give becker itch course mig...

Create training dataset

Now, we can aggregate all the data for a particular movie character into one record. We will combine their dialogue tokens from the entire movie, compute their median dialogue length in characters & in words, and count the total number of lines they speak in the movie.

train = df.groupby(['chId', 'mId', 'chName', 'gender', 'posCredits','releaseYear']). \
            agg({'lineLength' : ['median'], 
                 'wordCountLine' : ['median'],
                 'chId' : ['count'],
                 'cleaned_dialogue' : [lambda x : ' '.join(x)]
                })

## Renaming columns by aggregate functions
train.columns = ["_".join(x) for x in train.columns.ravel()]

train.reset_index(inplace=True)
train
chId mId chName gender posCredits releaseYear lineLength_median wordCountLine_median chId_count cleaned_dialogue_<lambda>
0 u0 m0 BIANCA 1 4 1999.0 34.0 7.0 94 hope let go okay gonna need learn lie like fe...
1 u100 m6 AMY 1 7 1999.0 23.0 4.0 31 died sleep three day ago paper tom dead calli...
2 u1003 m65 RICHARD 0 3 1996.0 24.5 5.0 70 asked would said room room serious foolin arou...
3 u1005 m65 SETH 0 2 1996.0 37.0 8.0 163 let follow said new jesus christ carlos brothe...
4 u1008 m66 C.O. 0 10+ 1997.0 48.0 9.0 33 course uh v p security arrangement generally t...
... ... ... ... ... ... ... ... ... ... ...
2946 u980 m63 VICTOR 0 3 1931.0 32.0 6.0 126 never said name remembers kill draw line take...
2947 u983 m64 ALICE 1 10+ 2009.0 30.0 6.0 51 maybe wait mr christy killer still bill bill b...
2948 u985 m64 BILL 0 10+ 2009.0 20.0 4.0 39 twenty mile crossroad steve back hour thing st...
2949 u997 m65 JACOB 0 1 1996.0 36.0 6.0 90 meant son daughter oh daughter bathroom vacati...
2950 u998 m65 KATE 1 4 1996.0 20.0 4.5 44 everybody go home going em swear god father ch...

2951 rows × 10 columns

Let’s check some feature distributions by gender

sns.boxplot(data = train, x = 'gender', y = 'chId_count', hue = 'gender')

png

The chId_count here refers to the number of lines given to the character in the movie. While the median value is roughly similar for both males & females, the upper range extends noticeably higher for males.

sns.boxplot(data = train, x = 'gender', y = 'wordCountLine_median', hue = 'gender')

png

The number of words per line is higher for male characters than for female characters!

sns.boxplot(data = train, x = 'gender', y = 'lineLength_median', hue = 'gender')

png

The median length of a dialogue also seems to be higher for males.

sns.scatterplot(data = train, x = 'wordCountLine_median', y = 'chId_count', hue = 'gender', alpha = 0.5) 

png

Again, in the scatter plot, the female characters (yellow points) generally sit closer to the origin, since they have shorter lines & fewer lines per movie, while the male characters (blue points) spread further out from the origin.
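
To quantify what the boxplots suggest, we can compare per-gender summary statistics directly. A quick check on the same train dataframe (not part of the original notebook run):

## Median & 90th percentile of each engineered feature by gender (0 = male, 1 = female)
feats = ['chId_count', 'wordCountLine_median', 'lineLength_median']
print(train.groupby('gender')[feats].median())
print(train.groupby('gender')[feats].quantile(0.9))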

Train test split

Now, we can split our data into a training set & a validation set.

## Separating labels from features
y = train['gender']
X = train.copy()
X.drop('gender', axis=1, inplace=True)

## Removing unnecessary columns
X.drop('chId', axis=1, inplace=True)
X.drop('mId', axis=1, inplace=True)
X.drop('chName', axis=1, inplace=True)
X.head()
posCredits releaseYear lineLength_median wordCountLine_median chId_count cleaned_dialogue_<lambda>
0 4 1999.0 34.0 7.0 94 hope let go okay gonna need learn lie like fe...
1 7 1999.0 23.0 4.0 31 died sleep three day ago paper tom dead calli...
2 3 1996.0 24.5 5.0 70 asked would said room room serious foolin arou...
3 2 1996.0 37.0 8.0 163 let follow said new jesus christ carlos brothe...
4 10+ 1997.0 48.0 9.0 33 course uh v p security arrangement generally t...

We will keep an equal number of records for male & female characters, so that the classifier isn’t biased simply by class imbalance.

undersample = RandomUnderSampler(sampling_strategy='majority')
X_under, y_under = undersample.fit_resample(X, y)
y_under.value_counts()
1    948
0    948
Name: gender, dtype: int64

We’ll also keep an equal number of male & female records in the train & validation datasets by stratifying the split

X_train, X_val, y_train, y_val = train_test_split(X_under, y_under, test_size=0.2, random_state = 10, stratify=y_under)

y_val.value_counts()
1    190
0    190
Name: gender, dtype: int64
X_val.head()
posCredits releaseYear lineLength_median wordCountLine_median chId_count cleaned_dialogue_<lambda>
1236 2 2001.0 33.0 6.0 60 latitude degree maybe satellite said three ye...
924 10+ 1974.0 23.0 5.0 23 sure okay okay got boat plus owe know oh gee m...
868 4 2001.0 34.0 6.0 34 going coast alan idea alive headed need stick...
363 1 1999.0 33.0 7.0 146 poor woman carole wound could hope pacify evas...
989 10+ 2000.0 42.0 8.0 15 okay took trouble come got principle selling o...

Pipeline for classifiers

Since our dataset includes both numerical features & NLP tokens, we’ll use a small converter class in our pipeline: it flattens the single text column into the 1-D array that TfidfVectorizer expects.

class Converter(BaseEstimator, TransformerMixin):
    ## Source : https://www.kaggle.com/tylersullivan/classifying-phishing-urls-three-models
    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        return data_frame.values.ravel()

Pipeline for numeric features

numeric_features = ['lineLength_median', 'wordCountLine_median', 'chId_count', 'releaseYear']

numeric_transformer = Pipeline(steps=[('scaler', MinMaxScaler())])

Pipelines for the categorical feature & for tokens derived from dialogues

categorical_features = ['posCredits']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
vectorizer_features = ['cleaned_dialogue_<lambda>']
vectorizer_transformer = Pipeline(steps=[
    ('con', Converter()),
    ('tf', TfidfVectorizer())])

Now, we can combine the preprocessing pipelines with the classifiers. We will try 4 basic models:

  • Linear Support Vector Classifier
  • Logistic Regression Classifier
  • Naive Bayes Classifier
  • Random Forest Classifier
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('vec', vectorizer_transformer, vectorizer_features)
    ])

svc_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', CalibratedClassifierCV(LinearSVC()))])  ## LinearSVC has no predict_proba method

log_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

nb_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', MultinomialNB())])

rf_clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(n_estimators=120, min_samples_leaf=10, 
                                                            max_features=0.7, n_jobs=-1, oob_score=True))])

Fitting the preprocessing & classifier pipelines on training data

svc_clf.fit(X_train, y_train)
log_clf.fit(X_train, y_train)
nb_clf.fit(X_train, y_train)
rf_clf.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   MinMaxScaler())]),
                                                  ['lineLength_median',
                                                   'wordCountLine_median',
                                                   'chId_count',
                                                   'releaseYear']),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['posCredits']),
                                                 ('vec',
                                                  Pipeline(steps=[('con',
                                                                   Converter()),
                                                                  ('tf',
                                                                   TfidfVectorizer())]),
                                                  ['cleaned_dialogue_<lambda>'])])),
                ('classifier',
                 RandomForestClassifier(max_features=0.7, min_samples_leaf=10,
                                        n_estimators=120, n_jobs=-1,
                                        oob_score=True))])

Check results on the validation set

def results(name: str, model: BaseEstimator) -> None:
    '''
    Custom function to check model performance on validation set
    '''
    preds = model.predict(X_val)

    print(name + " score: %.3f" % model.score(X_val, y_val))
    print(classification_report(y_val, preds))
    labels = ['Male', 'Female']

    conf_matrix = confusion_matrix(y_val, preds)
    plt.figure(figsize= (10,6))
    sns.heatmap(conf_matrix, xticklabels=labels, yticklabels=labels, annot=True, fmt="d", cmap='Blues')
    plt.title("Confusion Matrix for " + name)
    plt.ylabel('True Class')
    plt.xlabel('Predicted Class')
results("SVC" , svc_clf)
results("Logistic Regression" , log_clf)
results("Naive Bayes" , nb_clf)
results("Random Forest" , rf_clf)
SVC score: 0.787
              precision    recall  f1-score   support

           0       0.77      0.82      0.79       190
           1       0.80      0.76      0.78       190

    accuracy                           0.79       380
   macro avg       0.79      0.79      0.79       380
weighted avg       0.79      0.79      0.79       380

Logistic Regression score: 0.768
              precision    recall  f1-score   support

           0       0.77      0.77      0.77       190
           1       0.77      0.76      0.77       190

    accuracy                           0.77       380
   macro avg       0.77      0.77      0.77       380
weighted avg       0.77      0.77      0.77       380

Naive Bayes score: 0.761
              precision    recall  f1-score   support

           0       0.74      0.80      0.77       190
           1       0.78      0.72      0.75       190

    accuracy                           0.76       380
   macro avg       0.76      0.76      0.76       380
weighted avg       0.76      0.76      0.76       380

Random Forest score: 0.721
              precision    recall  f1-score   support

           0       0.72      0.72      0.72       190
           1       0.72      0.72      0.72       190

    accuracy                           0.72       380
   macro avg       0.72      0.72      0.72       380
weighted avg       0.72      0.72      0.72       380

png

png

png

png

We see that the Linear SVC performs the best classification, with an accuracy & F1 score of ~0.79!

From the confusion matrix, we can see that out of the 190 male characters in the validation set, the SVC model classified 155 correctly as male & the remaining 35 incorrectly as female. Similarly, out of the 190 female characters, 144 were classified correctly & 46 incorrectly.
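
As a quick arithmetic check, we can read the per-class recall straight off the confusion matrix (a sketch re-using the fitted svc_clf):

## Per-class recall = diagonal / row sums; expect roughly [0.82, 0.76]
cm = confusion_matrix(y_val, svc_clf.predict(X_val))
print(cm.diagonal() / cm.sum(axis=1))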

Logistic Regression & Naive Bayes are close behind at 77% & 76% accuracy respectively. These results are not close to state of the art, but they are still pretty good.

Let’s now explore which features contribute the most to our classifiers’ performance, using some model explainability techniques.

Feature importance

Creating a list of all features including numeric, categorical & vectorised features.

vect_columns = list(svc_clf.named_steps['preprocessor'].named_transformers_['vec'].named_steps['tf'].get_feature_names())
onehot_columns = list(svc_clf.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot'].get_feature_names(input_features=categorical_features))
numeric_features_list = list(numeric_features)
numeric_features_list.extend(onehot_columns)
numeric_features_list.extend(vect_columns)

Feature importance for Logistic Regression

lr_weights = eli5.explain_weights_df(log_clf.named_steps['classifier'], top=30, feature_names=numeric_features_list)
lr_weights.head(15)
target feature weight
0 1 oh 2.914220
1 1 really 1.598428
2 1 love 1.582584
3 1 hi 1.173297
4 1 said 1.161967
5 1 want 1.053116
6 1 like 1.020965
7 1 darling 0.992410
8 1 never 0.987203
9 1 child 0.975300
10 1 please 0.970647
11 1 god 0.941234
12 1 know 0.913415
13 1 honey 0.903731
14 1 school 0.897518
lr_weights.tail(14)
target feature weight
16 1 son -0.876330
17 1 good -0.897884
18 1 right -0.899643
19 1 chId_count -0.916359
20 1 fuck -0.996575
21 1 fuckin -1.049543
22 1 yeah -1.091158
23 1 hell -1.137874
24 1 got -1.162604
25 1 shit -1.162634
26 1 sir -1.201654
27 1 gotta -1.246364
28 1 hey -1.549047
29 1 man -2.188670

We see that dialogue keywords like oh, love, like, darling, want & honey are strong indicators that a character is female, while keywords like son, sir, man, hell, gotta, yeah & most cuss words are far more common in the dialogues of male characters in these Hollywood movies!

Let’s also try to visualize a single decision tree

We can train a single decision tree by fitting a Random Forest with just one estimator.

m = RandomForestClassifier(n_estimators=1, min_samples_leaf=5, max_depth = 3, 
                           oob_score=True, random_state=123)  ## np.random.seed(123) returns None, so pass the seed directly
dt_clf = Pipeline(steps=[('preprocessor', preprocessor),
                         ('classifier', m)])

dt_clf.fit(X_train, y_train)
results("Decision Tree Classifier", dt_clf)
Decision Tree Classifier score: 0.526
              precision    recall  f1-score   support

           0       0.54      0.37      0.44       190
           1       0.52      0.68      0.59       190

    accuracy                           0.53       380
   macro avg       0.53      0.53      0.51       380
weighted avg       0.53      0.53      0.51       380

png

While a single decision tree is a poor classifier, with accuracy barely above 50%, bagging enough of these weak learners into a Random Forest improves performance drastically!
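
The already-imported BaggingClassifier expresses this idea directly. A sketch, assuming the same preprocessor as above (not fitted in the original notebook):

from sklearn.tree import DecisionTreeClassifier

## Bag 120 shallow trees, each as weak as the single tree visualized below
bag_clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', BaggingClassifier(
                              base_estimator=DecisionTreeClassifier(max_depth=3, min_samples_leaf=5),
                              n_estimators=120))])
## bag_clf.fit(X_train, y_train); results("Bagged Trees", bag_clf)

Now, let’s look at how the splits are made for a single decision tree.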

'''
def draw_tree(t, df, size=10, ratio=0.6, precision=0):
    """ 
    Draws a representation of a decision tree in IPython. Source : fastai v0.7
    Have commented the function definition here due to a Jekyll build error related to Liquid objects.
    """
    s=export_graphviz(t, out_file=None, feature_names=numeric_features_list, filled=True,
                      special_characters=True, rotate=True, precision=precision, 
                      proportion=True, class_names = ["male", "female"], impurity = False)
    IPython.display.display(graphviz.Source(re.sub('Tree {',
       f'Tree size={size}; ratio={ratio}', s)))
'''

draw_tree(m.estimators_[0], X_train, precision=2)

svg

Here, the blue coloured nodes have a majority of female samples, while the orange coloured nodes have a majority of male samples. The tree starts with a mixed sample at the root, but its leaves are biased towards one class or the other. Most splits happen on dialogue tokens. For eg., in the above tree, if the tf-idf value of the keyword think is > 0.1 & that of kid is > 0.03, the samples are classified as female.

Feature importance for the Random Forest model

eli5.explain_weights_df(rf_clf.named_steps['classifier'], top=30, feature_names=numeric_features_list)
feature weight std
0 oh 0.077328 0.027570
1 man 0.040615 0.030255
2 love 0.022664 0.026011
3 shit 0.019557 0.027031
4 said 0.017135 0.018359
5 lineLength_median 0.015757 0.017705
6 got 0.013712 0.018381
7 really 0.013227 0.018366
8 hey 0.012435 0.019035
9 good 0.012253 0.017000
10 look 0.011174 0.016425
11 right 0.009775 0.012608
12 sir 0.009673 0.018090
13 think 0.009661 0.014078
14 know 0.009660 0.012629
15 em 0.008794 0.019125
16 like 0.008730 0.011206
17 understand 0.008263 0.013995
18 want 0.008112 0.012310
19 yeah 0.007471 0.014095
20 get 0.007464 0.010990
21 would 0.007399 0.010665
22 chId_count 0.007074 0.010570
23 come 0.006845 0.010507
24 god 0.006782 0.011838
25 releaseYear 0.006504 0.009230
26 hi 0.006305 0.016955
27 one 0.006265 0.009646
28 gotta 0.006075 0.014320
29 child 0.005858 0.013265

We see that the median length of a dialogue, the total number of lines (chId_count) & the movie release year are important features for the Random Forest model, alongside the tokens extracted from the characters’ dialogues!
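
Impurity-based importances can over-weight high-cardinality features, so permutation importance on the validation set makes a useful cross-check. A sketch (not part of the original run):

from sklearn.inspection import permutation_importance

## Shuffle each raw input column of X_val in turn & measure the drop in accuracy
perm = permutation_importance(rf_clf, X_val, y_val, n_repeats=5, random_state=10)
for idx in perm.importances_mean.argsort()[::-1]:
    print(X_val.columns[idx], perm.importances_mean[idx])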

Next Steps

Some possible ways to further improve the classifier’s performance could be:

  • using bi-grams or tri-grams for the dialogue tokens (see the sketch after this list)
  • adding features based on sentiments extracted from the dialogues
  • adding a feature that measures the level of objectivity or subjectivity of a dialogue
  • hyper-parameter tuning of our models
  • trying out XGBoost or neural network models
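
The n-gram idea, for instance, is a one-line change to the TfidfVectorizer step. A sketch, assuming the same Converter & pipeline structure as above:

vectorizer_transformer_ngram = Pipeline(steps=[
    ('con', Converter()),
    ('tf', TfidfVectorizer(ngram_range=(1, 2), min_df=2))])  ## unigrams + bi-grams; min_df trims rare n-grams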

Still, our current best model (Linear SVC) correctly classifies roughly 4 out of 5 movie characters (79% accuracy) using the dialogues they speak and some movie metadata like the release year & the character’s position in the credits. We can safely say that our model is able to capture gender-specific bias in the characters of these Hollywood movies.

If you would like to play around with the code, the complete Jupyter notebook is available here on Kaggle.