Get ready for the third act of this walkthrough for Codecademy’s Data Science career path.
We will apply machine learning algorithms to the OkCupid Date a Scientist dataset to complete the machine learning portfolio project.

Photo by Önder Örtel from Pixabay

Data are provided by Codecademy as part of the project in a file named profiles.csv.
The dataset provided has the following columns of multiple-choice data:

  • age: continuous variable of age of user
  • body_type: categorical variable of body type of user
  • diet: categorical variable of dietary information
  • drinks: categorical variable of alcohol consumption
  • drugs: categorical variable of drug usage
  • education: categorical variable of educational attainment
  • ethnicity: categorical variable of ethnic backgrounds
  • height: continuous variable of height of user
  • income: continuous variable of income of user
  • job: categorical variable of employment description
  • offspring: categorical variable of children status
  • orientation: categorical variable of sexual orientation
  • pets: categorical variable of pet preferences
  • religion: categorical variable of religious background
  • sex: categorical variable of gender
  • sign: categorical variable of astrological symbol
  • smokes: categorical variable of smoking consumption
  • speaks: categorical variable of language spoken
  • status: categorical variable of relationship status
  • last_online: date variable of last login
  • location: categorical variable of user locations

And a set of open short-answer responses to the following prompts:

  • essay0: My self-summary
  • essay1: What I’m doing with my life
  • essay2: I’m really good at…
  • essay3: The first thing people usually notice about me…
  • essay4: Favorite books, movies, shows, music, and food
  • essay5: The six things I could never do without
  • essay6: I spend a lot of time thinking about…
  • essay7: On a typical Friday night I am…
  • essay8: The most private thing I am willing to admit
  • essay9: You should message me if…

I already have a faint idea of what I’m interested in: I would like to predict users’ ages by using the essays available on their profiles.
For this reason I will focus the exploratory data analysis on tidying users’ ages and essays.

As usual, we start by importing all the relevant packages.

Let’s now load the OkCupid user profiles and assign them to a pandas DataFrame.
The first row of the profiles table is shown below.


We may improve profiles by renaming the essay columns: names related to the subject of each essay would be more intuitive.

In addition, some HTML characters are present inside the texts: we can remove them with a RegEx.
I noticed that the RegEx expression I chose is sensitive to newline characters, which may prevent HTML tag removal if a \n is inside the tag itself. Let's remove them before working on those pesky HTML tags!
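A minimal sketch of both steps on a toy frame (the `friday_night` name and the `<.*?>` pattern are my assumptions; the real frame has columns essay0 through essay9):

```python
import pandas as pd

# Toy stand-in for profiles.csv; the real frame has essay0..essay9
profiles = pd.DataFrame({
    "essay7": ["out with friends<br />\nwatching a movie", "<i>reading</i> at home"]
})

# Hypothetical, subject-related name for the Friday-night essay
profiles = profiles.rename(columns={"essay7": "friday_night"})

# Remove newlines first: '.' does not match '\n', so a newline inside a
# tag would stop the '<.*?>' pattern from matching the whole tag
profiles["friday_night"] = (
    profiles["friday_night"]
    .str.replace("\n", " ", regex=False)
    .str.replace(r"<.*?>", "", regex=True)
)
print(profiles["friday_night"].tolist())
```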


Texts are good to go now.
This dataframe includes 59946 users, a considerable amount.

Before moving to the age distribution, let’s see if we have missing values in the essays.
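On a toy frame, the per-column count of missing essays can be checked like this (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame: np.nan marks an essay a user left blank
profiles = pd.DataFrame({
    "self_summary": ["hello", np.nan, "hi"],
    "friday_night": [np.nan, np.nan, "movies"],
})
print(profiles.isna().sum())
```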


Now we know that it is better to remove NaNs before training a model on this dataset:
some people did not complete all the essays.
We will complete this step in Word Vectors and Age Labels.

Texts are finally clean, so let’s close this exploratory analysis by taking a look at the age distribution.

MEDIAN: 30.0

The median age is 30 and there is a peak around 25. We also see that the distribution is right skewed: in our dataset only a small number of people are more than 40 years old.

Looking at the far right of the distribution plot, it seems that we have some people over 100 years old. This is unexpected!


Since these profiles are not meaningful, we can remove them.
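The filter can be sketched on toy ages (the 100-year cutoff follows the outliers noted above):

```python
import pandas as pd

profiles = pd.DataFrame({"age": [25, 30, 42, 110]})  # toy ages
profiles = profiles[profiles["age"] <= 100]          # drop implausible ages
print(f"MEDIAN: {profiles['age'].median()}")
```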

MEDIAN: 30.0

I doubt that an OkCupid profile is accepted without the age field properly filled in, but it is worth a check.


As expected, there is no need to handle missing ages.

Writing skills vary with age: we change our style, we express different interests, and we typically use different expressions as we grow older.
Is a machine learning model able to recognize these age-related differences and classify the age of each user?

Just to keep it simple, I will apply my model to only one essay.
The basic assumption is that young and old people spend their Friday evenings in different ways, and the texts should reflect this difference.
For this reason, I will focus my analysis on essay n. 7:

On a typical Friday night I am…

My goal is to assess if this approach could work: take this as a first attempt rather than as an extended analysis.

The spaCy package gives access to pre-trained NLP models. We will use the en_core_web_lg model to transform the words inside each text into numerical vectors.

Note: do not forget to download en_core_web_lg before proceeding; see the instructions in the GitHub repository (link at the end of the story).

After loading the model, the Friday essays are isolated and rows with missing data are removed.
Finally, the word vectors are computed (be patient, this may take some minutes).

Each essay is now described by an average word vector.
We will find out whether this description is sufficient for correctly classifying users’ ages.
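For models with static vectors, spaCy’s doc.vector is the average of the token vectors; a toy sketch of that averaging, with hypothetical 3-dimensional vectors in place of the model’s 300:

```python
import numpy as np

# Toy word vectors (hypothetical values, 3 dims instead of 300)
word_vectors = {
    "friday": np.array([0.2, 0.4, -0.1]),
    "night": np.array([0.0, 0.6, 0.3]),
    "movies": np.array([0.4, -0.2, 0.1]),
}
essay = "friday night movies"

# The essay vector is the element-wise mean of its word vectors
essay_vector = np.mean([word_vectors[w] for w in essay.split()], axis=0)
print(essay_vector)
```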

An age_label is created as target for classification:

  • if age is less than 45, the class is young
  • if age is 45 or over, the class is old

I do not think that all people over 45 years of age are old in a real sense. No judgment here: the threshold is set to be as close as possible to common sense and to include a meaningful number of individuals in each class.
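The labeling can be sketched with np.where on toy ages (the frame name is an assumption):

```python
import numpy as np
import pandas as pd

friday = pd.DataFrame({"age": [22, 30, 47, 51]})  # toy ages

# Binary target: 'young' below 45, 'old' otherwise
friday["age_label"] = np.where(friday["age"] < 45, "young", "old")
print(friday["age_label"].value_counts())
```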


From the last table we can see that the dataset is strongly imbalanced toward young people; I would expect this to be a problem for the machine learning algorithm.

We are ready to build a modelization pipeline. It includes the following steps:

  • scaler: a StandardScaler for removing scale effects
  • analyzer: a PCA (principal component analysis) to reduce features dimensionality
  • classifier: a LogisticRegression model for associating vectors to age labels

I have selected LogisticRegression because it is fast to train.
The StandardScaler step seems to improve predictions on the test set (I have done some quick comparisons; this may be worth a dedicated investigation).
I then decided to add the PCA to reduce the dimensionality of the vectors built by the large spaCy model (en_core_web_lg): they are 300 elements long!
The PCA is set to keep only the principal components that explain the majority of the variance (it is tuned by GridSearchCV).
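The three-step pipeline above can be sketched with scikit-learn’s Pipeline (max_iter is my addition to avoid convergence warnings):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),       # remove scale effects
    ("analyzer", PCA()),                # n_components tuned later by GridSearchCV
    ("classifier", LogisticRegression(max_iter=1000)),
])
```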

Before training the pipeline, one third of the dataset is held out for testing the model later on (with train_test_split).

GridSearchCV will train several LogisticRegression classifiers on different sub-datasets and with different hyperparameters, in order to find the best one according to our scoring metric.
I selected n_components (from PCA), C and class_weight (from LogisticRegression) as the hyperparameters to be optimized (see the Scikit-learn documentation for details).

scoring is set to f1_weighted instead of the default accuracy because we have an imbalanced dataset, where young people are dominant. A simple accuracy would reward models that often output a young classification, even when it is wrong.
A trivial model that always predicts young would probably score a good accuracy (see the Conclusions section).
We want to penalize misclassification instead, and the scoring metric we selected attempts to reach that goal (see the Scikit-learn documentation and this tour of evaluation metrics).

Tuning the class_weight hyperparameter serves the same objective: it assigns different weights to errors on label predictions. We are thus emphasizing the importance of the old class.
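A self-contained sketch of the search on random stand-in data: 0.86, 0.25 and {'young': 1, 'old': 3} are the winning values reported later in the article, the other grid entries are illustrative alternatives.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy features standing in for the 300-dim word vectors
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = np.array(["young"] * 80 + ["old"] * 40)

# Hold out one third of the data for the final test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0, stratify=y
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("analyzer", PCA()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "analyzer__n_components": [0.80, 0.86],
    "classifier__C": [0.25, 1.0],
    "classifier__class_weight": [{"young": 1, "old": 3}, None],
}
search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```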

Time to train!
Please be patient, this will take a couple of minutes…

Best parameter (CV score=0.847):
{'analyzer__n_components': 0.86,
'classifier__C': 0.25,
'classifier__class_weight': {'young': 1, 'old': 3}}

The score seems good, we will see how good after the testing phase.
We can see the effects of PCA dimensionality reduction with the following code.
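On random stand-in data, the retained dimensionality can be read off a fitted PCA (a sketch; with the real word vectors the article reports 300 dimensions reduced to 89):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 300))  # stand-in for the 300-dim word vectors

pca = PCA(n_components=0.86).fit(X)  # keep 86% of the variance, as tuned
print("Dimensions before PCA:", pca.n_features_in_)
print("Dimensions after PCA:", pca.n_components_)
```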

Dimensions before PCA:	300
Dimensions after PCA: 89

Do you remember the sub-set we have kept for testing the model?
It’s time to use it!
Score and confusion matrix are shown below.
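The metric calls can be sketched on hypothetical predictions (toy labels for illustration, not the article’s actual results):

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical true labels and predictions
y_test = ["young", "young", "old", "old", "young"]
y_pred = ["young", "young", "young", "old", "young"]

print("Test score:", round(f1_score(y_test, y_pred, average="weighted"), 3))
# Rows: true label, columns: predicted label, in the order given
print(confusion_matrix(y_test, y_pred, labels=["young", "old"]))
```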

Test score: 0.847

As we expected, the imbalance of the dataset is problematic: the model tends to label old people as young.
We also see that young people are sometimes labeled as old.

The classifier matches the dataset, at least partially.
It also seems not completely able to tell apart essays written by young and old people: probably, the vectorized descriptions of some essays are really similar between the two classes. Stylistic differences in the written texts of the two groups may be subtle in some cases, making it harder for this model to detect them.
It may also happen that some sentences are used frequently by both classes because they are generic and not age-related. This may lead to misclassifications too.

But is our classifier better than a trivial one, a model that always predicts the most frequent label, young (a baseline called Zero Rule or ZeroR)?

Let’s use Scikit-learn’s DummyClassifier to output the most frequent label and evaluate the weighted f1 score of this baseline on the test set.
We could achieve the same result by passing f1_score a simple list where each item is the 'young' string, instead of using the DummyClassifier.predict method.
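A sketch of the ZeroR baseline with DummyClassifier (toy labels again, not the real test set):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Toy imbalanced labels; 'young' is the majority class
y_train = ["young"] * 8 + ["old"] * 2
y_test = ["young", "old", "young", "young"]

# Features are irrelevant for the most_frequent strategy
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit([[0]] * len(y_train), y_train)
y_zeror = dummy.predict([[0]] * len(y_test))

print("ZeroR f1 score:", round(f1_score(y_test, y_zeror, average="weighted"), 3))
```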

ZeroR f1 score: 0.836

The model we trained achieved a slightly higher score of 0.847, a small improvement over the Zero Rule.

At first I experimented with a couple of different classifiers and several scoring functions before adopting the simple pipeline described above.
I wanted to test whether the idea of classifying the author based on age-related textual differences could work.
For this reason I selected a classification algorithm that is fast and simple, and then played with scoring functions and hyperparameters to avoid biases toward the young class.

Since there is a small improvement in comparison to the Zero Rule, it may be worth investigating different options:

  • Experimenting with different classifiers (e.g. kNN or other more time-intensive classifiers).
  • Using a different scoring metric (I also tested balanced_accuracy, which seemed to predict more old labels correctly but with higher misclassification on the young ones).
  • Adding essays or features to help the model in differentiating the two classes.
  • Adding a middle class with an intermediate age range.
  • Using oversampling or undersampling (bootstrapping) to remove class imbalances (the limits of this approach are explained here).

The github repository for this project is available here.

Data adventurer in disguise.