CODECADEMY WALKTHROUGH: FINAL PORTFOLIO PROJECT

RUN TO THE HILLS

Michele Alberti
May 16, 2021

The final project is here!
This is my version of the final portfolio project for Codecademy’s data scientist path. Unlike previous projects, here I had to select the subject, search for a dataset and define an objective to reach.
So much freedom can be disorienting and I was short on inspiration.

During the last year I used running workouts to escape monotony, relax my mind and widen my vision.
And so I did this time…

Unsurprisingly it worked: running is not only the means I used to get inspired, but part of the solution itself.

Photo by David Mark from Pixabay

I’ve been working on predictive maintenance lately: this field of industrial engineering uses machine learning models to estimate equipment’s normal behavior and detect signs that may anticipate failures.
Yes, I could search for specialized datasets to apply those methods to industrial equipment. But why work on the same subjects I deal with during my normal working hours?

I was in the middle of my usual track. I could hear the sound of my feet on the ground and the dull thudding of my own heart.
The human body has some similarities with machines: every part of our organism has to work properly if we want to achieve our best performance. Like machines, it is possible to relate some biometrics to the way our body reacts to external factors.
… Maybe I can apply predictive maintenance to myself, literally!

Finally I had an idea to work with, but I still had to collect some data… suddenly my watch happily beeped as I reached the 5th km of my run. Can I export Apple’s Health Kit data to Python?

I bet you guessed that the answer is yes.

Exporting data from Apple Health Kit

Instructions for downloading your Health Kit data in XML format are available on Apple’s support site.
The Health app exports a zip archive containing several files. Inside the project repository (link at the end of this article) there is a notebook named Import Apple Health.ipynb.
This notebook focuses on converting the readings stored in export.xml (found inside the zip archive) into Python objects. Conversion from XML is not straightforward; it is something I had to figure out.
You can convert your own data with Import Apple Health.ipynb, which will save those objects to the HKdata.pickle file.
A Python module named healthkit.py defines the classes used for converting the XML, along with several helper functions. If you want to understand how I handled and cleaned Health Kit data, or if you are interested in exploring this dataset (as I did during my exploratory analysis, which is not described in detail here), you should start by reading Import Apple Health.ipynb and healthkit.py.

Final Portfolio Project.ipynb (the notebook that follows the steps described in this story) will load HKdata.pickle directly: I will not show in detail how Import Apple Health.ipynb works, but I will reference some cleaning steps carried out there.

Since Import Apple Health.ipynb is meant to be a generic tool, it loads every available workout from the Health Kit database, not only running activities. You can use it to further explore what’s available in Health Kit.

Note: HKdata.pickle is larger than the 100 MB limit imposed by GitHub. For this reason you have to unzip HKdata.zip before moving on.
If you want to use data from your own Health Kit, check the repository read-me.

Import statements

We start by importing the packages we need.

Note: if you want to export plotly figures to chart-studio you can set export_to_chart_studio to True. Before doing so you should sign up for a free chart-studio account and properly set your credentials (note that chart-studio is already included in the CC-DSpathFP39 environment).
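As a sketch, the import cell probably looks something like this (the exact list lives in Final Portfolio Project.ipynb; the plotly imports are commented out here so the snippet stays self-contained):

```python
# A plausible import cell for this project; the exact list lives in
# Final Portfolio Project.ipynb. export_to_chart_studio mirrors the flag
# described in the note above.
import copy
import pickle

import numpy as np
import pandas as pd
# import plotly.graph_objects as go   # interactive figures
# import chart_studio.plotly as py    # optional chart-studio export

# Set to True only after configuring chart-studio credentials.
export_to_chart_studio = False
```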

Load Apple Health Kit data

Data processed by Import Apple Health.ipynb are loaded by Final Portfolio Project.ipynb through pickle.load.
The file HKdata.pickle contains a list of workouts. Each item of the list is a dictionary with the following keys:

workout — workout object with the following properties:

  • activity_type
  • duration
  • duration_unit
  • total_distance
  • total_distance_unit
  • total_energy_burned
  • total_energy_burned_unit
  • source_name
  • source_version
  • device
  • creation_date
  • start_date
  • end_date

records — dictionary of record objects that fall between the workout start and end dates.
Dictionary keys are record types (e.g. running distance). Each record has the following properties:

  • rec_type
  • source_name
  • source_version
  • device
  • unit
  • creation_date
  • start_date
  • end_date

timeseries — similarly to records, it is a dictionary of pandas time series that fall between workout start and end dates.
Once again each key of the dictionary is a record type (e.g. running distance). The difference with records is that duplicated timestamps are removed and only the data sources selected in Import Apple Health.ipynb (e.g. iPhone, Apple Watch, Nike Run Club) are kept.

units — a dictionary of units of measurement, each key is a record type.
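To illustrate this structure, here is a hypothetical sketch: a mock workout item with the four keys above, round-tripped through pickle in memory. The real notebook reads HKdata.pickle from disk instead, and the values are real workout/record objects rather than the placeholder dictionaries used here.

```python
import io
import pickle

# Mock item mirroring the structure described above; the values are
# placeholders for illustration only.
mock_item = {
    "workout": {"activity_type": "HKWorkoutActivityTypeRunning"},
    "records": {},
    "timeseries": {},
    "units": {"HKQuantityTypeIdentifierHeartRate": "count/min"},
}

# The notebook does the equivalent of:
#   with open("HKdata.pickle", "rb") as f:
#       HK_data = pickle.load(f)
# Here we round-trip a mock list through an in-memory buffer instead.
buf = io.BytesIO()
pickle.dump([mock_item], buf)
buf.seek(0)
HK_data = pickle.load(buf)

print(sorted(HK_data[0].keys()))  # ['records', 'timeseries', 'units', 'workout']
```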

We are interested in running sessions, so only objects that belong to the HKWorkoutActivityTypeRunning type are stored in a dedicated variable called HK_run.
The type lists for workouts and records are then displayed: as expected, only running workouts are available after this step.
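The filtering step can be sketched as follows, with workouts represented as plain dictionaries for brevity (the real items wrap workout objects, so the attribute access differs):

```python
# Keep only running workouts. The activity type string is the official
# Health Kit identifier; the sample data is invented.
HK_data = [
    {"workout": {"activity_type": "HKWorkoutActivityTypeRunning"}},
    {"workout": {"activity_type": "HKWorkoutActivityTypeWalking"}},
    {"workout": {"activity_type": "HKWorkoutActivityTypeRunning"}},
]

HK_run = [
    item for item in HK_data
    if item["workout"]["activity_type"] == "HKWorkoutActivityTypeRunning"
]
print(len(HK_run))  # 2
```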


Features and target preparation

Before training a model we need to check data integrity and perform some preparatory actions.

Incomplete workouts

For some unknown reason, my watch did not properly record the heart rate during some running sessions.
Let’s find out which items have no associated heart rate time series (the available time series for each item are listed below):

Available time series for workout sessions with missing heart rate data:

ITEM 0: Step Count, Active Energy Burned,
Distance Walking Running, Flights Climbed
ITEM 1: Step Count, Active Energy Burned,
Distance Walking Running, Flights Climbed
ITEM 2: Step Count, Active Energy Burned,
Distance Walking Running, Flights Climbed
ITEM 3: Active Energy Burned, Basal Energy Burned,
Flights Climbed, Distance Walking Running, Step Count
ITEM 28: Active Energy Burned, Basal Energy Burned,
Apple Stand Time, Distance Walking Running,
Apple Exercise Time, Step Count

Heart rate is missing from these items!
During the data cleaning step we will remove every workout that does not have all the required time series.
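The check itself can be sketched like this, assuming the 'timeseries' dictionary uses Health Kit type names as keys (the sample data is invented):

```python
# Find workouts whose 'timeseries' dictionary lacks a heart rate entry.
HR_KEY = "HKQuantityTypeIdentifierHeartRate"

workouts = [
    {"timeseries": {HR_KEY: [88, 90], "StepCount": [10, 12]}},
    {"timeseries": {"StepCount": [11, 13]}},  # heart rate missing
]

missing = [i for i, w in enumerate(workouts) if HR_KEY not in w["timeseries"]]
print(missing)  # [1]
```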

Data cleaning

After defining the selected feature and target variables, unneeded time series are removed from HK_run. A clean copy of HK_run is stored in model_data.
Assigning a compound object (e.g. a list of dictionaries) with the = operator does not create a copy in Python: both names reference the same object.
After model_data = HK_run, the following operations would also affect HK_run (you can try it by removing the copy.deepcopy statement). We need a deep copy of HK_run because we don't want to alter it when popping unused time series.
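A tiny demonstration of the difference (the data is invented; the point is only the copy semantics):

```python
import copy

HK_run = [{"timeseries": {"HeartRate": [88.0], "StepCount": [10]}}]

shallow = HK_run              # plain assignment: same underlying objects
deep = copy.deepcopy(HK_run)  # fully independent copy

# Popping from the deep copy leaves the original untouched; popping from
# `shallow` would have removed the series from HK_run as well.
deep[0]["timeseries"].pop("StepCount")
print("StepCount" in HK_run[0]["timeseries"])  # True: HK_run is untouched
```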

As anticipated we would like to predict heart rate by means of the distance covered during the running session. We will see that some useful information can be derived from records of type HKQuantityTypeIdentifierDistanceWalkingRunning.

Available workouts after data cleaning process: 56

Let’s see how those time series look!
The following code plots time series from the first workout and, since we are leveraging plotly, every figure is completely interactive.

print_selected_workout is a function that takes care of formatting and displaying plotly figures; we will use it again.

Plotly’s zoom feature lets us see that each time series has different timesteps:
we have to resample these time series onto a common time vector.

Duplicated timestamps and anomalies in time aggregation

When I started playing with those time series I noticed, with disappointment, that some timestamps are duplicated. Why in the name of Jupyter?

I then discovered that each application (and/or wearable) stores its records independently. This may result in the same information being recorded twice.
If we dig into the heart rate records for the first workout, we find that the first duplicated timestamp is:

First duplicated heart rate record: 2020-06-05T18:09:37+02:00

Now we can print those records to see what’s happening (print uses the __str__ method from the HKRecord class; see the healthkit module).

RECORD - 2020-06-05 18:09:38+02:00

HKQuantityTypeIdentifierHeartRate from Apple Watch di Michele
====================================================================

From 2020-06-05 18:09:37+02:00 to 2020-06-05 18:09:37+02:00

Value: 88.000 count/min



RECORD - 2020-06-05 19:05:05+02:00

HKQuantityTypeIdentifierHeartRate from Nike Run Club
====================================================================

From 2020-06-05 18:09:37+02:00 to 2020-06-05 18:09:37+02:00

Value: 88.000 count/min

In this case my Apple Watch and Nike Run Club saved the same record twice.
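With a pandas index the same check can be sketched like this (timestamps and values are invented, but mirror the duplicated record above):

```python
import pandas as pd

# Heart rate readings with a duplicated timestamp, as happens when two
# sources store the same record independently.
idx = pd.to_datetime(
    ["2020-06-05 18:09:36", "2020-06-05 18:09:37", "2020-06-05 18:09:37"]
)
hr = pd.Series([86.0, 88.0, 88.0], index=idx)

dup_mask = hr.index.duplicated(keep="first")
print(hr.index[dup_mask][0])  # the first duplicated timestamp

# Keeping only the first occurrence removes the duplicate.
hr_clean = hr[~dup_mask]
print(len(hr_clean))  # 2
```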

Let’s explore a bit further: the following code shows running distances above 1.8 km for the first workout.
Why 1.8 km? You will see…
I caught what follows while developing the project, and then investigated further.

RECORD - 2020-06-05 18:28:04+02:00

HKQuantityTypeIdentifierDistanceWalkingRunning
from iPhone di Michele Alberti
====================================================================

From 2020-06-05 18:17:02+02:00 to 2020-06-05 18:27:02+02:00

Value: 2.151 km



RECORD - 2020-06-05 18:58:05+02:00

HKQuantityTypeIdentifierDistanceWalkingRunning
from iPhone di Michele Alberti
====================================================================

From 2020-06-05 18:47:02+02:00 to 2020-06-05 18:57:03+02:00

Value: 1.944 km

Do you notice the unusually long duration of each record (10 minutes)?
This is not the typical timestep from the Apple Watch, which in fact takes instantaneous readings.
Different sources (the iPhone in this case) aggregate records in different ways, generating the following anomalies in a mixed-source time series:

[Figure: step-like anomalies in a mixed-source running distance time series]

Those step-like jumps are data from the iPhone, and they match the sum of all previously recorded values from the Apple Watch.
They are aggregated readings.

You can recreate the figure above by changing the ts_source to None in Import Apple Health.ipynb.
Deactivating source selection when calling healthkit.build_workouts_timeseries will gather all records without removing differently aggregated sources.

Since I noticed these issues during my exploratory workflow, I decided to add some options to build_single_workout_timeseries (see healthkit.py) to remove those duplicated records and data sources from each time series at conversion time.
I'm showing here how I discovered these issues, but I will not do anything to the existing time series. Why?
Everything has already been properly managed by the healthkit.py conversion functions: no extra work for us...

Note: notice that I always used model_data[0]['records'] instead of model_data[0]['timeseries'] when showing those anomalies. The reason is that the 'timeseries' key contains pandas Series already cleaned by healthkit.py, while the 'records' key holds the original data.

Feature engineering

What is the best way to describe a body in motion?
Using just the distance we covered while running seems reductive.
Fundamentals of physics teach us that the main properties describing a body in motion are position, velocity and acceleration.
These physical quantities would give us insights on:

  • Distance from the starting point. The longer you run, the more tired you become.
  • Cruise velocity, which is connected to air resistance, persistence of motion and the energy stored in a moving object. The faster you run, the harder it gets to increase your velocity. At the same time, higher velocities mean that more kinetic energy is stored in our body: I bet you can imagine the difference between getting in the way of someone walking and a runner.
  • Acceleration, which is proportional to the force applied to change the state of a moving object. Acceleration is a good indicator of the effort you are undergoing, like when you are facing a hill, for example.

All these features could be useful in estimating the heart rate.

Position

We start by using a cumulative sum to transform the running distance into a position (distance from the starting point).

Velocity

Velocity is the first derivative of position. We have to calculate the time differential, while the position differential is already stored in HKQuantityTypeIdentifierDistanceWalkingRunning time series.

Acceleration

Acceleration is the second derivative of position or the first derivative of velocity.
In this case we shall calculate differentials for both time and velocity.

Then all values are stored inside the engineered_features dataframe.
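The three steps above can be sketched with pandas as follows; the distance series is invented (increments in km, as stored by HKQuantityTypeIdentifierDistanceWalkingRunning records):

```python
import pandas as pd

# Invented distance increments (km per record).
idx = pd.to_datetime(
    ["2020-06-05 18:00:00", "2020-06-05 18:00:10", "2020-06-05 18:00:20"]
)
distance = pd.Series([0.05, 0.06, 0.04], index=idx)

# Position: cumulative sum of the distance increments.
position = distance.cumsum()

# Velocity: position differential over time differential (km/s here).
dt = position.index.to_series().diff().dt.total_seconds()
velocity = distance / dt

# Acceleration: velocity differential over time differential.
acceleration = velocity.diff() / dt

engineered_features = pd.DataFrame(
    {"Position": position, "Velocity": velocity, "Acceleration": acceleration}
)
print(engineered_features)
```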

Resampling time series

We can now resample each time series, workout by workout.
For each workout in model_data we need to find the new time vector onto which we are resampling. We use the union method of the pandas index to reach this goal, then we apply the reindex method to each time series.
After this calculation all time series for a specific workout will share the same time vector, and missing values will be filled by interpolating between valid observations.

During the first iteration (hence for the first workout), print statements will display how time vectors change after the reindexing.

Note: inside the for-loop that reindexes the time series, engineered features (position, velocity and acceleration) are updated with a different statement than non-engineered ones (heart rate). The reindex algorithm applied to both groups is the same (interpolation), but you can apply a different algorithm to each group (like padding).
See the pandas documentation for reindex and interpolate.
There are three commented lines that reindex each time series by using the intersection between time indexes. Uncomment them to see the difference.
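A minimal sketch of the union-and-reindex step, using two invented series with different timesteps:

```python
import pandas as pd

t1 = pd.to_datetime(["18:00:00", "18:00:10", "18:00:20"])
t2 = pd.to_datetime(["18:00:05", "18:00:15"])

position = pd.Series([0.0, 1.0, 2.0], index=t1)
heart_rate = pd.Series([90.0, 100.0], index=t2)

# New time vector: the union of all time indexes.
new_index = position.index.union(heart_rate.index)

# Reindex each series and interpolate the gaps (time-weighted).
position_r = position.reindex(new_index).interpolate(method="time")
heart_rate_r = heart_rate.reindex(new_index).interpolate(method="time")

print(len(new_index))      # 5
print(position_r.iloc[1])  # 0.5, interpolated at 18:00:05
# Note: heart_rate_r keeps a leading NaN at 18:00:00 (no earlier value);
# such incomplete rows are dropped later on.
```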

*** BEFORE REINDEXING ***
Element in <Position> time vector: 1249
Element in <Velocity> time vector: 1249
Element in <Acceleration> time vector: 1249
Element in <Heart Rate> time vector: 655
====================================
Element in new time vector: 1651

*** AFTER REINDEXING ***
Element in <Position> reindexed time vector: 1651
Element in <Velocity> reindexed time vector: 1651
Element in <Acceleration> reindexed time vector: 1651
Element in <Heart Rate> reindexed time vector: 1651

These plots are really similar to the previous ones. This is a good result, since I do not expect the retiming to affect the overall shape of the time series. In fact we are just adding some data points without altering the trend.

Final data frame

Complete dataset

Now we finally convert the selected time series to a dataframe.
During my workouts I noticed that the Apple Watch may start recording the heart rate a bit later than other time series.
In addition, position and velocity are unknown during the first minutes: this is due to both data recording (the first value in the running distance time series is usually recorded later than the other time series) and feature engineering (pandas’ diff, reindex and interpolate create those NaN values).
We can drop incomplete rows to avoid problems later on.
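Dropping incomplete rows is a one-liner with pandas; the frame below is invented and mimics the NaN pattern described above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Position": [np.nan, 0.05, 0.11],
        "Velocity": [np.nan, 0.006, 0.004],
        "Acceleration": [np.nan, np.nan, -0.0002],
        "Heart Rate": [88.0, 90.0, 92.0],
    }
)

# Any row containing a NaN (e.g. the first samples, before every series
# has started recording) is discarded.
clean_df = df.dropna()
print(len(clean_df))  # 1
```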


Train-test split

We are ready to split the dataframe into train and test sets.
Let’s store the names of the X (features) and Y (targets) columns in dedicated arrays.
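A sketch of the split with scikit-learn: the column names match the FEATURES/TARGETS lists, while the synthetic data, the test fraction and the random seed are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataframe.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((100, 4)),
    columns=["Position", "Velocity", "Acceleration", "Heart Rate"],
)

X_cols = ["Position", "Velocity", "Acceleration"]
Y_cols = ["Heart Rate"]

X_train, X_test, y_train, y_test = train_test_split(
    df[X_cols], df[Y_cols], test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 80 20
```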

FEATURES:	['Position', 'Velocity', 'Acceleration']
TARGETS: ['Heart Rate']

Project scope

I would like to build a working analytic to be used as the calculation engine for a run-o-matic, the perfect companion for your running workouts. We need a solid model to reach this objective.
The idea is to find one that is suitable for predicting the physiological parameters of athletes (heart rate in our case) by looking at their performance (position, velocity and acceleration).
We will train several machine learning models to see which one gives the best result.
In a future project I could use that model to keep track of how I perform by looking at the model residuals.
A sudden worsening in terms of heart rate should be evident when compared to a model trained on my recent workouts.

For example, if in certain conditions the trained model predicts a far lower heart rate than the actual one, this could be related to:

  • model error that could be corrected with additional data/training
  • a real problem or an underperformance due to external causes

Model selection

Let’s train some models, but before starting…

Dataset reduction based on similarity

Since we will train models that do not scale well with a high number of records (SVR, for example), I decided to reduce the number of records we have to analyze.
The idea is to keep only non-similar records. But how?
Every row is a vector, and similarity is measurable as the distance between each pair of rows (this is in fact the distance between two vectors). One of the most famous similarity evaluation methods is cosine similarity, which considers two vectors similar if they point in the same direction.
Unlike cosine similarity, Euclidean distance also takes into account the difference in magnitude between vectors.
This means that with Euclidean distance a long running session and a short workout are different (at least to some extent), because one component of the vector (i.e. the walking/running distance) changes.
For cosine similarity the two might look the same if the vectors point in the same direction, even if their magnitudes differ.
We will calculate the pairwise Euclidean distance for each pair of records in our dataset, since I suspect that the magnitude of each vector is an important element when deciding whether a row has to be excluded from the trimmed dataset.
If the distance between two vectors is under a certain threshold only one element of the pair is kept.
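A greedy sketch of this reduction; the real notebook may organize the computation differently, and both the threshold and the data here are invented:

```python
import numpy as np

def reduce_by_distance(X, threshold):
    """Keep a row only if its Euclidean distance from every already-kept
    row is at least `threshold` (greedy pass over the rows)."""
    kept = []
    for row in X:
        if all(np.linalg.norm(row - k) >= threshold for k in kept):
            kept.append(row)
    return np.array(kept)

X = np.array([[0.0, 0.0], [0.05, 0.0], [1.0, 1.0], [1.02, 1.0]])
reduced = reduce_by_distance(X, threshold=0.1)
print(len(reduced))  # 2: the near-duplicates are dropped
```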

Length before removing similar vectors: 38260
Length after removing similar vectors: 7330

Now we are ready for training our models.

Ridge Regressor

A really simple pipeline should do:

  • scaler
  • regressor

For the following models we will then change only the regressor step (for this reason a support function named apply_grid_search is declared).
Let’s start with a simple linear model.

GridSearchCV will tune the model's hyperparameters (alpha).

Note: remember that sklearn optimization functions always maximize the score, so scores that need to be minimized are negated.
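The pipeline and grid search can be sketched like this; the alpha grid, the synthetic data and the negated-RMSE scoring string are assumptions consistent with the negative scores reported in this section:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scaler + regressor pipeline; only the regressor step changes for the
# other models in this section.
pipe = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])

# Hyperparameter grid for alpha (values are an assumption).
param_grid = {"regressor__alpha": [0.1, 1.0, 10.0, 100.0]}

# Negated RMSE: sklearn maximizes scores, so errors are negated.
search = GridSearchCV(pipe, param_grid,
                      scoring="neg_root_mean_squared_error", cv=5)

# Synthetic stand-in data with the same shape as our three features.
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = 100.0 + 50.0 * X[:, 0] + rng.normal(scale=5.0, size=100)

search.fit(X, y)
print(search.best_params_)
```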

Best parameter (CV score=-44.839):
{'regressor__alpha': 10.0}
============================================

SCORE ON TEST SET: -21.449

Let’s see if we can improve this result with other algorithms.

K-Nearest Neighbors (k-NN)

Same pipeline with a different regressor.

GridSearchCV will tune the model's hyperparameters (number of neighbors, the p index and weights).

Best parameter (CV score=-11.598):
{'regressor__n_neighbors': 4, 'regressor__p': 1, 'regressor__weights': 'uniform'}
============================================

SCORE ON TEST SET: -9.400

Support Vector Machine Regressor (SVR)

Once again, the hyperparameters tuned by GridSearchCV are different from the previous model's (C and epsilon).

Best parameter (CV score=-13.423):
{'regressor__C': 1000.0, 'regressor__epsilon': 10.0}
============================================

SCORE ON TEST SET: -9.757

Neural Network Multi-layer Perceptron (MLP)

My first neural network… I was so tempted to try it.

GridSearchCV could tune a lot of parameters here. Since we do not want to wait forever, the param_grid will focus on a small group.

Best parameter (CV score=-15.172):
{'regressor__alpha': 0.01, 'regressor__beta_1': 0.9, 'regressor__beta_2': 0.1, 'regressor__hidden_layer_sizes': 150}
============================================

SCORE ON TEST SET: -10.197

Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.

I know, it could be tuned better than that (I intentionally left the warning message).
But, hey, we have a nice k-NN model to work with!

Finalize the best model

So, k-NN is the winner!
We stored the full train set in full_train_df: since the k-NN regressor is pretty fast to train, we can finalize it by using the full training set.

Best parameter (CV score=-8.002):
{'regressor__n_neighbors': 5, 'regressor__p': 1, 'regressor__weights': 'uniform'}
============================================

SCORE ON TEST SET: -7.771

We improved our k-NN model a bit more (both in terms of score and number of training elements).

Results

Now we are ready to display the residual distribution plot of test and training sets.

The majority of residuals are between +20 and -20 count/min.

The following code builds an interactive widget that plots actual values, estimates and residuals on plotly figures (you should open the notebook to play with it). You can select the workout by changing the Workout index value.
Results are good, even if some workouts are not as well represented as others.

Here the first workout is shown.

Conclusions

Good news: we built a model that seems to work fairly well!
Using position, velocity and acceleration to describe the heart rate shows good results: residuals are typically in a range between +20 and -20 count/min.
Heart rate is between 160 and 180 count/min, as shown in the following figure.

It means that residuals are roughly 10% of the typical heart rate value we see during a running session.
I find this a good result.

If you explore each workout by means of the interactive widget, you should notice that heart rate trends in workouts are well represented, with only some rare exceptions.
I frequently run on two distinct loops, but in 2020 I also ran some trails in different locations, so the dataset is diversified to a certain extent.

Next steps

k-NN model

Introducing running sessions from different places would be useful to obtain a model that is not track-specific. I’m not worried about the fact that the model describes my own running performance, because this is part of the project’s scope:
it has to be tuned on a particular runner to be useful in highlighting variations in performance.
I would be curious to see it trained and tested by other runners.

I experimented with two scoring metrics: root-mean-square error (RMSE) and the coefficient of determination (R²). Since RMSE penalizes big errors, I decided to use it (you don’t want a misleading heart rate estimate that is way higher or lower than it should be).
However, it may be worth testing some additional scores.
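To illustrate the choice: a single large error inflates RMSE much more than several small ones do (the values below are invented).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual = np.array([160.0, 165.0, 170.0, 175.0])
pred_small_errors = np.array([162.0, 163.0, 172.0, 173.0])   # four 2-bpm errors
pred_one_big_error = np.array([160.0, 165.0, 170.0, 195.0])  # one 20-bpm error

rmse_small = np.sqrt(mean_squared_error(actual, pred_small_errors))
rmse_big = np.sqrt(mean_squared_error(actual, pred_one_big_error))
print(rmse_small, rmse_big)  # 2.0 10.0
```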

Some models (like SVR) seem promising, but I didn’t play with them too much, mainly because k-NN is simple, reasonably fast to train (at least for this application) and gave good results.
Maybe exploring those models a bit further could lead to a better overall description than the one resulting from k-NN.

I adopted a StandardScaler transform in my training pipeline because some models require the data to be standardized.
k-NN is not among them: I think MinMaxScaler is also worth a try if k-NN is selected as the describing model.

Overall analysis

It would be possible to set thresholds on residuals, based on 10% of the typical heart rate value (e.g. ±18 count/min).
When the residual (which is the difference between the actual value and the estimate) stays continuously above or below those thresholds, an alert could warn the runner that something is going on.
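Such an alert could be sketched like this; the threshold, the consecutive-sample window, the function name and the data are all assumptions for illustration:

```python
import numpy as np

def residual_alert(actual, estimate, threshold=18.0, min_consecutive=3):
    """Return True if |actual - estimate| exceeds the threshold for at
    least `min_consecutive` consecutive samples."""
    out_of_band = np.abs(np.asarray(actual) - np.asarray(estimate)) > threshold
    run = 0
    for flag in out_of_band:
        run = run + 1 if flag else 0
        if run >= min_consecutive:
            return True
    return False

actual = [160, 162, 185, 186, 188, 165]
estimate = [161, 160, 160, 160, 160, 163]
print(residual_alert(actual, estimate))  # True: three consecutive residuals > 18
```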

Be aware that those kinds of alerts do not necessarily mean that something is wrong with your heart. They simply notify you that something is different with respect to the historical values that make up the training dataset.
What happens if you run on a totally new track?
What if the features we used for describing the heart rate are not enough?
I expect that the model should be able to predict something meaningful, but its precision might worsen.

This approach is really similar to what I have seen in predictive maintenance, and I think it could be applied to athletes with good results.

The GitHub repository for this project is available here.
If you want to see more estimate-residual plots you should open Final Portfolio Project.ipynb: you will find a widget for easily displaying them.
