CODECADEMY WALKTHROUGH: LIFE EXPECTANCY & GDP PROJECT

Michele Alberti
8 min read · Feb 12, 2021

I started the Data Science path on Codecademy a few months ago and I am enjoying this new learning experience.

Now I am facing an exciting new step: posting my work on Medium is the final act of this project.

Photo by Nattanan Kanchanaprat from Pixabay

This walkthrough is the outcome of the data visualization portfolio project. It is also available as a Jupyter notebook on GitHub (link at the end of the story). I kept the same structure as the notebook, so code and descriptions follow the same cell-like pattern.
The data provided for the project are life expectancy (in years) and gross domestic product (GDP, in dollars) for a few countries, taken from World Health Organization (WHO) records.
To better understand the meaning of each variable, have a look at Wikipedia’s pages for GDP and life expectancy.

Although data visualization is the focus of this project, I also added some statistical tests, another part of the Data Science learning path.

Before stating the scope of the project it is important to look at the available data.

Exploratory Data Analysis

Data are provided by Codecademy as part of the project in a file named all_data.csv: it is time to get acquainted with our dataset!

Import Statements

Before loading data from all_data.csv relevant packages have to be imported.

Load CSV

Now we are ready to load the CSV into the WHO_df dataframe with pd.read_csv. Renaming Life expectancy at birth (years) to the friendlier Life_Exp is a good way to improve code readability.
A column with GDP in trillions of dollars (10¹²) is then added to make GDP values easier to read.
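The loading step described above can be sketched as follows. This is a minimal example, not the notebook's exact code: the column names Country, Year and GDP, the new column name GDP_tn and the helper load_who_data are assumptions made for illustration, and the two sample rows are invented so that the snippet runs without all_data.csv.

```python
import io

import pandas as pd


def load_who_data(source):
    """Load the project CSV and tidy it up (hypothetical helper)."""
    df = pd.read_csv(source)
    # Shorten the verbose life expectancy column name
    df = df.rename(columns={"Life expectancy at birth (years)": "Life_Exp"})
    # Express GDP in trillions of dollars (1 tn = 1e12); GDP_tn is an assumed name
    df["GDP_tn"] = df["GDP"] / 1e12
    return df


# Two invented rows standing in for all_data.csv (assumed column layout)
sample_csv = io.StringIO(
    "Country,Year,Life expectancy at birth (years),GDP\n"
    "Chile,2000,77.0,7.0e10\n"
    "Chile,2001,77.5,7.5e10\n"
)

WHO_df = load_who_data(sample_csv)
print(WHO_df.head())
```

With the real file, the call would simply be load_who_data("all_data.csv").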


Descriptive Statistics

We can then look at some descriptive statistical values.


There are six countries, each with 16 years of records (from 2000 to 2015). Average life expectancy is about 72 years, but the quartiles show that the mean is probably pulled down by one or more countries with a low life expectancy.
The GDP distribution also has at least one country with very low values (see the min row), while the interquartile range spans from $0.173tn to $4.068tn.
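The figures discussed above can be obtained with a few pandas calls. The sketch below uses a tiny invented dataframe (countries "A" and "B" are placeholders) and assumes the column names Life_Exp and GDP_tn; it only shows which calls give the number of countries, the years per country, the summary table and the interquartile range.

```python
import pandas as pd

# Invented values for illustration only, not the WHO records
WHO_df = pd.DataFrame(
    {
        "Country": ["A", "A", "B", "B"],
        "Year": [2000, 2001, 2000, 2001],
        "Life_Exp": [70.0, 71.0, 50.0, 52.0],
        "GDP_tn": [1.0, 1.2, 0.01, 0.02],
    }
)

# How many countries, and how many distinct years for each of them
n_countries = WHO_df["Country"].nunique()
years_per_country = WHO_df.groupby("Country")["Year"].nunique()

# count, mean, std, min, quartiles and max for the numeric columns of interest
summary = WHO_df[["Life_Exp", "GDP_tn"]].describe()

# First and third quartile of GDP, i.e. the bounds of the interquartile range
gdp_iqr_bounds = WHO_df["GDP_tn"].quantile([0.25, 0.75])

print(n_countries)
print(years_per_country)
print(summary)
print(gdp_iqr_bounds)
```

Comparing the mean with the median (the 50% row) in the summary is a quick way to spot the kind of skew mentioned above.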

Overall View

The first pairplot confirms previous statements.

[Figure: overall pairplot]

A second plot without Zimbabwe highlights differences between other nations.

[Figure: same pairplot without Zimbabwe]

Looking at the variable densities and the paired scatterplots, it is possible to notice that:

  • Zimbabwe had a profound crisis which affected GDP and life expectancy (more info here): the way its life expectancy varied is unique in this dataset, with a downward trend followed by a great improvement.
  • The USA and China show the same positive trend, although China’s is delayed by some years.
  • Even though Germany has a lower GDP, its life expectancy is better than those of the USA and China: a wealthier economy may entail an improved life expectancy, but other factors clearly have an effect.
  • Chile is another country with a low GDP but a relatively high life expectancy.
  • Mexico shows intermediate values in terms of both GDP and life expectancy.
  • China’s economy seems to be flourishing, but its healthcare does not appear to be at the same level as that of other nations.

It seems that a flourishing economy improves overall health conditions, but other factors may have an important influence.

Let’s compare Germany and Chile with the USA and China.

Germany, for example, developed an efficient healthcare system based on national insurance, without the imbalances of the American system.
US healthcare is more influenced by economic status: access to better medical care is easier for wealthy people.
The German system has some guarantees that also protect people with fewer financial means (a comparison is available here).

It is impressive that Chile records a better life expectancy than the USA: it is unlikely that Chile has developed a much better infrastructure (although Chilean healthcare is historically one of the best in Latin America).
Cultural differences, eating habits, lifestyle, education, etc. probably combine with GDP in determining life expectancy.

Project Scope

We would like to assess whether there is a correlation between GDP and life expectancy, and find out how strong this relationship is.
In addition, we would like to know whether life expectancy is statistically different between wealthy countries and those with a lower GDP.

We expect to find a correlation, although it may be weaker than we would imagine.
The outcome of the significance test on life expectancy is less predictable, since some nations contradict the hypothesis that a higher GDP is connected to a better life expectancy.

Correlation between GDP and life expectancy

Since GDP and life expectancy are non-normal, the Kendall τ correlation coefficient (kendalltau) is used for measuring correlation.
Normality is tested with normaltest from scipy.stats.
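A normality check with normaltest might look like the sketch below. The helper is_non_normal and the 0.05 threshold are assumptions for illustration, and the input is a deliberately skewed artificial series of 96 values (the same size as six countries times 16 years), not the project data.

```python
import numpy as np
from scipy import stats


def is_non_normal(values, alpha=0.05):
    """Return True when normaltest rejects normality at the given level.

    normaltest combines skewness and kurtosis (D'Agostino-Pearson test);
    its null hypothesis is that the sample comes from a normal distribution.
    """
    statistic, p_value = stats.normaltest(values)
    return bool(p_value < alpha)


# Artificial, strongly right-skewed series: exponentially growing values
skewed = np.exp(np.linspace(0, 5, 96))

print(is_non_normal(skewed))
```

When the check flags a variable as non-normal, a rank-based coefficient such as Kendall τ is a reasonable choice because it does not assume normality.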


After verifying that GDP and life expectancy have non-normal distributions, Kendall τ is evaluated.

KENDALL TAU: 0.370
p-value: 9.925e-08
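Values like the ones printed above come from scipy.stats.kendalltau, which returns the coefficient together with a p-value. The sketch below uses six invented pairs (not the project data) and a hypothetical wrapper kendall_correlation, just to show the call and the print format.

```python
from scipy import stats


def kendall_correlation(x, y):
    """Return Kendall tau and its p-value for two equally long sequences."""
    tau, p_value = stats.kendalltau(x, y)
    return tau, p_value


# Invented values: GDP grows steadily, life expectancy has one out-of-order pair
gdp = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
life = [60.0, 65.0, 63.0, 70.0, 72.0, 75.0]

tau, p_value = kendall_correlation(gdp, life)
print(f"KENDALL TAU: {tau:.3f}")
print(f"p-value: {p_value:.3e}")
```

In this toy series 14 of the 15 possible pairs are ordered the same way in both variables and one is reversed, so τ = (14 - 1) / 15.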

The Kendall τ highlights a positive correlation between the two variables, and the small p-value indicates that it is statistically significant.
Values of τ close to 1 mean strong agreement between the two rankings, while values close to -1 indicate strong disagreement.
With τ = 0.370, a moderate agreement is thus found between GDP and Life_Exp: the following scatter plot agrees with this conclusion.

[Figure: scatter plot of GDP (logarithmic scale) against life expectancy]

A higher GDP seems connected to a better life expectancy, but this is not the only influencing factor.
Chile has a lower gross domestic product than the USA or China, yet it shows a good life expectancy.
Chile’s life expectancy is comparable with Germany’s, even though Chile’s GDP is 10 times smaller.
This fact supports the finding of a correlation that is only moderate.

Note: the logarithmic scale for GDP improves readability. To see how it influences the visual impact, try commenting out plt.yscale('log') in the Jupyter notebook.
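The effect of the logarithmic axis can be tried with a sketch like this one. It uses plain matplotlib and invented values (countries "A" and "B" are placeholders), so it is not necessarily how the notebook builds its figure; ax.set_yscale("log") is the axes-level counterpart of plt.yscale('log').

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Invented values: one economy is two orders of magnitude larger than the other
df = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "B", "B"],
        "Life_Exp": [60.0, 61.0, 62.0, 78.0, 79.0, 80.0],
        "GDP_tn": [0.01, 0.012, 0.015, 2.0, 2.5, 3.0],
    }
)

fig, ax = plt.subplots()
for country, group in df.groupby("Country"):
    ax.scatter(group["Life_Exp"], group["GDP_tn"], label=country)

# Comment out the next line to see small GDP values flattened onto the x axis
ax.set_yscale("log")
ax.set_xlabel("Life expectancy (years)")
ax.set_ylabel("GDP (trillions of dollars)")
ax.legend()
```

On a linear axis the points of the smaller economy are squeezed near zero, while the log scale spreads both groups over a readable range.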

Hypothesis testing: wealthy countries

Is there statistical evidence that nations with higher GDPs have a higher life expectancy?
This alternative hypothesis is tested by means of a 2-sample t-test:

  • NULL HYPOTHESIS: nations with a higher GDP do not have a greater life expectancy
  • ALTERNATIVE HYPOTHESIS: nations with a higher GDP have a greater life expectancy

This test will compare the following groups:

  • high_GDP: United States of America, China, Germany
  • low_GDP: Mexico, Chile

Zimbabwe is not included in the low_GDP group because its economy and healthcare were affected by a profound crisis. This condition is not comparable with those of the other nations in the provided dataset.
Adding this country to the low_GDP group would probably bias the result of the test in favour of wealthy nations: the previous section analyzed the whole dataset, while now we want to focus on countries with similar external and social conditions, in order to have a meaningful comparison.

Sixteen years are enough for laws to take effect and to change, at least partially, some of a nation’s social systems.
To avoid time-related effects, we focus on the most recent records: only data from 2015 are considered.
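The group selection and the test described above can be sketched as follows. The helper compare_groups, the column names and all life expectancy values are invented for illustration; only the group membership and the 2015 filter come from the text.

```python
import pandas as pd
from scipy import stats

HIGH_GDP = ["United States of America", "China", "Germany"]
LOW_GDP = ["Mexico", "Chile"]


def compare_groups(df, year=2015):
    """Run a 2-sample t-test on life expectancy for a single year."""
    recent = df[df["Year"] == year]
    high = recent.loc[recent["Country"].isin(HIGH_GDP), "Life_Exp"]
    low = recent.loc[recent["Country"].isin(LOW_GDP), "Life_Exp"]
    # ttest_ind returns a two-sided p-value by default
    t_stat, p_value = stats.ttest_ind(high, low)
    return t_stat, p_value, len(high), len(low)


# Invented values, not the WHO records; Zimbabwe and 2014 rows get filtered out
df = pd.DataFrame(
    {
        "Country": HIGH_GDP + LOW_GDP + ["Zimbabwe", "Chile"],
        "Year": [2015, 2015, 2015, 2015, 2015, 2015, 2014],
        "Life_Exp": [80.0, 75.0, 81.0, 76.0, 79.0, 60.0, 78.0],
    }
)

t_stat, p_value, n_high, n_low = compare_groups(df)
print(f"Significance level: 0.05")
print(f"p-value: {p_value:.2e}")
print(f"t-statistic: {t_stat:.2f}")
```

Since the alternative hypothesis here is one-sided, SciPy 1.6 or later also accepts alternative="greater" in ttest_ind, which may be worth trying as a variant. With only three and two observations per group, any result should be read with caution.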

2-sample t-test is not significant:
there is no evidence that nations with an higher GDP have an higher life expectancy.

Significance level: 0.05
p-value: 9.37e-01
t-statistic: 0.09

We therefore fail to reject the null hypothesis: the data do not support the alternative hypothesis. The following boxplot confirms this result.

[Figure: boxplot of life expectancy for the high_GDP and low_GDP groups]

The two boxplots are really similar in terms of interquartile range and median. The whiskers for wealthy countries extend further than those for countries with a lower GDP: this is a consequence of the life expectancies of Germany and China (which show high and low values, respectively).
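A side-by-side boxplot of the two groups can be drawn along these lines. The sketch uses plain matplotlib and invented values, so it is an assumption about the approach rather than the notebook's own plotting code.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt

# Invented 2015 life expectancies for each group (not the WHO records)
high_gdp_life = [80.0, 75.0, 81.0]
low_gdp_life = [76.0, 79.0]

fig, ax = plt.subplots()
# One box per group; the returned dict holds the drawn artists
result = ax.boxplot([high_gdp_life, low_gdp_life])
ax.set_xticklabels(["high_GDP", "low_GDP"])
ax.set_ylabel("Life expectancy (years)")
```

Overlapping boxes with close medians are the visual counterpart of a non-significant t-test, although with so few points per group the boxes are only a rough summary.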

Conclusions

The gross domestic product is related to the overall wealth generated by the economy of a country.
A flourishing economy attracts investors: the consequent availability of funds may lead to better healthcare.
A constant flow of money is not sufficient, though: authorities have to set a clear investment strategy for the system to work efficiently. If a nation focuses its efforts and investments properly, it can achieve better results in terms of medical care than wealthier nations (e.g. Germany).
In addition, other factors (political stability or lifestyle, for example) can undermine the effectiveness of a good infrastructure, with the result that unexpected countries can outperform wealthier nations just by exploiting a favourable context (this may be the case of Chile showing better results than the USA in terms of life expectancy).

Life expectancy and GDP seem correlated, but there is much more to consider when looking at how they relate.

In fact, the hypothesis test found no significant difference between two comparable groups of nations (low GDPs vs high GDPs).
It is worth mentioning that all nations included in the Codecademy dataset are relatively similar, with Zimbabwe being the only exception.
It is the only African country in our dataframe.

Take a look at the following image (from Wikipedia) with world’s GDPs for 2014:

[Figure: world map of GDP by country, 2014 (from Wikipedia)]

Africa has the highest concentration of countries with really low GDPs (and Zimbabwe is among them).
We would need more countries similar to Zimbabwe to get a realistic picture of the relationship between GDP and life expectancy. This is the reason why Zimbabwe was left out of the low_GDP group in "Hypothesis testing: wealthy countries".
We decided to see whether a difference existed between countries that are similar from a socio-political point of view, rather than highlighting the (obvious) effect of Zimbabwe on the low_GDP group (you can try to run the hypothesis test with Zimbabwe included in low_gdp_df to see if the result changes).

We have seen that there is no evidence of a statistical difference between the selected groups in terms of life expectancy. Note that Germany, China, the USA, Chile and Mexico have relatively similar colors in the 2014 GDP map, even if the difference in terms of value is meaningful (German GDP is twice as high as the Mexican one).
We do not know whether including nations with a larger gap in GDP would highlight a relevant difference in terms of life expectancy (as the case of Zimbabwe suggests).

The following violin plot gives an idea of the divide between countries with a weaker economy like Zimbabwe and the rest of the world.

[Figure: violin plot comparing Zimbabwe with the other countries]
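A violin plot of this kind can be sketched with matplotlib as shown below. It assumes the plot shows the distribution of life expectancy per country, which the text does not state explicitly, and it uses invented values for three countries only.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Invented values (not the WHO records): one country sits far below the others
df = pd.DataFrame(
    {
        "Country": ["Zimbabwe"] * 4 + ["Chile"] * 4 + ["Germany"] * 4,
        "Life_Exp": [45.0, 47.0, 52.0, 58.0, 77.0, 78.0, 79.0, 80.0,
                     78.0, 79.0, 80.0, 81.0],
    }
)

countries = sorted(df["Country"].unique())
samples = [df.loc[df["Country"] == c, "Life_Exp"].to_numpy() for c in countries]

fig, ax = plt.subplots()
# One violin per country; positions default to 1..n
parts = ax.violinplot(samples, showmedians=True)
ax.set_xticks(range(1, len(countries) + 1))
ax.set_xticklabels(countries)
ax.set_ylabel("Life expectancy (years)")
```

Each violin shows the spread of one country's records, so a country whose whole distribution lies apart from the others stands out immediately.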

To obtain more meaningful results we should broaden the analysis to a wider variety of countries, at least by increasing the number of nations that are really different from the strong economies that are part of this dataset.

The GitHub repository for this project is available here.
