Pearson Correlation with Python

This blog post is dedicated to what I learnt from the Pearson correlation lesson of the Coursera course Data Analysis Tools, provided by Wesleyan University. The course addresses correlation analysis using a Python script. To try it myself, I decided to assess the relationship between four development indicators (income per person, life expectancy, polity score, and urban rate) and HIV rate.

Here is the code I came up with:

# -*- coding: utf-8 -*-
"""
Created on Sat Jan 9 21:24:33 2016

@author: DEGNINOU
"""

import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

# Read the gapminder.csv data file
gapmind = pandas.read_csv("gapminder.csv", low_memory=False)

# Set pandas to show all columns in a DataFrame
pandas.set_option('display.max_columns', None)
# Set pandas to show all rows in a DataFrame
pandas.set_option('display.max_rows', None)

# Bug fix for display formats to avoid run-time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Convert the variables I will be working with to numeric.
# convert_objects() has been removed from recent pandas versions;
# pandas.to_numeric(errors='coerce') turns non-numeric entries into NaN.
gapmind['incomeperperson'] = pandas.to_numeric(gapmind['incomeperperson'], errors='coerce')
gapmind['polityscore'] = pandas.to_numeric(gapmind['polityscore'], errors='coerce')
gapmind['lifeexpectancy'] = pandas.to_numeric(gapmind['lifeexpectancy'], errors='coerce')
gapmind['hivrate'] = pandas.to_numeric(gapmind['hivrate'], errors='coerce')
gapmind['urbanrate'] = pandas.to_numeric(gapmind['urbanrate'], errors='coerce')

# Set blank entries to NaN (kept as a safeguard; the numeric conversion
# above normally turns blanks into NaN already)
gapmind['polityscore'] = gapmind['polityscore'].replace(' ', numpy.nan)
gapmind['incomeperperson'] = gapmind['incomeperperson'].replace(' ', numpy.nan)
gapmind['lifeexpectancy'] = gapmind['lifeexpectancy'].replace(' ', numpy.nan)
gapmind['hivrate'] = gapmind['hivrate'].replace(' ', numpy.nan)
gapmind['urbanrate'] = gapmind['urbanrate'].replace(' ', numpy.nan)

# Scatterplots with a fitted regression line. Each plot gets its own figure
# so that the four plots are not drawn on top of each other.
plt.figure()
scat1 = seaborn.regplot(x="incomeperperson", y="hivrate", fit_reg=True, data=gapmind)
plt.xlabel('Income per person')
plt.ylabel('HIV rate')
plt.title('Scatterplot for the Association Between Income per person and HIV rate')

plt.figure()
scat2 = seaborn.regplot(x="lifeexpectancy", y="hivrate", fit_reg=True, data=gapmind)
plt.xlabel('Life expectancy')
plt.ylabel('HIV rate')
plt.title('Scatterplot for the Association Between Life Expectancy and HIV rate')

plt.figure()
scat3 = seaborn.regplot(x="polityscore", y="hivrate", fit_reg=True, data=gapmind)
plt.xlabel('Polity score')
plt.ylabel('HIV rate')
plt.title('Scatterplot for the Association Between Polity score and HIV rate')

plt.figure()
scat4 = seaborn.regplot(x="urbanrate", y="hivrate", fit_reg=True, data=gapmind)
plt.xlabel('Urban rate')
plt.ylabel('HIV rate')
plt.title('Scatterplot for the Association Between Urban rate and HIV rate')

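# pearsonr cannot handle NaN, so drop incomplete rows first.
# Note: dropna() removes a row if ANY column in the file is missing,
# not only the five variables used here.
gapmind_clean = gapmind.dropna()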

# pearsonr returns the correlation coefficient r and its two-sided p-value
print('Association Between Income per person and HIV rate')
print(scipy.stats.pearsonr(gapmind_clean['incomeperperson'], gapmind_clean['hivrate']))

print('Association Between Life expectancy and HIV rate')
print(scipy.stats.pearsonr(gapmind_clean['lifeexpectancy'], gapmind_clean['hivrate']))

print('Association Between Polity score and HIV rate')
print(scipy.stats.pearsonr(gapmind_clean['polityscore'], gapmind_clean['hivrate']))

print('Association Between Urban rate and HIV rate')
print(scipy.stats.pearsonr(gapmind_clean['urbanrate'], gapmind_clean['hivrate']))
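
The script drops every row that has a missing value in any column before computing the correlations. One possible alternative, sketched here on a small made-up DataFrame rather than on the Gapminder file, is to drop missing values separately for each pair of variables, so that each correlation uses every observation available for that pair. The helper name pairwise_pearson and the toy values are my own assumptions for illustration; this is not the course's code, and applying it to the real data would probably give slightly different numbers from the ones reported in this post.

```python
import numpy
import pandas
import scipy.stats


def pairwise_pearson(df, predictors, response):
    """Return {predictor: (r, p_value, n)}, using only the rows where
    both the predictor and the response are present."""
    results = {}
    for col in predictors:
        pair = df[[col, response]].dropna()  # drop NaN for this pair only
        r, p = scipy.stats.pearsonr(pair[col], pair[response])
        results[col] = (float(r), float(p), len(pair))
    return results


# Toy data (hypothetical values, not taken from gapminder.csv)
toy = pandas.DataFrame({
    'x1': [1.0, 2.0, 3.0, 4.0, numpy.nan, 6.0],
    'x2': [2.0, numpy.nan, 1.0, 5.0, 3.0, 4.0],
    'y': [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
})

results = pairwise_pearson(toy, ['x1', 'x2'], 'y')
for name, (r, p, n) in results.items():
    print('%s vs y: r=%.2f, p=%.3f, n=%d' % (name, r, p, n))
```

Returning n alongside r and the p-value makes it visible that each correlation may rest on a different number of observations, which is the trade-off of dropping missing values pair by pair.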

The scatterplots obtained suggest that higher values of all four indicators are associated with a lower HIV rate.

Below is the output displaying Pearson Correlation coefficients and their p-values:

[Image: screenshot of the script output]

Correlation coefficients show weak negative associations between HIV rate and income per person (r = -0.20), polity score (r = -0.09), and urban rate (r = -0.28), and a moderate negative association between HIV rate and life expectancy (r = -0.58). At the α = 0.05 significance level, the p-values show that the negative associations with income per person (p = 0.02), life expectancy (p < 0.001), and urban rate (p = 0.001) are statistically significant, whereas the negative association with polity score (p = 0.30) is not.
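
To make the interpretation more concrete, the short sketch below squares each reported coefficient. In a simple linear relationship, r squared is the fraction of the variability in one variable that is accounted for by the other. It also compares each p-value with α = 0.05. The numbers are the ones quoted in this post; since the life expectancy p-value is only reported as "< 0.001", the value 0.001 is used as an upper bound, which is an assumption made purely for illustration.

```python
# Coefficients and p-values as reported in the text.
# The life expectancy p-value is only given as "< 0.001", so 0.001 is
# used as an upper bound here (an assumption for illustration).
reported = {
    'income per person': (-0.20, 0.02),
    'life expectancy': (-0.58, 0.001),
    'polity score': (-0.09, 0.30),
    'urban rate': (-0.28, 0.001),
}
ALPHA = 0.05

summary = {}
for name, (r, p) in reported.items():
    r_squared = r ** 2          # fraction of variability accounted for
    significant = p < ALPHA     # reject the null hypothesis of no correlation?
    summary[name] = (r_squared, significant)
    print('%-18s r=%.2f  r^2=%.2f  significant at %.2f: %s'
          % (name, r, r_squared, ALPHA, significant))
```

Read this way, even the strongest association here, life expectancy, accounts for roughly a third of the variability in HIV rate (0.58 squared is about 0.34), and the others for well under a tenth.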


3 thoughts on “Pearson Correlation with Python”

  1. Thank you for a great post. I did not know that you could change the type of the variables in a pandas DataFrame (e.g. to numeric). That is really useful. Also, I learned how to do correlation analysis using NumPy. Really useful information. Pandas is much like R’s data frame and I like it. Python is great.
