Chi Square Test with Python

This time, I would like to run a Chi-square (χ2) test by coding with Python in Anaconda’s Scientific PYthon Development EnviRonment (Spyder).

I am interested in assessing the association between income per person (2010 Gross Domestic Product per capita in constant 2000 US$) and life expectancy (2011 life expectancy at birth (years). The average number of years a newborn child would live if current mortality patterns were to stay the same). In other words, I want to test the hypothesis that income per person and life expectancy are independent using χ2 test of independence.

As my two variables of interest are continuous, I categorized income per person by splitting it into its 4 quartiles, and split life expectancy into two categorized: low life expectancy (below or equal 50 years) and high life expectancy (greater than 50 years).

Here is the code:

# -*- coding: utf-8 -*-
“””
Created on Sat Jan 9 17:50:16 2016

@author: DEGNINOU
“””
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

#Read gapminder.csv data file
gapmind = pandas.read_csv(“gapminder.csv”, low_memory=False)

#Set PANDAS to show all columns in DataFrame
pandas.set_option(‘display.max_columns’, None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option(‘display.max_rows’, None)

# bug fix for display formats to avoid run time errors
pandas.set_option(‘display.float_format’, lambda x:’%f’%x)

#setting variables I will be working with to numeric
gapmind[‘incomeperperson’] = gapmind[‘incomeperperson’].convert_objects(convert_numeric=True)
gapmind[‘lifeexpectancy’] = gapmind[‘lifeexpectancy’].convert_objects(convert_numeric=True)

# quartile split (use qcut function & ask for 4 groups – gives quartile split)
#print (‘incomeperperson – 4 categories – quartiles’)
gapmind[‘incomegrp’]=pandas.qcut(gapmind.incomeperperson, 4, labels=[“1=25th%tile”,”2=50%tile”,”3=75%tile”,”4=100%tile”])
#c12 = gapmind[‘incomegrp’].value_counts(sort=False, dropna=True)
#print(c12)

#print (‘lifeexpectancy – 4 categories – quartiles’)
gapmind[‘lifegrp’]=pandas.qcut(gapmind.lifeexpectancy, 4, labels=[“1=25th%tile”,”2=50%tile”,”3=75%tile”,”4=100%tile”])
#c13 = gapmind[‘lifegrp’].value_counts(sort=False, dropna=True)
#print(c13)

#print (‘co2emissions – 4 categories – quartiles’)
gapmind[‘co2grp’]=pandas.qcut(gapmind.co2emissions, 4, labels=[“1=CO2.25th%tile”,”2=CO2.50%tile”,”3=CO2.75%tile”,”4=CO2.100%tile”])
#c13 = gapmind[‘lifegrp’].value_counts(sort=False, dropna=True)
#print(c13)

#creating 2 level life expentancy variable
def life2grp (row):
if row[‘lifeexpectancy’] <= 50 :
return 0
elif row[‘lifeexpectancy’] > 50 :
return 1
#Apply the new variable life2grp to the gapmind dataset
gapmind[‘life2grp’] = gapmind.apply (lambda row: life2grp (row),axis=1)

print (“Contingency table of life2grp and incomegrp”)
print (“Life2grp: 1 Low life expentancy, 2 = high life expentancy”)
# contingency table of observed counts
ct1=pandas.crosstab(gapmind[‘life2grp’], gapmind[‘incomegrp’])
print (ct1)
# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)

# chi-square
print (‘chi-square value, p value, expected counts’)
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)

# graph percent with high life expentency within each income quartile range
seaborn.factorplot(x=”incomegrp”, y=”life2grp”, data=gapmind, kind=”bar”, ci=None)
plt.xlabel(‘Income per person quartiles’)
plt.ylabel(‘Proportion high life expetancy’)

recode1 = {‘1=25th%tile’:’1=25th%tile’, ‘2=50%tile’:’2=50%tile’}
gapmind[‘comp1’]= gapmind[‘incomegrp’].map(recode1)

# contingency table of observed counts
ct2=pandas.crosstab(gapmind[‘life2grp’], gapmind[‘comp1’])
print (ct2)
# column percentages
colsum=ct2.sum(axis=0)
colpct2=ct2/colsum
print(colpct2)

# chi-square
print (‘chi-square value, p value, expected counts’)
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)

recode2 = {‘1=25th%tile’:’1=25th%tile’, ‘3=75%tile’:’3=75%tile’}
gapmind[‘comp2’]= gapmind[‘incomegrp’].map(recode2)

# contingency table of observed counts
ct3=pandas.crosstab(gapmind[‘life2grp’], gapmind[‘comp2’])
print (ct3)
# column percentages
colsum=ct3.sum(axis=0)
colpct3=ct3/colsum
print(colpct3)

# chi-square
print (‘chi-square value, p value, expected counts’)
cs3= scipy.stats.chi2_contingency(ct3)
print (cs3)

recode3 = {‘1=25th%tile’:’1=25th%tile’, ‘4=100%tile’:’4=100%tile’}
gapmind[‘comp3’]= gapmind[‘incomegrp’].map(recode3)

# contingency table of observed counts
ct4=pandas.crosstab(gapmind[‘life2grp’], gapmind[‘comp3’])
print (ct4)
# column percentages
colsum=ct4.sum(axis=0)
colpct4=ct4/colsum
print(colpct4)

# chi-square
print (‘chi-square value, p value, expected counts’)
cs4= scipy.stats.chi2_contingency(ct4)
print (cs4)

recode4 = {‘2=50%tile’:’2=50%tile’, ‘3=75%tile’:’3=75%tile’}
gapmind[‘comp4’]= gapmind[‘incomegrp’].map(recode4)

# contingency table of observed counts
ct5=pandas.crosstab(gapmind[‘life2grp’], gapmind[‘comp4’])
print (ct5)
# column percentages
colsum=ct5.sum(axis=0)
colpct5=ct5/colsum
print(colpct5)

# chi-square
print (‘chi-square value, p value, expected counts’)
cs5= scipy.stats.chi2_contingency(ct5)
print (cs5)

recode6 = {‘3=75%tile’:’3=75%tile’, ‘4=100%tile’:’4=100%tile’}
gapmind[‘comp6’]= gapmind[‘incomegrp’].map(recode6)

# contingency table of observed counts
ct7=pandas.crosstab(gapmind[‘life2grp’], gapmind[‘comp6’])
print (ct7)
# column percentages
colsum=ct7.sum(axis=0)
colpct7=ct7/colsum
print(colpct7)

# chi-square
print (‘chi-square value, p value, expected counts’)
cs7= scipy.stats.chi2_contingency(ct7)
print (cs7)

Here as the relevant part of the output:

Life2grp: 1 Low life expentancy, 2 = high life expentancy
incomegrp 1=25th%tile 2=50%tile 3=75%tile 4=100%tile
life2grp
0.000000 7 1 0 0
1.000000 41 43 43 41
incomegrp 1=25th%tile 2=50%tile 3=75%tile 4=100%tile
life2grp
0.000000 0.145833 0.022727 0.000000 0.000000
1.000000 0.854167 0.977273 1.000000 1.000000
chi-square value, p value, expected counts
(15.670634920634921, 0.0013246322827279819, 3, array([[ 2.18181818, 2. , 1.95454545, 1.86363636],
[ 45.81818182, 42. , 41.04545455, 39.13636364]]))
comp1 1=25th%tile 2=50%tile
life2grp
0.000000 7 1
1.000000 41 43
comp1 1=25th%tile 2=50%tile
life2grp
0.000000 0.145833 0.022727
1.000000 0.854167 0.977273
chi-square value, p value, expected counts
(2.9686034451659462, 0.084895114122459217, 1, array([[ 4.17391304, 3.82608696],
[ 43.82608696, 40.17391304]]))
comp2 1=25th%tile 3=75%tile
life2grp
0.000000 7 0
1.000000 41 43
comp2 1=25th%tile 3=75%tile
life2grp
0.000000 0.145833 0.000000
1.000000 0.854167 1.000000
chi-square value, p value, expected counts
(4.8948138727390171, 0.026937477009160795, 1, array([[ 3.69230769, 3.30769231],
[ 44.30769231, 39.69230769]]))
comp3 1=25th%tile 4=100%tile
life2grp
0.000000 7 0
1.000000 41 41
comp3 1=25th%tile 4=100%tile
life2grp
0.000000 0.145833 0.000000
1.000000 0.854167 1.000000
chi-square value, p value, expected counts
(4.6331515484688826, 0.031359908810729059, 1, array([[ 3.7752809, 3.2247191],
[ 44.2247191, 37.7752809]]))
comp4 2=50%tile 3=75%tile
life2grp
0.000000 1 0
1.000000 43 43
comp4 2=50%tile 3=75%tile
life2grp
0.000000 0.022727 0.000000
1.000000 0.977273 1.000000
chi-square value, p value, expected counts
(0.00013367176360686402, 0.99077534303616221, 1, array([[ 0.50574713, 0.49425287],
[ 43.49425287, 42.50574713]]))
comp6 3=75%tile 4=100%tile
life2grp
1.000000 43 41
comp6 3=75%tile 4=100%tile
life2grp
1.000000 1.000000 1.000000
chi-square value, p value, expected counts
(0.0, 1.0, 0, array([[ 43., 41.]]))

The χ2 test shows that there is an association between income per person and life expectancy, at α = 0.05 level of confidence (p-value = 0.008). I further compare the quartile groups two by two with a Bonferroni adjusted level of confidence equal 0.008 (Bonferroni adjusted α = 0.05/6).

P-values of the two by two comparison of interquartile groups are presented in the table below:

  1=25th%tile 2=50%tile  3=75%tile  4=100%tile
1=25th%tile *      
2=50%tile  0.084 *    
3=75%tile  0.027 0.990 *  
4=100%tile 0.031 0.972 1.00 *

These p-values show that there is no association between the four quartile groups of income per person (when compared two by two) and life expectancy, at α = 0.008 Bonferroni adjusted level of confidence.

Advertisements

One thought on “Chi Square Test with Python

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s