Running ANOVA with Python

This time, I learnt how to perform an analysis of variance (ANOVA) with Python. To try it by myself, I decide to check whether 1) there is an association between CO2 emissions and urban rate, and 2) there is an association between income per person and policy score.

I recoded the urban rate into a categorical variable by splitting it into its four quartiles: 1=25th%tile, 2=50%tile, 3=75%tile, and 4=100%tile. I recoded policy score into a categorical variable with 2 levels: policy score less or equal to 0 and policy score above 0.

The Python code I wrote test the hypothesis that 1) the CO2 emission means are equal across the four quartiles of urban rate, 2) the means income per person are equal for both the group of countries with policy score less or equal to 0 and the group of countries with policy score higher than 0.

Here is the Python code:

@author: DEGNINOU
“””
#Loading pandas and numpy packages
import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

#Read gapminder.csv data file
gapmind = pandas.read_csv(“gapminder.csv”, low_memory=False)

#Set PANDAS to show all columns in DataFrame
pandas.set_option(‘display.max_columns’, None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option(‘display.max_rows’, None)

# bug fix for display formats to avoid run time errors
pandas.set_option(‘display.float_format’, lambda x:’%f’%x)

#setting variables you will be working with to numeric
gapmind[‘urbanrate’] = gapmind[‘urbanrate’].convert_objects(convert_numeric=True)
gapmind[‘incomeperperson’] = gapmind[‘incomeperperson’].convert_objects(convert_numeric=True)
gapmind[‘co2emissions’] = gapmind[‘co2emissions’].convert_objects(convert_numeric=True)
gapmind[‘polityscore’] = gapmind[‘polityscore’].convert_objects(convert_numeric=True)

# quartile split (use qcut function & ask for 4 groups – gives you quartile split)
print (‘Urban rate – 4 categories – quartiles’)
gapmind[‘urbangrp’]=pandas.qcut(gapmind.urbanrate, 4, labels=[“1=25th%tile”,”2=50%tile”,”3=75%tile”,”4=100%tile”])
#c12 = gapmind[‘incomegrp’].value_counts(sort=False, dropna=True)
#print(c12)

#creating 2 level polityscore variable
def policygrp (row):
if row[‘polityscore’] <= 0 :
return 1
elif row[‘polityscore’] > 0 :
return 2
#Apply the new variable policygrp to the gapmind dataset
gapmind[‘policygrp’] = gapmind.apply (lambda row: policygrp (row),axis=1)

print (‘Frequency distribution of policygrp’)
c3= gapmind.groupby(‘policygrp’).size()
print (c3)

# using ols function for calculating the F-statistic and associated p value
#gapmind2 will be the subset used for this section
gapmind2 = gapmind[[‘incomeperperson’, ‘policygrp’]].dropna()
print (‘ANOVA of incomeperperson by policygrp’)
model2 = smf.ols(formula=’incomeperperson ~ C(policygrp)’, data=gapmind2).fit()
print (model2.summary())

print (‘means for incomeperperson by policyscore group’)
m2= gapmind2.groupby(‘policygrp’).mean()
print (m2)
print (‘standard deviations for for incomeperperson by policyscore group’)
sd2 = gapmind2.groupby(‘policygrp’).std()
print (sd2)

#gapmind1 will be the subset used for this section
gapmind1 = gapmind[[‘co2emissions’, ‘urbangrp’]].dropna()
print (‘ANOVA of co2emissions by urbangrp’)
model1 = smf.ols(formula=’co2emissions ~ C(urbangrp)’, data=gapmind1).fit()
print (model1.summary())

print (‘means for co2emissions by urbanrate groups’)
m1= gapmind1.groupby(‘urbangrp’).mean()
print (m1)
print (‘standard deviations for co2emissions by urban rate groups’)
sd1 = gapmind1.groupby(‘urbangrp’).std()
print (sd1)

print (‘ANOVA and Tukey HSD test of incomeperperson by policygrp’)
mc1 = multi.MultiComparison(gapmind1[‘co2emissions’], gapmind1[‘urbangrp’]).tukeyhsd()
#res1 = mc1.tukeyhsd()
print(mc1.summary())

The output:

Frequency distribution of policygrp
policygrp
1.000000 52
2.000000 109
dtype: int64

ANOVA of incomeperperson by policygrp
OLS Regression Results
=========================================================
Dep. Variable: incomeperperson R-squared: 0.031
Model: OLS Adj. R-squared: 0.024
Method: Least Squares F-statistic: 4.846
Date: Sat, 02 Jan 2016 Prob (F-statistic): 0.0292
Time: 12:32:02 Log-Likelihood: -1640.8
No. Observations: 155 AIC: 3286.
Df Residuals: 153 BIC: 3292.
Df Model: 1

means for incomeperperson by policyscore group
incomeperperson
policygrp
1.000000 4022.079787
2.000000 7728.437685
standard deviations for for incomeperperson by policyscore group
incomeperperson
policygrp
1.000000 7415.879260
2.000000 10445.651332

ANOVA of co2emissions by urbangrp
OLS Regression Results
=========================================================
Dep. Variable: co2emissions R-squared: 0.030
Model: OLS Adj. R-squared: 0.014
Method: Least Squares F-statistic: 1.909
Date: Sat, 02 Jan 2016 Prob (F-statistic): 0.130
Time: 12:32:02 Log-Likelihood: -4875.2
No. Observations: 192 AIC: 9758.
Df Residuals: 188 BIC: 9771.
Df Model: 3

means for co2emissions by urbanrate groups
co2emissions
urbangrp
1=25th%tile 919761099.285930
2=50%tile 2634822559.996200
3=75%tile 4793284870.770551
4=100%tile 12843154268.120001
standard deviations for co2emissions by urban rate groups
co2emissions
urbangrp
1=25th%tile 4454403852.274989
2=50%tile 14269607363.303205
3=75%tile 9832484516.577198
4=100%tile 49891034848.790657

ANOVA and Tukey HSD test of incomeperperson by policygrp
Multiple Comparison of Means – Tukey HSD,FWER=0.05
=========================================================
group1 group2 meandiff lower upper reject
———————————————————
1=25th%tile 2=50%tile 1715061460.71 -12002693688.6 15432816610.0 False
1=25th%tile 3=75%tile 3873523771.48 -9911888485.06 17658936028.0 False
1=25th%tile 4=100%tile 11923393168.8 -2080369448.0 25927155785.7 False
2=50%tile 3=75%tile 2158462310.77 -11414219438.1 15731144059.6 False
2=50%tile 4=100%tile 10208331708.1 -3586068104.91 24002731521.2 False
3=75%tile 4=100%tile 8049869397.35 -5811813435.11 21911552229.8 False
———————————————————

The ANOVA results show that, at α = 0.05 level of significance, there is a difference in the mean income per person for countries with low policy score (policy score less or equal to 0) as compared to countries with high policy scores (policy score greater than 0). The mean income per person for countries with low policy scores (4022.08, 7415.87 SD), is significantly lower than the mean income per person for countries with higher policy scores (7728.44, 10445.65 SD), at α = 0.05 level of significance (p-value = 0.03).

For the assessment of the association between CO2 emission and urban rate, the ANOVA results show that there is no difference in the means of CO2 emission across the 4 quartiles of urban, at α = 0.05 level of significance (p-value = 0.13).

 

Advertisements

One thought on “Running ANOVA with Python

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s