Managing data with Python

This is the third week that I am sharing my experience with the Coursera Data Analysis and Interpretation Specialization. This week is dedicated to data management. Last week, I ran frequency distributions of HIV rate, polity score and urban rate. These distributions showed that the response values of each variable are too spread out to summarize usefully. For example, the HIV rate has 46 distinct response values.

To handle this issue, I decided to group the response values of the 3 variables as follows.

HIV rate values          New values
   Less than 1           1
   1 to 5                2
   Above 5 to 10         3
   Above 10 to 15        4
   Above 15 to 20        5
   Above 20              6
Polity score values      New values
   -10 to 0              1
   Above 0 to 5          2
   Above 5 to 10         3
Urban rate values        New values
   Below 25              1
   25 to 50              2
   Above 50 to 100       3
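As a side note, the polity score grouping above is made of right-closed ranges, so it can also be expressed with `pandas.cut`. The sketch below is only an illustration on made-up scores, not the code used for the results in this post; the names `scores` and `polity_class` are hypothetical.

```python
import numpy
import pandas

# Made-up polity scores for illustration only; NaN stands in for a missing score.
scores = pandas.Series([-10, -3, 0, 1, 5, 6, 10, numpy.nan])

# Right-closed bins: [-10, 0], (0, 5], (5, 10].
# include_lowest=True keeps the score -10 inside the first bin.
polity_class = pandas.cut(
    scores,
    bins=[-10, 0, 5, 10],
    labels=[1, 2, 3],
    include_lowest=True,
)

# Count how many scores fall in each class; the missing score is left out.
print(polity_class.value_counts(sort=False))
```

The HIV rate and urban rate groupings mix "less than" and "up to and including" boundaries, so they do not fit a single `pandas.cut` call as directly, which is one reason to write explicit functions for them.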

After importing the Gapminder data set in Python, I performed the recoding and ran frequency distributions of the new variables with the following code:

# Load the pandas and numpy packages
import pandas
import numpy

# Read the gapminder.csv data file
gapmind = pandas.read_csv("gapminder.csv", low_memory=False)

# Subset the three variables into a new data frame, gapmind1
# (.copy() avoids modifying a view of the original data frame)
gapmind1 = gapmind[['hivrate', 'polityscore', 'urbanrate']].copy()
a = gapmind1.head(n=10)
print(a)

# Make the variables numeric; entries that cannot be parsed become NaN
gapmind1['hivrate'] = pandas.to_numeric(gapmind1['hivrate'], errors='coerce')
gapmind1['polityscore'] = pandas.to_numeric(gapmind1['polityscore'], errors='coerce')
gapmind1['urbanrate'] = pandas.to_numeric(gapmind1['urbanrate'], errors='coerce')

# New hivrate variable, categorical 1 through 6
def hivclass(row):
    if row['hivrate'] < 1:
        return 1
    if row['hivrate'] >= 1 and row['hivrate'] <= 5:
        return 2
    if row['hivrate'] > 5 and row['hivrate'] <= 10:
        return 3
    if row['hivrate'] > 10 and row['hivrate'] <= 15:
        return 4
    if row['hivrate'] > 15 and row['hivrate'] <= 20:
        return 5
    if row['hivrate'] > 20:
        return 6

# New polityscore variable, categorical 1 through 3
def polityclass(row):
    if row['polityscore'] <= 0:
        return 1
    if row['polityscore'] > 0 and row['polityscore'] <= 5:
        return 2
    if row['polityscore'] > 5 and row['polityscore'] <= 10:
        return 3

# New urbanrate variable, categorical 1 through 3
def urbanclass(row):
    if row['urbanrate'] < 25:
        return 1
    if row['urbanrate'] >= 25 and row['urbanrate'] <= 50:
        return 2
    if row['urbanrate'] > 50 and row['urbanrate'] <= 100:
        return 3
# Apply each recoding function row by row; urbanclass must use the urbanclass function
gapmind1['hivclass'] = gapmind1.apply(hivclass, axis=1)
gapmind1['polityclass'] = gapmind1.apply(polityclass, axis=1)
gapmind1['urbanclass'] = gapmind1.apply(urbanclass, axis=1)

# Frequency distributions using the groupby function
# Percentages are taken over all rows, including rows with missing values
print('Frequency of HIV rate')
print('1= HIV rate less than 1%, 2= [1-5%[, 3= [5-10%[, 4=[10-15%[, 5=[15-20%[, 6= more than 20% HIV rate')
chivr = gapmind1.groupby('hivclass').size()
print(chivr)
phivr = gapmind1.groupby('hivclass').size() * 100 / len(gapmind1)
print(phivr)

print('Frequency of Polity score')
print('1=-10 to -1 score, 2= 0 to 5, 3 = 5 to 10 score')
cpol = gapmind1.groupby('polityclass').size()
print(cpol)
ppol = gapmind1.groupby('polityclass').size() * 100 / len(gapmind1)
print(ppol)

print('Frequency of Urban rate')
print('1= 10 to 25% urban rate, 2 = 25 to 50%, 3= 50 to 100%')
curb = gapmind1.groupby('urbanclass').size()
print(curb)
purb = gapmind1.groupby('urbanclass').size() * 100 / len(gapmind1)
print(purb)

The code generated the output displayed below:

Frequency of HIV rate
1= HIV rate less than 1%, 2= [1-5%[, 3= [5-10%[, 4=[10-15%[, 5=[15-20%[, 6= more than 20% HIV rate
hivclass
1    99
2    34
3    5
4    5
5    1
6    3
dtype: int64
hivclass
1    46.478873
2    15.962441
3    2.347418
4    2.347418
5    0.469484
6    1.408451
dtype: float64
Frequency of Polity score
1=-10 to -1 score, 2= 0 to 5, 3 = 5 to 10 score
polityclass
1    52
2    19
3    90
dtype: int64
polityclass
1    24.413146
2    8.920188
3    42.253521
dtype: float64
Frequency of Urban rate
1= 10 to 25% urban rate, 2 = 25 to 50%, 3= 50 to 100%
urbanclass
1    52
2    19
3    90
dtype: int64
urbanclass
1    24.413146
2    8.920188
3    42.253521
dtype: float64

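The output shows that 46.48% of countries have HIV rates below 1% and 1.41% have HIV rates above 20%. About 24% of countries have polity scores of 0 or below and 42% have polity scores above 5 and up to 10. These percentages are computed over all countries, including those with missing data, so they do not sum to 100% within a variable. With regard to urbanity, the output reports about 24% of countries with urban rates below 25% and 42% with urban rates between 50 and 100%. However, these counts are identical to the polity score counts, which indicates that the run behind this output built urbanclass with the polityclass function, so they should not be read as urban rate results.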
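The percentages in the output divide each class count by the number of rows in the data frame, which includes rows where the variable is missing. The sketch below redoes that arithmetic from the HIV counts reported above. The row total of 213 is not printed anywhere in the output; I am inferring it from 99 * 100 / 213 = 46.478873, so treat it as an assumption. The variable names are hypothetical.

```python
import pandas

# Counts per HIV class, copied from the output above.
hiv_counts = pandas.Series([99, 34, 5, 5, 1, 3], index=[1, 2, 3, 4, 5, 6])

# Assumed number of rows in the data frame (inferred, not printed in the output).
n_rows = 213

# Percentages as reported in the post: the denominator is every row.
pct_all_rows = hiv_counts * 100 / n_rows

# Alternative: the denominator is only the rows that received a class.
pct_classified = hiv_counts * 100 / hiv_counts.sum()

# Rows with no class, i.e. rows where the HIV rate is missing.
n_missing = n_rows - hiv_counts.sum()

print(pct_all_rows.round(2))
print(pct_classified.round(2))
print(n_missing)
```

Under that assumption, 66 countries have no HIV class, and the share of countries with HIV rates below 1% rises from 46.48% of all rows to about 67% of the countries that actually have data. Which denominator is appropriate depends on whether missing values should count as part of the population being described.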
