### Managing data with Python

This is the third week I am sharing my experience with the Coursera Data Analysis and Interpretation Specialization. This week is dedicated to data management. Last week, I ran frequency distributions of HIV rate, polity score, and urban rate. They showed that each variable takes too many distinct values for a frequency table to be readable. For example, the HIV rate has 46 distinct response values.

To handle this, I grouped the response values of the three variables as follows.

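| Variable | Original values | New value |
| --- | --- | --- |
| HIV rate | Less than 1 | 1 |
| HIV rate | 1 to 5 | 2 |
| HIV rate | Above 5 to 10 | 3 |
| HIV rate | Above 10 to 15 | 4 |
| HIV rate | Above 15 to 20 | 5 |
| HIV rate | Above 20 | 6 |
| Polity score | -10 to 0 | 1 |
| Polity score | Above 0 to 5 | 2 |
| Polity score | Above 5 to 10 | 3 |
| Urban rate | Below 25 | 1 |
| Urban rate | Between 25 and 50 | 2 |
| Urban rate | Between 50 and 100 | 3 |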
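As a side note, the grouping above can also be written without row-by-row functions. The sketch below is only an illustration on made-up toy values (the `toy` data frame is hypothetical, not the course data): `pandas.cut` fits the polity score directly because every class includes its upper bound, while the HIV rate needs `numpy.select` because its first class stops just short of 1.

```python
import numpy as np
import pandas as pd

# Toy values, made up for illustration (not taken from the data set)
toy = pd.DataFrame({
    "polityscore": [-10, -3, 0, 1, 5, 6, 10, np.nan],
    "hivrate": [0.1, 1.0, 5.0, 5.1, 12.0, 20.0, 25.9, np.nan],
})

# Polity score: [-10, 0] -> 1, (0, 5] -> 2, (5, 10] -> 3.
# include_lowest=True keeps the value -10 inside the first class.
toy["polityclass"] = pd.cut(
    toy["polityscore"],
    bins=[-10, 0, 5, 10],
    labels=[1, 2, 3],
    include_lowest=True,
)

# HIV rate: the first matching condition wins, so the order matters.
# Missing values match no condition and fall through to the default.
hiv = toy["hivrate"]
toy["hivclass"] = np.select(
    [hiv < 1, hiv <= 5, hiv <= 10, hiv <= 15, hiv <= 20, hiv > 20],
    [1, 2, 3, 4, 5, 6],
    default=np.nan,
)

print(toy)
```

Both approaches leave missing values as missing, so they drop out of a later `groupby` in the same way as with the row-wise functions.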

After importing the Gapminder data set in Python, I recoded the variables and ran frequency distributions of the new variables with the following code:

import pandas
import numpy

# gapmind is assumed to be already loaded as a pandas DataFrame
# subset the three variables into a new data frame, gapmind1
gapmind1 = gapmind[['hivrate', 'polityscore', 'urbanrate']].copy()
print(gapmind1)

# make the variables numeric; entries that cannot be parsed become NaN
# (convert_objects was removed from pandas, so pandas.to_numeric is used)
gapmind1['hivrate'] = pandas.to_numeric(gapmind1['hivrate'], errors='coerce')
gapmind1['polityscore'] = pandas.to_numeric(gapmind1['polityscore'], errors='coerce')
gapmind1['urbanrate'] = pandas.to_numeric(gapmind1['urbanrate'], errors='coerce')

# new hivrate variable, categorical 1 through 6
# (rows with a missing hivrate match no condition and return None)
def hivclass(row):
    if row['hivrate'] < 1:
        return 1
    if row['hivrate'] >= 1 and row['hivrate'] <= 5:
        return 2
    if row['hivrate'] > 5 and row['hivrate'] <= 10:
        return 3
    if row['hivrate'] > 10 and row['hivrate'] <= 15:
        return 4
    if row['hivrate'] > 15 and row['hivrate'] <= 20:
        return 5
    if row['hivrate'] > 20:
        return 6

# new polityscore variable, categorical 1 through 3
def polityclass(row):
    if row['polityscore'] <= 0:
        return 1
    if row['polityscore'] > 0 and row['polityscore'] <= 5:
        return 2
    if row['polityscore'] > 5 and row['polityscore'] <= 10:
        return 3

# new urbanrate variable, categorical 1 through 3
def urbanclass(row):
    if row['urbanrate'] < 25:
        return 1
    if row['urbanrate'] >= 25 and row['urbanrate'] <= 50:
        return 2
    if row['urbanrate'] > 50 and row['urbanrate'] <= 100:
        return 3

# apply each recoding function row by row
gapmind1['hivclass'] = gapmind1.apply(hivclass, axis=1)
gapmind1['polityclass'] = gapmind1.apply(polityclass, axis=1)
# note: the run shown in the output below used polityclass for this column by mistake
gapmind1['urbanclass'] = gapmind1.apply(urbanclass, axis=1)

# Frequency distributions using groupby; rows with a missing class are
# dropped from the counts but still counted in len(gapmind1)
print('Frequency of HIV rate')
print('1= HIV rate less than 1%, 2= [1-5%[, 3= [5-10%[, 4=[10-15%[, 5=[15-20%[, 6= more than 20% HIV rate')
chivr = gapmind1.groupby('hivclass').size()
print(chivr)
phivr = gapmind1.groupby('hivclass').size() * 100 / len(gapmind1)
print(phivr)

print('Frequency of Polity score')
print('1=-10 to -1 score, 2= 0 to 5, 3 = 5 to 10 score')
cpol = gapmind1.groupby('polityclass').size()
print(cpol)
ppol = gapmind1.groupby('polityclass').size() * 100 / len(gapmind1)
print(ppol)

print('Frequency of Urban rate')
print('1= 10 to 25% urban rate, 2 = 25 to 50%, 3= 50 to 100%')
curb = gapmind1.groupby('urbanclass').size()
print(curb)
purb = gapmind1.groupby('urbanclass').size() * 100 / len(gapmind1)
print(purb)

The code generated the output displayed below:

Frequency of HIV rate
1= HIV rate less than 1%, 2= [1-5%[, 3= [5-10%[, 4=[10-15%[, 5=[15-20%[, 6= more than 20% HIV rate
hivclass
1    99
2    34
3    5
4    5
5    1
6    3
dtype: int64
hivclass
1    46.478873
2    15.962441
3    2.347418
4    2.347418
5    0.469484
6    1.408451
dtype: float64
Frequency of Polity score
1=-10 to -1 score, 2= 0 to 5, 3 = 5 to 10 score
polityclass
1    52
2    19
3    90
dtype: int64
polityclass
1    24.413146
2    8.920188
3    42.253521
dtype: float64
Frequency of Urban rate
1= 10 to 25% urban rate, 2 = 25 to 50%, 3= 50 to 100%
urbanclass
1    52
2    19
3    90
dtype: int64
urbanclass
1    24.413146
2    8.920188
3    42.253521
dtype: float64

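The output shows that 46.48% of the 213 countries have HIV rates below 1% and 1.41% have HIV rates above 20%. About 24% of countries have polity scores of 0 or below, and 42% have polity scores above 5 and up to 10. These percentages are computed over all countries, including those with missing values, so they do not sum to 100%. The urban rate figures are identical to the polity score figures because the urbanclass column was built with the polityclass function, so they do not describe urban rates; the urban frequencies have to be rerun with urbanclass before they can be interpreted.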
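To make the effect of the denominator concrete, here is a minimal sketch that rebuilds the HIV class column from the counts printed above. The total of 213 rows is inferred from 99 / 213 = 46.48%, and treating the remaining 66 rows as missing is an assumption of this sketch. It compares the percentage of all rows with the percentage of rows that have a known class, using `value_counts`.

```python
import numpy as np
import pandas as pd

# Class counts copied from the HIV output above
counts = {1: 99, 2: 34, 3: 5, 4: 5, 5: 1, 6: 3}
n_total = 213  # inferred: 99 * 100 / 213 = 46.478873

# Rebuild a column with one entry per country; the rest are assumed missing
values = [cls for cls, n in counts.items() for _ in range(n)]
values += [np.nan] * (n_total - len(values))
hivclass = pd.Series(values, name="hivclass")

# Share of all rows, missing included (same denominator as the post's code)
pct_all = hivclass.value_counts(sort=False) * 100 / len(hivclass)

# Share of rows with a known class; value_counts drops NaN by default
pct_valid = hivclass.value_counts(normalize=True, sort=False) * 100

print(pct_all.round(2))
print(pct_valid.round(2))
```

The first series does not sum to 100% because the missing rows stay in the denominator, while the second one does. Which one to report depends on whether the statement is about all countries or only those with data.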