### Exploratory data analysis with Python and SAS

In this post, I share my first Python program. As mentioned in my previous post, I am using the GapMinder dataset provided by the Coursera course Data Management and Visualization.

I ran frequency distributions for three variables: 2009 estimated HIV prevalence (%), 2009 democracy score (Polity), and 2008 urban population (% of total). The task was completed with the following code.

In Python:

"""
Python program using pandas to perform an exploratory analysis of the GapMinder data
provided through Coursera
"""

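import pandas
import numpy

# Load the data. Assumption: gapminder.csv (the file name used in the SAS step)
# sits in the working directory.
gapmind = pandas.read_csv('gapminder.csv', low_memory=False)

# Assumption: blank cells mean "missing", so coerce the three variables to numeric.
for col in ['hivrate', 'polityscore', 'urbanrate']:
    gapmind[col] = pandas.to_numeric(gapmind[col], errors='coerce')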

print(len(gapmind)) #Display the number of rows in the dataset
print(len(gapmind.columns)) #Display the number of columns in the dataset

# Compute frequency counts for HIV rate and polity score

print('non-sorted counts for HIV rate')
c1 = gapmind['hivrate'].value_counts(sort=False)  # counts, not percentages
print(c1)
print('non-sorted counts for polity score')
c2 = gapmind['polityscore'].value_counts(sort=False)
print(c2)

print('non-sorted frequency of HIV rate')
c3 = gapmind['hivrate'].value_counts(sort=False, normalize=True)  # proportions (multiply by 100 for %)
print(c3)
print('non-sorted frequency of polity score')
c4 = gapmind['polityscore'].value_counts(sort=False, normalize=True)
print(c4)

print('sorted counts for HIV rate')
p1 = gapmind['hivrate'].value_counts(sort=True, dropna=True)  # counts, most frequent first
print(p1)
print('sorted counts for polity score')
p2 = gapmind['polityscore'].value_counts(sort=True, dropna=True)
print(p2)

print('sorted frequency of HIV rate')
p3 = gapmind['hivrate'].value_counts(sort=True, normalize=True, dropna=True)  # proportions, most frequent first
print(p3)
print('sorted frequency of polity score')
p4 = gapmind['polityscore'].value_counts(sort=True, normalize=True, dropna=True)
print(p4)

# Frequency distributions using the groupby function
# Note: the percentages below divide by len(gapmind), i.e. all rows,
# including rows where the variable is missing.
print('Frequency of HIV rate')
chivr = gapmind.groupby('hivrate').size()
print(chivr)
phivr = gapmind.groupby('hivrate').size() * 100 / len(gapmind)
print(phivr)

print('Frequency of Polity score')
cpol = gapmind.groupby('polityscore').size()
print(cpol)
ppol = gapmind.groupby('polityscore').size() * 100 / len(gapmind)
print(ppol)

print('Frequency of Urban rate')
curb = gapmind.groupby('urbanrate').size()
print(curb)
purb = gapmind.groupby('urbanrate').size() * 100 / len(gapmind)
print(purb)
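
One detail worth checking in the Python code is the denominator. `value_counts(normalize=True)` divides by the number of non-missing values, while `groupby(...).size() * 100 / len(...)` divides by all rows, so the two approaches give different percentages whenever a column has missing values. Here is a minimal sketch on a made-up column (toy values, not GapMinder data):

```python
import numpy as np
import pandas as pd

# Toy data: 5 rows, 1 of them missing (hypothetical, not from GapMinder)
toy = pd.DataFrame({'hivrate': [0.1, 0.1, 0.06, 5.0, np.nan]})

# Denominator = non-missing rows (4): NaN is dropped by default
pct_valid = toy['hivrate'].value_counts(normalize=True) * 100

# Denominator = all rows (5): groupby skips the NaN group,
# but len(toy) still counts the row
pct_all = toy.groupby('hivrate').size() * 100 / len(toy)

print(pct_valid.loc[0.1])  # share among non-missing rows
print(pct_all.loc[0.1])    # share among all rows
```

If the goal is to match the SAS percentages, the first form is probably the closer one, since PROC FREQ leaves missing values out of the table by default.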

In SAS:

PROC IMPORT OUT= WORK.gapmind
DATAFILE= " ….Google Drive\Data Science\Python Codes\gapminder.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;

Title 'Frequency of HIV rate';
proc freq data=gapmind order=freq;
table HIVrate;
run;

Title 'Frequency of polityscore';
proc freq data=gapmind order=freq;
table polityscore;
run;

Title 'Frequency of urbanrate';
proc freq data=gapmind order=freq;
table urbanrate;
run;
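
For comparison, a table similar to what PROC FREQ prints (frequency, percent, and cumulative columns, ordered by descending frequency) can be approximated in pandas. This is a rough sketch rather than an exact reproduction of the SAS output; the helper name `freq_table` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def freq_table(series):
    """Approximate a one-way PROC FREQ table with ORDER=FREQ (hypothetical helper)."""
    # Most frequent value first; missing values are left out of the table
    counts = series.value_counts(sort=True, dropna=True)
    table = pd.DataFrame({'Frequency': counts})
    table['Percent'] = counts / counts.sum() * 100
    table['Cumulative Frequency'] = counts.cumsum()
    table['Cumulative Percent'] = table['Percent'].cumsum()
    return table

# Toy polity scores with one missing value (not GapMinder data)
toy_polity = pd.Series([10, 10, 10, 5, -6, np.nan])
polity_table = freq_table(toy_polity)
print(polity_table)
```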

The SAS code produced the following output:

Figure 1: Frequency outputs of HIV rate, Polity score and Urban rate.

To summarize, this code tells us the following about HIV rate, polity score, and urban rate. HIV rates range from 0.06% to 25.92%. Polity scores range from -10 to 10, and urban rates range from 10.40% to 100.00%.
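
On the Python side, ranges like these can be read off directly with `min` and `max` instead of scanning the frequency tables. A small sketch, where the toy frame stands in for `gapmind` and its values are placeholders:

```python
import numpy as np
import pandas as pd

# Placeholder values standing in for the real columns (not GapMinder data)
toy = pd.DataFrame({
    'hivrate': [1.0, 2.5, np.nan, 7.0],
    'polityscore': [-3, 8, 4, np.nan],
    'urbanrate': [20.0, 90.0, 55.0, 40.0],
})

# One row per statistic, one column per variable; NaN is skipped by default
ranges = toy[['hivrate', 'polityscore', 'urbanrate']].agg(['min', 'max'])
print(ranges)
```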

Among countries with non-missing values, 19.05% have an HIV rate of about 0.10%, 10.88% have a rate of about 0.06%, and 0.68% have a rate of about 5%.
20.50%, 4.35%, 1.86%, and 1.24% of the world's countries have polity scores of 10, 5, -6, and -10, respectively. Six countries (2.96%) have an urban population of 100%, and the values 27.84%, 36.84%, 61.34%, and 65.58% are each shared by two countries (0.99%). Data are missing for 66 (30.99%), 52 (24.41%), and 10 (4.69%) countries for HIV rate, polity score, and urban rate, respectively.
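
The missing-value counts can also be checked in pandas. The sketch below assumes that blanks in the CSV arrive as whitespace strings, which I have not verified against the file, and converts them to NaN before counting. The data are made up.

```python
import pandas as pd

# Toy columns with blanks stored as strings (assumed layout, not GapMinder data)
toy = pd.DataFrame({
    'hivrate': ['0.1', ' ', '0.06', ' '],
    'urbanrate': ['10.4', '100', ' ', '55'],
})

# Anything that cannot be parsed as a number becomes NaN
numeric = toy.apply(pd.to_numeric, errors='coerce')

n_missing = numeric.isna().sum()              # missing count per column
pct_missing = n_missing * 100 / len(numeric)  # share of all rows
summary = pd.DataFrame({'n_missing': n_missing, 'pct_missing': pct_missing})
print(summary)
```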