Machine Learning: Building a Random Forest with Python

Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting regular smoking among adolescents, a binary categorical response variable. The following explanatory variables were included as possible contributors to the random forest evaluating the response variable: gender, age, race/ethnicity (Hispanic, White, Black, Native American, and Asian), alcohol use, alcohol problems, marijuana use, cocaine use, inhalant use, availability of cigarettes in the home, depression, and self-esteem.

The Python code:

# -*- coding: utf-8 -*-
"""
Created on Sun Feb 7 17:02:32 2016

@author: DEGNINOU
"""
#%%
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

#os.chdir(r"C:\TREES")

#%% Load the dataset
AH_data = pd.read_csv("tree_addhealth.csv")
data_clean = AH_data.dropna()

print(data_clean.dtypes)
print(data_clean.describe())
#%% Split into training and testing sets
predictors = data_clean[['BIO_SEX', 'age', 'ALCEVR1', 'ALCPROBS1', 'marever1',
                         'cocever1', 'inhever1', 'cigavail', 'DEP1', 'ESTEEM1']]

targets = data_clean.TREG1

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)
#%% Build model on training data
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=18)
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))
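# Sketch (not in the original script): the classification_report imported above
# can also be printed for per-class precision and recall on the same predictions.
print(classification_report(tar_test, predictions))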
#%%

# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)
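# Sketch (not in the original script): pair each importance score with its
# column name so the printout is easier to read; reuses the predictors
# DataFrame defined above.
importances = Series(model.feature_importances_, index=predictors.columns)
print(importances.sort_values(ascending=False))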

#%%
"""
Run the random forest with different numbers of trees (1 to 25) and see the
effect on the accuracy of the prediction.
"""

trees = range(1, 26)
accuracy = np.zeros(25)

# Grow a forest with 1 to 25 trees and record the test-set accuracy of each
for idx, n_trees in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n_trees)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
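# Optional additions (not in the original script): label the axes and call
# show() so the figure displays when the script is run outside an interactive
# console.
plt.xlabel('Number of trees')
plt.ylabel('Test-set accuracy')
plt.show()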
#%%

The output:

[Output screenshots: confusion matrix, accuracy score, and feature importance scores]

Assessment of prediction accuracy when growing 1 to 25 trees (see figure below):

[Figure: accuracy of the random forest as a function of the number of trees]

The explanatory variables with the highest relative importance scores were age, marijuana use, depression, and self-esteem. The accuracy of the random forest (with 18 estimators) was 81.42%. Growing multiple trees rather than a single tree added little to the overall accuracy of the model, suggesting that interpretation of a single decision tree may be appropriate.
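As a quick check of that last point, a single decision tree can be fitted on the same training split and its test accuracy compared with the forest's. The snippet below is a minimal sketch that reuses the DecisionTreeClassifier import and the train/test variables already defined in the script above.

# Sketch: compare a single decision tree with the random forest on the same split
tree_model = DecisionTreeClassifier()
tree_model = tree_model.fit(pred_train, tar_train)
tree_predictions = tree_model.predict(pred_test)
print(sklearn.metrics.accuracy_score(tar_test, tree_predictions))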
