Machine Learning: Growing a Decision Tree with Python

Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. The data were split into a training sample and a test sample at a ratio of 60/40 (test_size=.4 in the code). For the present analyses, the maximum number of leaf (terminal) nodes was limited to 5.
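As a rough illustration of what a 60/40 split means, the sketch below shuffles row indices and holds out 40% of them, using only the standard library. The function name holdout_split, the seed, and the row count of 1000 are hypothetical choices made for this example; scikit-learn's train_test_split does the equivalent job on the real data and may differ in details such as rounding.

```python
import random

def holdout_split(n_rows, test_fraction=0.4, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing.

    Hypothetical helper for illustration only, not scikit-learn's implementation.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_test = round(n_rows * test_fraction)    # size of the held-out test sample
    return indices[n_test:], indices[:n_test]

# 1000 rows is an assumed size, chosen so the proportions are easy to read.
train_idx, test_idx = holdout_split(1000)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two samples, so the model is evaluated on cases it never saw during training.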

The following explanatory variables were included as possible contributors to a classification tree model evaluating regular smoking (the response variable): gender, age, race/ethnicity (Hispanic, White, Black, Native American, and Asian), alcohol use, alcohol problems, marijuana use, cocaine use, inhalant use, availability of cigarettes in the home, depression, and self-esteem.

Here is the Python code written for this purpose:

# -*- coding: utf-8 -*-
"""
Created on Sun Feb 7 17:02:32 2016

@author: DEGNINOU
"""
#%%
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
#sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
#%%
#os.chdir('C:/Users/DEGNINOU/Google Drive/Data Science/TREES')

"""
Data Engineering and Analysis
"""
#Load the dataset

AH_data = pd.read_csv("tree_addhealth.csv")

data_clean = AH_data.dropna()

#Inspect column types and summary statistics
print(data_clean.dtypes)
print(data_clean.describe())

#%%
"""
Modeling and Prediction
"""
#Split into training and testing sets

predictors = data_clean[['BIO_SEX','age','ALCEVR1','ALCPROBS1','marever1',
                         'cocever1','inhever1','cigavail','DEP1','ESTEEM1']]

targets = data_clean.TREG1

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

#Check the sizes of the training and test samples
print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)
#%%
#Build model on training data
classifier=DecisionTreeClassifier(max_leaf_nodes=5)
classifier=classifier.fit(pred_train,tar_train)

predictions=classifier.predict(pred_test)

#Evaluate the predictions on the test sample
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))
#%%
#Displaying the decision tree
from sklearn import tree
from io import StringIO  #Python 3; on Python 2 use: from StringIO import StringIO
from IPython.display import Image
import pydotplus

#Export the fitted tree in Graphviz DOT format and render it
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
#Save the rendered tree as a PNG file
with open('DecisionTree102.png', 'wb') as f:
    f.write(graph.create_png())
#%%

The output:

[Images: Capture, Capture2]
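To make the two reported metrics easier to read, here is a minimal sketch of how a binary confusion matrix and the accuracy score relate to each other. The labels are made up for illustration and are not the study's results; the helper names confusion_counts and accuracy are hypothetical. The layout follows scikit-learn's convention of rows for the actual class and columns for the predicted class.

```python
def confusion_counts(actual, predicted):
    """Return [[TN, FP], [FN, TP]]: rows are actual classes, columns predicted."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def accuracy(matrix):
    """Share of cases on the diagonal, i.e. classified correctly."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labels (1 = regular smoker, 0 = not), NOT the study's data.
actual    = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
predicted = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

cm = confusion_counts(actual, predicted)
print(cm)            # [[6, 1], [1, 2]]
print(accuracy(cm))  # 0.8
```

The accuracy score is simply the diagonal of the confusion matrix divided by the number of test cases, so the two outputs should always be consistent with each other.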

The decision tree generated from the code:

[Image: DecisionTree102]

Marijuana use was the first variable to separate the sample into two subgroups. Of the 2745 adolescents in the training sample, 2092 reported no marijuana use and 653 reported marijuana use. Among those who reported no marijuana use, a further split was made on alcohol use: 1199 reported no alcohol use and 893 reported alcohol use. Of the 1199 who used neither, 1162 were not regular smokers and 37 were regular smokers. Alcohol users were split into two groups by alcohol problems: 680 had no alcohol problems, of whom 83 were regular smokers, and 213 had alcohol problems, of whom 54 were regular smokers.
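The counts in this branch are easier to compare as proportions. The short sketch below recomputes the share of regular smokers in each of the three terminal nodes, using only the figures quoted above; the dictionary labels are descriptive names chosen for this example.

```python
# (group size, regular smokers) for each terminal node among adolescents
# who reported no marijuana use, as quoted in the text.
leaves = {
    "no alcohol use": (1199, 37),
    "alcohol use, no alcohol problems": (680, 83),
    "alcohol use, alcohol problems": (213, 54),
}

# The three groups should account for all 2092 non-users of marijuana.
assert sum(size for size, _ in leaves.values()) == 2092

rates = {name: smokers / size for name, (size, smokers) in leaves.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

The share of regular smokers rises from about 3% among adolescents who used neither substance to about 12% among alcohol users without alcohol problems and about 25% among those with alcohol problems.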

The subsample of 653 adolescents who reported marijuana use was split into two groups by availability of cigarettes in the home. Of the 391 who had no cigarettes available at home, 156 were regular smokers; of the 262 who had cigarettes available at home, 150 were regular smokers.
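One way to see why marijuana use ends up at the top of the tree is to compute how much the split reduces Gini impurity, which is the default splitting criterion of DecisionTreeClassifier. The sketch below assumes that default was in effect (the code does not override it) and derives the class counts by summing the terminal-node figures quoted in the text; the function name gini is a helper written for this example.

```python
def gini(n_total, n_positive):
    """Gini impurity of a binary node: 1 - p^2 - (1 - p)^2."""
    p = n_positive / n_total
    return 1.0 - p ** 2 - (1.0 - p) ** 2

# (group size, regular smokers), summed from the counts quoted in the text.
no_marijuana = (2092, 37 + 83 + 54)   # 174 regular smokers
marijuana = (653, 156 + 150)          # 306 regular smokers
root = (no_marijuana[0] + marijuana[0], no_marijuana[1] + marijuana[1])

# Impurity after the split is the size-weighted average of the two child nodes.
weighted_children = sum(
    size / root[0] * gini(size, smokers)
    for size, smokers in (no_marijuana, marijuana)
)
decrease = gini(*root) - weighted_children

print(f"root impurity:       {gini(*root):.3f}")
print(f"impurity after split: {weighted_children:.3f}")
print(f"decrease:            {decrease:.3f}")
```

A positive decrease means the two subgroups are each more homogeneous with respect to regular smoking than the undivided sample. The tree-growing algorithm picks, at each step, the split with the largest such decrease among the candidates it evaluates.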
