Machine Learning: k-means cluster analysis with Python

A k-means cluster analysis was conducted to identify underlying subgroups of adolescents based on the similarity of their responses on 18 variables representing characteristics that could have an impact on adolescents' self-esteem. Clustering variables included gender, ethnicity (Hispanic, White, Black, Native American, Asian), age, two binary variables measuring whether or not the adolescent had ever used alcohol or marijuana, as well as quantitative variables measuring alcohol problems, a scale measuring engagement in deviant behaviors (such as vandalism, other property damage, lying, stealing, running away, driving without permission, selling drugs, and skipping school), and scales measuring violence, depression, parental presence, parental activities, family connectedness, school performance (GPA1), and school connectedness. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
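The standardization step can be sketched as follows. This is a minimal, dependency-free illustration and not the author's script: the function name `standardize`, the toy `ages` values, and the choice of the population standard deviation are all assumptions made for the example.

```python
import statistics

def standardize(column):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mean = statistics.fmean(column)
    # Population SD is an assumption; the post does not say which SD was used.
    sd = statistics.pstdev(column)
    if sd == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / sd for x in column]

# Hypothetical toy values standing in for one clustering variable (e.g. age).
ages = [13.0, 15.0, 16.0, 18.0, 14.0]
z_ages = standardize(ages)
```

Standardizing matters here because k-means relies on Euclidean distance, so a variable measured on a wide scale would otherwise dominate the binary indicators.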

The Python script written for the purpose is accessible here.

Data were randomly split into a training set that included 70% of the observations (N=3202) and a test set that included 30% of the observations (N=1373). A series of k-means cluster analyses was conducted on the training data, specifying k = 1 to 5 clusters and using Euclidean distance. The variance in the clustering variables accounted for by the clusters (r-square) was plotted for each of the five cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
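The procedure described above (a 70/30 split, k-means for k = 1 to 5, and r-square per solution) can be sketched in plain Python. This is an illustrative sketch only, not the author's script: the synthetic two-variable data, the seed, the helper names (`sq_dist`, `centroid`, `kmeans`, `r_square`), and the use of a single random start per k are all assumptions. R-square is computed here as 1 minus the within-cluster sum of squares divided by the total sum of squares.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, rng, n_iter=100):
    """Lloyd's algorithm with one random start; returns (centers, labels)."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = centroid(members)
    return centers, labels

def r_square(points, centers, labels):
    """Share of total variance accounted for by the clusters."""
    grand = centroid(points)
    total = sum(sq_dist(p, grand) for p in points)
    within = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return 1.0 - within / total

# Synthetic stand-in data (assumption): three Gaussian blobs in two variables.
rng = random.Random(0)
blob_centers = [(-2.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
data = [[rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)]
        for cx, cy in blob_centers for _ in range(50)]

# Random 70/30 split into training and test sets.
rng.shuffle(data)
cut = len(data) * 7 // 10
train, test = data[:cut], data[cut:]

# Fit k = 1..5 on the training data and record r-square for the elbow curve.
r2_by_k = {}
for k in range(1, 6):
    centers, labels = kmeans(train, k, rng)
    r2_by_k[k] = r_square(train, centers, labels)
```

Plotting `r2_by_k` against k gives the elbow curve. With a single random start the curve is not guaranteed to rise smoothly, so a real analysis would typically rerun each k from several starts and keep the best solution.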

Figure 1. Elbow curve of r-square values for the five cluster solutions


The elbow curve was inconclusive, suggesting that either the 2-cluster or the 3-cluster solution might be interpreted. The results below are for an interpretation of the 3-cluster solution.

Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.


Canonical discriminant analysis was used to reduce the 18 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (Figure 2) indicated that the observations in all 3 clusters were densely packed, with high within-cluster variance, and overlapped considerably with one another. Observations in the second cluster (in blue) were slightly more spread out. The results of this plot suggest that 3 clusters is an acceptable cluster solution for our analysis.



