A lasso regression analysis was conducted to identify the subset of variables, from a pool of 23 categorical and quantitative predictor variables, that best predicted a quantitative response variable measuring adolescents’ grade point average (GPA). Categorical predictors included gender and a series of 5 binary variables for race and ethnicity (Hispanic, White, Black, Native American and Asian), coded as separate indicators to improve interpretability of the selected model with fewer predictors. Binary substance use variables were measured with individual questions about whether the adolescent had ever used alcohol, marijuana, cocaine or inhalants. Additional categorical variables included the availability of cigarettes in the home, whether or not either parent was on public assistance, and any experience of being expelled from school. Quantitative predictor variables included age, alcohol problems, and a measure of deviance covering such behaviors as vandalism, other property damage, lying, stealing, running away, driving without permission, selling drugs, and skipping school. Scales for violence and depression, and others measuring self-esteem, parental presence, parental activities, family connectedness and school connectedness, were also included. All predictor variables were standardized to have a mean of zero and a standard deviation of one.
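Standardization matters here because the lasso penalty is applied to all coefficients equally, so predictors measured on larger scales would otherwise be penalized differently. The following is a minimal sketch of that step, assuming the predictors sit in a numeric array; the `standardize` helper and the toy data are hypothetical and are not taken from the original analysis code.

```python
import numpy as np

def standardize(X):
    """Return a copy of X with each column rescaled to mean 0 and SD 1."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)      # population SD (ddof=0)
    std[std == 0] = 1.0      # constant columns become all zeros instead of dividing by zero
    return (X - mean) / std

# Hypothetical toy data: one quantitative and one binary predictor.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(16, 1.5, 200),   # e.g. an age-like variable
    rng.integers(0, 2, 200),    # e.g. a 0/1 indicator
])
Z_demo = standardize(X_demo)
```

After this transformation, every column of `Z_demo` has mean zero and standard deviation one, so the resulting coefficients can be compared by magnitude.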
Data were randomly split into a training set that included 70% of the observations (N=3201) and a test set that included 30% of the observations (N=1701). The least angle regression algorithm with 10-fold cross-validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross-validation mean squared error at each step was used to identify the best subset of predictor variables.
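The procedure described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the code used for the analysis: the data are synthetic stand-ins (23 already-standardized predictors and a made-up response), and the random seeds and variable names are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 23 standardized predictors,
# of which only the first five truly affect the response.
rng = np.random.default_rng(123)
n, p = 500, 23
X = rng.standard_normal((n, p))
true_coef = np.zeros(p)
true_coef[:5] = [0.5, -0.4, 0.3, -0.2, 0.1]
y = X @ true_coef + rng.standard_normal(n)

# Random 70% / 30% split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

# Lasso fitted by least angle regression, penalty chosen by 10-fold CV.
model = LassoLarsCV(cv=10).fit(X_train, y_train)

# Mean squared error and R-squared on both sets.
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# Cross-validated MSE along the path: one row per candidate penalty,
# one column per fold; averaging over folds gives the validation curve.
mean_cv_mse = model.mse_path_.mean(axis=1)
```

Plotting `mean_cv_mse` against `model.cv_alphas_` gives a validation-error curve of the kind shown in Figure 1, and the individual columns of `model.mse_path_` give the per-fold curves of the kind shown in Figure 2.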
The Python code written for the purpose is accessible at the following link.
Output 1: Variable names and regression coefficients
Output 2: Training and test data MSE and R-squared
Figure 1: Change in the validation mean squared error at each step
Figure 2: Mean squared error on each fold
Of the 23 predictor variables, 22 were retained in the selected model. During the estimation process, Black ethnicity and school connectedness were most strongly associated with adolescents’ GPA, followed by gender, parental activities, and violent behavior. Engaging in violent behavior, being female, and being Black were negatively associated with GPA, while school connectedness and parental involvement in activities were positively associated with it. Other predictors associated with higher GPA included self-esteem, alcohol problems, being Asian, and inhalant use. Other predictors associated with lower GPA included marijuana use, being Hispanic, cigarette availability at home, older age, and parents being on public assistance. Together, the 22 selected predictor variables accounted for 21.8% of the variance in adolescents’ GPA.
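A summary like the one above is read off the fitted coefficients: a predictor is retained when its coefficient is nonzero, its sign gives the direction of association, and, because the predictors were standardized, its absolute size indicates relative strength. The sketch below shows one way to produce such a summary. The `summarize_coefficients` helper, the names, and the values are hypothetical and are not the coefficients from this analysis.

```python
import numpy as np

def summarize_coefficients(names, coefs):
    """Split lasso coefficients into retained (nonzero, sorted by |coef|) and dropped."""
    coefs = np.asarray(coefs, dtype=float)
    order = np.argsort(-np.abs(coefs))  # largest absolute coefficient first
    retained = [(names[i], float(coefs[i])) for i in order if coefs[i] != 0]
    dropped = [names[i] for i in range(len(names)) if coefs[i] == 0]
    return retained, dropped

# Made-up values for illustration only.
demo_names = ["x1", "x2", "x3", "x4"]
demo_coefs = [0.05, -0.30, 0.0, 0.12]

retained, dropped = summarize_coefficients(demo_names, demo_coefs)
for name, coef in retained:
    direction = "positive" if coef > 0 else "negative"
    print(f"{name}: {coef:+.2f} ({direction} association)")
print("dropped:", dropped)
```

With a fitted scikit-learn lasso model, the same helper could be called with the predictor column names and the model's `coef_` attribute.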