Multiple Regression and Regression Diagnostics with Python

Objective:

Fit a multiple regression model to identify indicators associated with breast cancer, and run regression diagnostics on that model. The indicators of interest are urbanization rate, life expectancy, CO2 emissions, income per person, alcohol consumption, and employment rate. The dependent variable is breast cancer rate, defined as the number of new breast cancer cases per 100,000 females in 2002.

Data management:

All indicators of interest were centered on their means.
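As a minimal sketch of what mean-centering looks like with pandas, consider the snippet below. The column names and the four rows of values are hypothetical stand-ins, not the dataset or variable names used in this analysis.

```python
import pandas as pd

# Hypothetical toy data: column names and values are assumptions for illustration
data = pd.DataFrame({
    "urbanrate": [30.0, 50.0, 70.0, 90.0],
    "incomeperperson": [500.0, 2000.0, 8000.0, 30000.0],
    "alcconsumption": [1.0, 4.0, 7.0, 12.0],
})

predictors = ["urbanrate", "incomeperperson", "alcconsumption"]

# Subtract each column's mean so that every centered predictor has mean 0
centered = data[predictors] - data[predictors].mean()
centered = centered.add_suffix("_c")

# Each centered column should now average to (numerically) zero
print(centered.mean().round(10))
```

With centered predictors, the intercept of the regression can be read as the expected breast cancer rate when every predictor is at its mean.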

Python code:

The Python code written to perform this analysis is accessible here.

Outputs:

Figure 1: Scatter plot of the association between residential electricity and breast cancer rate

Output 1: Univariate regression analysis of the association between urbanization rate and breast cancer rate.

Output 2: Multiple regression analysis of the association between urbanization rate, life expectancy, income, CO2 emissions, alcohol consumption, employment rate, and breast cancer rate.

Output 3: Multiple regression analysis of the association between income, alcohol consumption, and breast cancer rate. Only the two predictors that were significant in the previous model were kept in this model.

Regression diagnostics:

Regression diagnostics of the model assessing the association between income, alcohol consumption and breast cancer rate.

Figure 2: Regression diagnostics plots (q-q plot, standardized residuals plot, and leverage plot)

Comment:

Regression analysis:

The multiple regression analysis shows that urbanization rate, which was significantly associated with breast cancer rate in the univariate model, is no longer significant after controlling for potential confounding factors (life expectancy, income, CO2 emissions, alcohol consumption, and employment rate).

Life expectancy, CO2 emissions, and employment rate were also not significant in the multiple regression analysis (p-values > 0.05). This suggests that income and alcohol consumption confound the relationship between these three indicators and breast cancer rate. Only income and alcohol consumption were included in the last multiple regression model.

The analysis of the association between income, alcohol consumption and breast cancer rate shows that income (beta = 0.0014, p < 0.001) is significantly associated with breast cancer rate, after controlling for alcohol consumption. Also, alcohol consumption (beta = 1.4752, p < 0.001) is significantly associated with breast cancer rate after controlling for income. About 62% (R-squared = 0.622) of the variability in breast cancer rate is explained by income and alcohol consumption.
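To make the quantities in this paragraph concrete, here is a hedged sketch of how the coefficients and R-squared of a two-predictor linear model can be computed with NumPy least squares. The data are synthetic and generated on the spot, so the numbers it produces are not the results reported above, and the variable names are hypothetical.

```python
import numpy as np

# Synthetic data for illustration only (NOT the data analyzed in this post)
rng = np.random.default_rng(0)
n = 200
income_c = rng.normal(0, 1000, n)   # hypothetical centered income
alcohol_c = rng.normal(0, 5, n)     # hypothetical centered alcohol consumption
y = 37.0 + 0.0014 * income_c + 1.5 * alcohol_c + rng.normal(0, 2, n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), income_c, alcohol_c])

# Ordinary least squares estimates: [intercept, beta_income, beta_alcohol]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared = 1 - (residual sum of squares / total sum of squares)
fitted = X @ beta
resid = y - fitted
r_squared = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

print("coefficients:", beta)
print("R-squared:", round(float(r_squared), 3))
```

Each slope is the expected change in the response for a one-unit change in that predictor while the other predictor is held fixed, which is what "after controlling for" means in the text.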

Q-q plot:

Some residuals deviate from the straight reference line of the q-q plot, so the residuals are not perfectly normally distributed. Thus, the association observed in the scatter plot may not be fully captured by a linear model in income and alcohol consumption.
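A q-q plot pairs the sorted, standardized residuals with the quantiles a normal distribution would produce at the same ranks. The sketch below computes those pairs without plotting. The helper name `qq_points` and the plotting positions (i - 0.5) / n are choices made here for illustration, and the residuals are simulated.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical normal quantiles, sorted standardized residuals)."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    # Standardize so the sample is comparable with the standard normal
    z = (r - r.mean()) / r.std(ddof=1)
    # Plotting positions (i - 0.5) / n, one common convention (an assumption here)
    probs = (np.arange(1, n + 1) - 0.5) / n
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theo, z

# Simulated residuals standing in for the model's residuals
rng = np.random.default_rng(1)
theo, sample = qq_points(rng.normal(size=500))

# Points close to the line y = x indicate approximately normal residuals
max_gap = float(np.max(np.abs(sample - theo)))
print("largest distance from the reference line:", round(max_gap, 3))
```

Large gaps concentrated at the two ends of the sorted residuals correspond to the tail departures described above.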

Standardized residuals for all observations:

Most standardized residuals fall within about 1 standard deviation of the mean. However, more than 5% of them lie beyond 2 standard deviations. Two (2) extreme outliers are observed. This shows that the model fit is relatively poor and could be improved.
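If the errors were normal, roughly 95% of standardized residuals would fall between -2 and 2, so the share outside that range is a quick check. The sketch below assumes that "standardized" means each residual divided by the residual standard error. The function name is hypothetical and the residuals are simulated.

```python
import numpy as np

def standardized_residuals(resid, n_params):
    """Divide residuals by the residual standard error (an assumed definition)."""
    resid = np.asarray(resid, dtype=float)
    dof = resid.size - n_params            # residual degrees of freedom
    sigma = np.sqrt(resid @ resid / dof)   # residual standard error
    return resid / sigma

# Simulated residuals standing in for those of a model with
# an intercept and two predictors (3 estimated parameters)
rng = np.random.default_rng(2)
resid = rng.normal(0, 3, 150)
std_res = standardized_residuals(resid, n_params=3)

# Share of observations beyond 2 and 2.5 standard deviations
share_beyond_2 = float(np.mean(np.abs(std_res) > 2))
share_beyond_2_5 = float(np.mean(np.abs(std_res) > 2.5))
print("share with |residual| > 2:  ", share_beyond_2)
print("share with |residual| > 2.5:", share_beyond_2_5)
```

A share well above 5% beyond 2, or more than about 1% beyond 2.5, is the pattern the paragraph above reads as evidence of a relatively poor fit.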

Leverage plot:

Some observations have residuals outside the -2 to 2 range and leverage higher than the average leverage. This is consistent with the results of the standardized residuals plot.
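Leverage measures how far an observation's predictor values sit from the rest, and it equals the corresponding diagonal entry of the hat matrix H = X (X'X)^-1 X'. The following sketch computes leverage and internally studentized residuals with NumPy, then flags observations that combine a large residual with above-average leverage. The data are synthetic, and the cutoffs of 2 and the average leverage simply mirror the reading of the plot above.

```python
import numpy as np

# Synthetic design matrix (intercept + two predictors) and response
rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
p = X.shape[1]

# Leverage = diagonal of the hat matrix, computed stably from a QR decomposition
Q, _ = np.linalg.qr(X)
leverage = np.sum(Q ** 2, axis=1)
mean_leverage = leverage.mean()   # always equals p / n

# Internally studentized residuals: resid / (sigma * sqrt(1 - leverage))
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma = np.sqrt(resid @ resid / (n - p))
student = resid / (sigma * np.sqrt(1 - leverage))

# Observations with a large residual AND above-average leverage
flagged = np.where((np.abs(student) > 2) & (leverage > mean_leverage))[0]
print("average leverage:", round(float(mean_leverage), 4))
print("flagged observation indices:", flagged)
```

Observations flagged this way are the ones most able to pull the fitted coefficients, so they are worth inspecting individually before the model is revised.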
