Python - Simple Linear Regression¶
Linear regression is the most widely used statistical technique to describe and assess the relationship between variables. Simple linear regression is a method that evaluates the linear relationship between two quantitative variables, quantifying the connection and assessing its significance.
Simple linear regression is an asymmetric method where:
One variable is treated as the response (or dependent) variable, which is the variable being explained. It is represented on the y-axis.
The other variable serves as the explanatory (or independent) variable, represented on the x-axis.
Simple linear regression helps determine whether a linear relationship exists between two variables and measures the strength of this relationship. It specifically tests whether the variables are linearly related, making linearity a key assumption in this analysis.
The strength of linear regression lies in its ability to quantify how much the response variable changes when the explanatory variable increases by one unit.
This concept is crucial in linear regression and addresses questions such as:
- Is there a relationship between advertising spending and sales during a specific time frame?
- Does the number of years of education translate into higher earnings in a first job?
- Will raising tobacco taxes lead to reduced consumption?
- How does the price of an apartment vary with its size?
- Does reaction time to a stimulus differ based on gender?
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
URL = "https://raw.githubusercontent.com/YuriiHamii/python_data_analysis/main/datasets/pandas_practise/auto_lmr.csv"
df = pd.read_csv(URL)
df = df.replace([np.inf, -np.inf], np.nan)
df.head()
| | MODELYEAR | MAKE | MODEL | VEHICLECLASS | ENGINESIZE | CYLINDERS | TRANSMISSION | FUELTYPE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | ACURA | ILX | COMPACT | 2.0 | 4 | AS5 | Z | 9.9 | 6.7 | 8.5 | 33 | 196 |
| 1 | 2014 | ACURA | ILX | COMPACT | 2.4 | 4 | M6 | Z | 11.2 | 7.7 | 9.6 | 29 | 221 |
| 2 | 2014 | ACURA | ILX HYBRID | COMPACT | 1.5 | 4 | AV7 | Z | 6.0 | 5.8 | 5.9 | 48 | 136 |
| 3 | 2014 | ACURA | MDX 4WD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.7 | 9.1 | 11.1 | 25 | 255 |
| 4 | 2014 | ACURA | RDX AWD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.1 | 8.7 | 10.6 | 27 | 244 |
We have downloaded a fuel consumption dataset, auto_lmr.csv, which contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in the USA.
- MODELYEAR e.g. 2014
- MAKE e.g. Acura
- MODEL e.g. ILX
- VEHICLE CLASS e.g. SUV
- ENGINE SIZE e.g. 4.7
- CYLINDERS e.g. 6
- TRANSMISSION e.g. A6
- FUEL CONSUMPTION in CITY(L/100 km) e.g. 9.9
- FUEL CONSUMPTION in HWY (L/100 km) e.g. 8.9
- FUEL CONSUMPTION COMB (L/100 km) e.g. 9.2
- CO2 EMISSIONS (g/km) e.g. 182
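Before fitting anything, it helps to take a quick numeric look at the two columns we will use, ENGINESIZE and CO2EMISSIONS. A minimal sketch using standard pandas methods (assuming df has been loaded as above):
cols = ["ENGINESIZE", "CO2EMISSIONS"]
print(df[cols].describe())   # count, mean, std, min/max and quartiles for both columns
print(df[cols].corr())       # Pearson correlation between engine size and CO2 emissions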
Linear regression with a single explanatory variable¶
There are several ways to perform linear regression in Python; here we use the Statsmodels library. While it offers more features than we require, it delivers well-organized results, much like SAS Enterprise Guide.
The method we'll use to build linear regression models in the Statsmodels library is OLS(), which stands for "ordinary least squares." This technique finds the best-fitting line by minimizing the squared differences between the observed and predicted values, known as residuals. The term "ordinary" suggests that this form of linear regression is just one of many approaches. Beyond this, there are various specialized regression techniques designed for different problems or situations.
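For a single explanatory variable, "minimizing the squared differences" can be written out explicitly: OLS picks the intercept $\beta_0$ and slope $\beta_1$ that minimize the sum of squared residuals, and in the simple-regression case the solution has a closed form.
$$\min_{\beta_0,\beta_1}\sum_{i=1}^{n}\left(y_i-\beta_0-\beta_1 x_i\right)^2,\qquad \hat{\beta}_1=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},\qquad \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}$$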
Preparing the data¶
Recall the general format of the linear regression equation: $Y = \beta_0 + \beta_1 X_1 + ... + \beta_n X_n$, where $Y$ is the value of the response variable and $X_i$ is the value of the explanatory variable(s).
When we consider this equation in matrix form, we realize that Y is a 1-dimensional matrix: it's essentially a single column (or array or vector) of numbers. In our context, this vector represents the carbon dioxide emissions for new light-duty vehicles for retail sale. On the right-hand side, we have a 2-dimensional matrix: one column for our X variable and another for the constant. While we typically don't think of the constant as a separate data column, the Statsmodels library does, which is why it comes up in this discussion.
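In matrix form, the simple regression equation looks like this, which is exactly why X needs a column of 1s next to the column of X values:
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \varepsilon$$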
Creating a linear regression model in Statsmodels thus requires the following steps:
- Import the Statsmodels library
- Define the Y and X matrices. This is optional, but it keeps the OLS() call easier to read
- Add a constant column to the X matrix
- Call OLS() to define the model
- Call fit() to actually estimate the model parameters using the data set (fit the line)
- Display the results
Let's start with the first three steps:
Y = df['CO2EMISSIONS']
X = df['ENGINESIZE']
print("Y: \n", Y.head())
print("X: \n", X.head())
Y:
0    196
1    221
2    136
3    255
4    244
Name: CO2EMISSIONS, dtype: int64
X:
0    2.0
1    2.4
2    1.5
3    3.5
4    3.5
Name: ENGINESIZE, dtype: float64
We see above that X is a single column of numbers (the engine size of each car). The numbers on the left are just the pandas index (every row in a pandas Series or DataFrame has a row number, or index).
Adding a column for the constant¶
We can add another column for the regression constant using the Statsmodels add_constant() method:
X = sm.add_constant(X)
X.head()
| | const | ENGINESIZE |
|---|---|---|
| 0 | 1.0 | 2.0 |
| 1 | 1.0 | 2.4 |
| 2 | 1.0 | 1.5 |
| 3 | 1.0 | 3.5 |
| 4 | 1.0 | 3.5 |
Notice the difference: the X matrix has been augmented with a column of 1s called "const". To see why, recall the point of linear regression: to use data to "learn" the parameters of the best-fit line and use the parameters to make predictions. The parameters of a line are its y-intercept and slope. Once we have the y-intercept and slope ($\beta_0$ and $\beta_1$ in the equation above or b and m), we can multiply them by the data in the X matrix to get a prediction for Y.
Written out in words for the first row of our data, we get: CO2EMISSIONS = $\beta_0 \times 1 + \beta_1 \times 2.0$
The "const" column simply provides a placeholder: a column of 1s to multiply the constant by. This is why we have to run add_constant().
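To make this concrete, here is a minimal sketch of that multiplication for the first row, plugging in the coefficient values that the fit below will produce (roughly 125.30 and 39.13):
# A prediction is just the dot product of [const, ENGINESIZE] with [beta_0, beta_1]
beta = np.array([125.3041, 39.1252])   # coefficients estimated in the next section
row0 = X.iloc[0].to_numpy()            # [1.0, 2.0]: const and ENGINESIZE for row 0
print(row0 @ beta)                     # ~203.55, the predicted CO2EMISSIONS for row 0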
Running the model¶
model = sm.OLS(Y, X, missing='drop')
model_result = model.fit()
model_result.summary()
| Dep. Variable: | CO2EMISSIONS | R-squared: | 0.764 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.764 |
| Method: | Least Squares | F-statistic: | 3451. |
| Date: | Sat, 14 Sep 2024 | Prob (F-statistic): | 0.00 |
| Time: | 14:39:30 | Log-Likelihood: | -5169.9 |
| No. Observations: | 1067 | AIC: | 1.034e+04 |
| Df Residuals: | 1065 | BIC: | 1.035e+04 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 125.3041 | 2.420 | 51.779 | 0.000 | 120.556 | 130.053 |
| ENGINESIZE | 39.1252 | 0.666 | 58.741 | 0.000 | 37.818 | 40.432 |

| Omnibus: | 27.459 | Durbin-Watson: | 0.862 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 54.194 |
| Skew: | 0.126 | Prob(JB): | 1.71e-12 |
| Kurtosis: | 4.075 | Cond. No. | 9.93 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This is the output of the Ordinary Least Squares (OLS) regression produced by the Statsmodels library. Here’s what each section represents:¶
General Information:
Dep. Variable: CO2EMISSIONS
The dependent variable, or the variable being predicted, is CO2EMISSIONS.
R-squared: 0.764
R-squared measures the proportion of the variance in the dependent variable (CO2EMISSIONS)
explained by the independent variable (ENGINESIZE). A value of 0.764 means that 76.4% of
the variance in CO2 emissions can be explained by the engine size in this model.
Higher values indicate better model fit.
Model: OLS
This shows that the model used for regression is Ordinary Least Squares.
Adj. R-squared: 0.764
Adjusted R-squared accounts for the number of predictors in the model.
With a single predictor (ENGINESIZE) and over a thousand observations, the adjustment
is negligible, which is why it is essentially identical to R-squared here.
Method: Least Squares
The method used to estimate the parameters of the model is the least squares method,
which minimizes the sum of squared residuals.
F-statistic: 3451
This statistic tests whether the model is statistically significant overall.
The larger the F-statistic, the more likely it is that at least one predictor
variable is statistically significant.
Prob (F-statistic): 0.00
This is the p-value associated with the F-statistic. A value of 0.00 (essentially zero)
means the model is statistically significant, meaning the relationship between engine size
and CO2 emissions is unlikely due to chance.
Date and Time: Sat, 14 Sep 2024, 14:39:30
The date and time the analysis was run.
Log-Likelihood: -5169.9
This is a measure of the model's goodness of fit. It's used to compare models,
with higher values indicating better fit. Lower (more negative) values suggest worse fit.
No. Observations: 1067
The total number of observations (data points) used in the regression analysis.
AIC: 1.034e+04 (10,340)
The Akaike Information Criterion (AIC) is used for model comparison.
Lower values indicate a better-fitting model, penalizing for the number of predictors.
BIC: 1.035e+04 (10,350)
The Bayesian Information Criterion (BIC) is similar to AIC but penalizes more heavily
for the number of predictors. Like AIC, lower values indicate a better-fitting model.
Df Residuals: 1065
The degrees of freedom of the residuals. This is the number of observations minus the
number of estimated parameters in the model (in this case, 1067 - 2).
Df Model: 1
The degrees of freedom in the model. This equals the number of predictors
(in this case, 1 — the variable ENGINESIZE).
Covariance Type: nonrobust
The type of covariance matrix used in the analysis. "Nonrobust" means the standard
errors are not adjusted for heteroscedasticity.
Coefficients Table:
const: 125.3041
This is the intercept of the regression line.
It represents the predicted value of CO2EMISSIONS when ENGINESIZE is zero.
In this case, the baseline CO2 emissions level is 125.30 g/km.
ENGINESIZE: 39.1252
This is the slope of the regression line. It indicates that for each one-litre increase
in engine size, predicted CO2 emissions increase by about 39.13 g/km.
Standard Error, t-statistic, and P-value:
For each coefficient (intercept and ENGINESIZE), we also have:
std err: The standard error of the coefficient estimates, which quantifies the
uncertainty of these estimates.
const: 2.420
ENGINESIZE: 0.666
t: The t-statistic for each coefficient, which is used to test whether the coefficient
is significantly different from zero.
const: 51.779
ENGINESIZE: 58.741
P>|t|: The p-value associated with the t-statistic. A small p-value (typically < 0.05)
suggests that the corresponding coefficient is statistically significant.
const: 0.000
ENGINESIZE: 0.000 Both values are statistically significant.
[0.025, 0.975]: These are the 95% confidence intervals for the coefficient estimates.
They give a range within which the true value of the coefficient is likely to fall with 95% confidence.
const: [120.556, 130.053]
ENGINESIZE: [37.818, 40.432]
Model Diagnostics:
Omnibus: 27.459
This is a test for the normality of the residuals. A significant value (p-value < 0.05)
indicates non-normality in the residuals.
Prob(Omnibus): 0.000
The p-value for the Omnibus test. In this case, it indicates that the residuals
are not normally distributed.
Jarque-Bera (JB): 54.194
Another test for normality of the residuals.
Similar to Omnibus, it tests for skewness and kurtosis.
Prob(JB): 1.71e-12
The p-value for the Jarque-Bera test, indicating non-normality in the residuals.
Skew: 0.126
This measures the asymmetry of the residual distribution.
A value close to 0 indicates a symmetric distribution.
Kurtosis: 4.075
This measures the "tailedness" of the residual distribution.
A kurtosis of 4.075 suggests heavier tails than a normal distribution.
Durbin-Watson: 0.862
This statistic tests for autocorrelation in the residuals.
A value close to 2 indicates no autocorrelation.
A value of 0.862 suggests positive autocorrelation in the residuals.
Cond. No.: 9.93
The condition number checks for multicollinearity and other numerical problems.
A low value such as 9.93 indicates no such problems; with a single predictor,
multicollinearity is not a concern in any case.
Summary:
This OLS regression model shows that engine size is a significant predictor of CO2 emissions, explaining about 76% of the variation in emissions. However, there may be some issues with residuals' normality and autocorrelation, which could affect the model's assumptions.
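Rather than reading these numbers off the summary table, we can also pull them from the results object directly. A short sketch using standard Statsmodels result attributes:
# Key quantities taken directly from the fitted results object
print(model_result.params)       # const ~125.30, ENGINESIZE ~39.13
print(model_result.rsquared)     # ~0.764
print(model_result.conf_int())   # 95% confidence intervals for both coefficients
print(model_result.pvalues)      # p-values for the t-tests on each coefficient
# Worked example of the slope: going from a 2.0 L to a 3.0 L engine adds ~39.13 g/km
b0, b1 = model_result.params
print((b0 + b1 * 3.0) - (b0 + b1 * 2.0))   # ~39.13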
Hint:¶
There is no missing data in this DataFrame, so the missing='drop' argument above is not strictly required. However, missing data is a fact of life in most data sets. The simplest way to handle it in linear regression is simply to censor (drop) all rows with missing data from the regression procedure.
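An equivalent, more explicit approach is to drop incomplete rows with pandas before fitting. This sketch is not needed for this data set, and the variable names are purely illustrative:
# Explicit alternative to missing='drop': remove rows with missing values first
df_clean = df.dropna(subset=["ENGINESIZE", "CO2EMISSIONS"])
X_clean = sm.add_constant(df_clean["ENGINESIZE"])
model_clean = sm.OLS(df_clean["CO2EMISSIONS"], X_clean).fit()
print(model_clean.params)   # identical coefficients here, since no rows are dropped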
Regression diagnostics¶
Statsmodels exposes the residuals; that is, it keeps an array containing the differences between the observed values of Y and the values predicted by the linear model. A fundamental assumption is that the residuals (or "errors") are random: some big, some small, some positive, some negative, but overall the errors are normally distributed around a mean of zero. Anything other than normally distributed residuals indicates a serious problem with the linear model.
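Before plotting, a quick numeric check (a minimal sketch) confirms that the residuals are there, one per observation, and that their mean is essentially zero:
resid = model_result.resid
print(len(resid))      # 1067 residuals, one per observation
print(resid.mean())    # essentially zero, up to floating-point noise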
Histogram of residuals¶
Plotting residuals in Seaborn is straightforward: we simply pass the histplot() function the array of residuals from the regression model.
sns.histplot(model_result.resid)
<Axes: ylabel='Count'>
A more helpful way to check for normality is by comparing the estimated density of our data to a normal curve. To do this, we create a normal curve that has the same average (mean) and spread (standard deviation) as our residuals and plot them together.
In Python, we can use a simple trick to get both values at once: SciPy's stats.norm.fit() function returns the mean and standard deviation of the best-fitting normal distribution.
mu, std = stats.norm.fit(model_result.resid)
mu, std
(1.4490543514657714e-14, 30.762217942126263)
Now we can re-plot the residuals as a kernel density plot and add the normal curve on top, using the same mean and standard deviation.
fig, ax = plt.subplots()
# plot the residuals
sns.histplot(x=model_result.resid, ax=ax, stat="density", linewidth=0, kde=True)
ax.set(title="Distribution of residuals", xlabel="residual")
# plot corresponding normal curve
xmin, xmax = plt.xlim() # the maximum x values from the histogram above
x = np.linspace(xmin, xmax, 100) # generate some x values
p = stats.norm.pdf(x, mu, std) # calculate the y values for the normal curve
sns.lineplot(x=x, y=p, color="orange", ax=ax)
plt.show()
Boxplot of residuals¶
A boxplot is usually more useful when the residuals are far from normal. In this case, we see a fairly good distribution where the mean is close to the median, showing that the data is fairly symmetric.
sns.boxplot(x=model_result.resid, showmeans=True);
Q-Q plot¶
A Q-Q plot is a more specialized diagnostic than a histogram or boxplot, so here it's easiest to use the built-in regression diagnostic plots from Statsmodels. While these plots aren't as visually appealing as Seaborn's, they are designed specifically for this kind of analysis.
sm.qqplot(model_result.resid, line='s');
Fit plot¶
A fit plot shows the predicted values of the response variable compared to the actual Y values. In a perfect linear regression model, the predicted values would match the observed ones exactly, and all data points in the scatterplot would line up along the 45° diagonal.
The fit plot provided by Statsmodels gives a rough sense of the quality of the model. Since the $R^{2}$ of this model is 0.764, it should come as no surprise that the fit is reasonably good, though far from perfect.
sm.graphics.plot_fit(model_result,1, vlines=False);
Fit plot in seaborn¶
Creating a more accurate fit plot takes a bit more effort. The key is making sure the observed and predicted axes are identical for the reference line to be at 45°. To do this, I follow these steps:
- Find the min and max values for the observed Y values.
- Predict the Y values.
- Plot the observed versus predicted Y values and save the plot to an object (like ax).
- Adjust the plot so that both axes have the same minimum and maximum values.
- Add the 45° reference line to the plot by generating the data for it.
model_result.fittedvalues
| | 0 |
|---|---|
| 0 | 203.554499 |
| 1 | 219.204579 |
| 2 | 183.991899 |
| 3 | 262.242299 |
| 4 | 262.242299 |
| ... | ... |
| 1062 | 242.679699 |
| 1063 | 250.504739 |
| 1064 | 242.679699 |
| 1065 | 250.504739 |
| 1066 | 250.504739 |

1067 rows × 1 columns
Y_max = Y.max()
Y_min = Y.min()
ax = sns.scatterplot(x=model_result.fittedvalues, y=Y)
ax.set(ylim=(Y_min, Y_max))
ax.set(xlim=(Y_min, Y_max))
ax.set_xlabel("Predicted value of CO2EMISSIONS")
ax.set_ylabel("Observed value of CO2EMISSIONS")
X_ref = Y_ref = np.linspace(Y_min, Y_max, 100)
plt.plot(X_ref, Y_ref, color='red', linewidth=1)
plt.show()
A simple linear regression using NumPy's polyfit function, which fits a polynomial to a set of data points.¶
In this case, the polynomial is a straight line (degree 1), which corresponds to linear regression.
X["ENGINESIZE"]: This represents the independent variable (the predictor or feature), which in this case is the engine size.
Y: This is the dependent variable (the target or response), which in our case is CO2EMISSIONS.
1: This specifies the degree of the polynomial to fit. Since it's 1, we are fitting a linear equation (a straight line).
slope: This is the coefficient of the X variable, representing the slope of the regression line. It indicates how much the dependent variable (Y) changes for each unit increase in the independent variable (X).
intercept: This is the constant term or y-intercept, representing the value of Y when the independent variable X is 0.
The Linear Regression Equation:¶
The result from np.polyfit allows you to express the relationship between X and Y using the equation of a straight line:
$y = \mathrm{slope} \times x + \mathrm{intercept}$¶
This equation describes the best-fit line that minimizes the differences between the predicted values (from the line) and the actual values of Y.¶
slope, intercept = np.polyfit(X["ENGINESIZE"], Y, 1)
print(f"Equation of a line: y = {intercept:.2f} + x*{slope:.2f}")
# Predict y (CO2EMISSIONS) for a new x (ENGINESIZE) value
x_new = 4
y_pred = slope * x_new + intercept
print(f"Prediction values y by x = {x_new}: {y_pred:.2f}")
pred_point = y_pred
plt.scatter(x_new, pred_point, color='orange', zorder=7)
plt.text(x_new, pred_point, f'y_pred={y_pred:.2f}', fontsize=12, ha='right', color='orange')
# plot
plt.scatter(X["ENGINESIZE"], Y, color='blue', label='data')
plt.plot(X["ENGINESIZE"], slope * X["ENGINESIZE"] + intercept, color='red', label=f'y = {slope:.2f} * x + {intercept:.2f}')
plt.xlabel('x - ENGINESIZE')
plt.ylabel('y - CO2EMISSIONS')
plt.title('Linear Regression')
plt.legend()
plt.grid(True)
plt.show()
Equation of a line: y = 125.30 + x*39.13
Prediction values y by x = 4: 281.80
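As a final sanity check (a quick sketch), the polyfit coefficients should agree with the Statsmodels fit, and so should the prediction for ENGINESIZE = 4:
# Both routines minimize the same least-squares criterion, so they should agree
print(np.allclose([intercept, slope], model_result.params.values))   # expect True
print(model_result.predict(np.array([[1.0, 4.0]])))                  # ~281.80, matches y_pred above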