Multiple Linear Regression: Implementation in R and Python
Introduction to Multiple Linear Regression
Multiple linear regression is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. It extends simple linear regression and is used when we want to predict the value of one variable based on the values of two or more others.
The multiple linear regression model can be represented as:

Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε
Where:
- Y is the dependent variable
- X₁, X₂, ..., Xₚ are the independent variables
- β₀ is the y-intercept (constant term)
- β₁, β₂, ..., βₚ are the coefficients for each independent variable
- ε is the error term
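To make the formula concrete, ordinary least squares chooses the coefficients that minimize the squared errors between Y and the model's predictions, and this estimate has a closed form. Below is a minimal NumPy sketch of that idea; the simulated data and variable names are illustrative only:

import numpy as np

# Simulate n observations with two predictors and known coefficients
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([
    np.ones(n),          # column of ones for the intercept β₀
    rng.normal(size=n),  # X₁
    rng.normal(size=n)   # X₂
])
beta_true = np.array([2.0, 3.0, -0.5])
y = X @ beta_true + rng.normal(size=n)  # adds the error term ε

# Least-squares estimate of β: minimizes ||y - Xβ||²
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # should land close to [2.0, 3.0, -0.5]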
Implementation in R
R is a powerful language for statistical computing and graphics. Let's see how to implement multiple linear regression in R using a sample dataset.
# No additional packages are needed: lm(), plot(), and predict() are all in base R
# Create a sample dataset
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 2 + 3*x1 - 0.5*x2 + 1.5*x3 + rnorm(n)
data <- data.frame(y=y, x1=x1, x2=x2, x3=x3)
# View the first few rows of the dataset
head(data)
# Fit the multiple linear regression model
model <- lm(y ~ x1 + x2 + x3, data=data)
# Display the summary of the model
summary(model)
# Visualize the four standard diagnostic plots:
# residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage
par(mfrow=c(2,2))
plot(model)
# Predict using the model
new_data <- data.frame(x1=c(0.5, 1), x2=c(-0.5, 0), x3=c(1, 0.5))
predictions <- predict(model, newdata=new_data)
print(predictions)
The output of the summary function provides important information about the model:
- Coefficients: The estimated values of β₀, β₁, β₂, and β₃
- Standard errors: The standard errors of the coefficient estimates
- t-values and p-values: Used to test the significance of each coefficient
- R-squared: The proportion of variance in the dependent variable explained by the independent variables
- F-statistic: Tests the overall significance of the regression model
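For example, each t-value is simply the coefficient estimate divided by its standard error:

tⱼ = β̂ⱼ / SE(β̂ⱼ)

and its p-value is the probability of observing a t-value at least this extreme if the true coefficient were zero. Since the data above were simulated with known coefficients, the estimates should land close to 2, 3, -0.5, and 1.5.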
Implementation in Python
Python, with libraries like scikit-learn and statsmodels, offers powerful tools for implementing multiple linear regression. Let's see how to do it:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
# Create a sample dataset
np.random.seed(123)
n = 100
x1 = np.random.normal(0, 1, n)
x2 = np.random.normal(0, 1, n)
x3 = np.random.normal(0, 1, n)
y = 2 + 3*x1 - 0.5*x2 + 1.5*x3 + np.random.normal(0, 1, n)
# Create a DataFrame
data = pd.DataFrame({
    'y': y,
    'x1': x1,
    'x2': x2,
    'x3': x3
})
# Display the first few rows
print(data.head())
# Split the data into training and testing sets
X = data[['x1', 'x2', 'x3']]
y = data['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Using scikit-learn
model_sklearn = LinearRegression()
model_sklearn.fit(X_train, y_train)
# Print the coefficients
print("Intercept:", model_sklearn.intercept_)
print("Coefficients:", model_sklearn.coef_)
# Make predictions
y_pred = model_sklearn.predict(X_test)
# Evaluate the model
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R² Score:", r2_score(y_test, y_pred))
# Using statsmodels for more detailed statistics
# (fit on the full dataset here, since the goal is inference rather than prediction)
X_with_const = sm.add_constant(X)
model_statsmodels = sm.OLS(y, X_with_const).fit()
print(model_statsmodels.summary())
# Visualize the results
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted Values')
plt.show()
# Residual plot
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='k', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
Comparing R and Python for Multiple Linear Regression
| Feature | R | Python |
|---|---|---|
| Syntax | Simple and concise with the lm() function | More verbose, requires multiple libraries |
| Statistical Output | Comprehensive by default | Requires statsmodels for detailed statistics |
| Visualization | Built-in diagnostic plots | More customizable with matplotlib/seaborn |
| Integration | Geared primarily toward statistical analysis | Better for integrating with broader data science workflows |
| Learning Curve | Steeper for general-purpose programming | Gentler for those with a programming background |
Assumptions of Multiple Linear Regression
For multiple linear regression to be valid, several assumptions must be met:
- Linearity: The relationship between the independent and dependent variables should be linear.
- Independence: The observations should be independent of each other.
- Homoscedasticity: The residuals should have constant variance at every level of the independent variables.
- Normality: The residuals should be normally distributed.
- No multicollinearity: The independent variables should not be highly correlated with each other (a quick check is sketched below).
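These assumptions can be checked directly in code. As one example, the variance inflation factor (VIF) quantifies multicollinearity, with values above roughly 5-10 commonly treated as problematic. Here is a minimal sketch using statsmodels, assuming the data DataFrame built in the Python section above:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Predictor matrix with an intercept column, as statsmodels expects
X = sm.add_constant(data[['x1', 'x2', 'x3']])

# One VIF per column; the 'const' row can be ignored
vif = pd.DataFrame({
    'variable': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
})
print(vif)  # VIFs near 1 here, since x1, x2, and x3 were generated independently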
Conclusion
Multiple linear regression is a powerful technique for predicting a continuous dependent variable based on multiple independent variables. Both R and Python offer robust tools for implementing and analyzing multiple linear regression models, each with its own strengths and weaknesses.
R excels in statistical analysis with its concise syntax and comprehensive output, while Python offers better integration with other data science tasks and more customizable visualizations. The choice between them depends on your specific needs and preferences.
Whether you're using R or Python, understanding the underlying assumptions and interpreting the results correctly is crucial for making valid inferences from your multiple linear regression models.