Multiple Linear Regression: Implementation in R and Python
Introduction to Multiple Linear Regression
Multiple linear regression is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. It extends simple linear regression and is used when we want to predict the value of one variable based on the values of two or more others.
The multiple linear regression model can be represented as:

Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε
Where:
- Y is the dependent variable
- X₁, X₂, ..., Xₚ are the independent variables
- β₀ is the y-intercept (constant term)
- β₁, β₂, ..., βₚ are the coefficients for each independent variable
- ε is the error term
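To make the formula concrete, ordinary least squares chooses the coefficients that minimize the squared errors between Y and the model's predictions, and this estimate has a closed form. Below is a minimal NumPy sketch of that idea; the simulated data and variable names are illustrative only:

import numpy as np

# Simulate n observations with two predictors and known coefficients
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([
    np.ones(n),          # column of ones for the intercept β₀
    rng.normal(size=n),  # X₁
    rng.normal(size=n)   # X₂
])
beta_true = np.array([2.0, 3.0, -0.5])
y = X @ beta_true + rng.normal(size=n)  # adds the error term ε

# Least-squares estimate of β: minimizes ||y - Xβ||²
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # should land close to [2.0, 3.0, -0.5]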
Implementation in R
R is a powerful language for statistical computing and graphics. Let's see how to implement multiple linear regression in R using a sample dataset.
# No additional packages are needed: lm(), plot(), and predict() are all in base R
# Create a sample dataset
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 2 + 3*x1 - 0.5*x2 + 1.5*x3 + rnorm(n)
data <- data.frame(y=y, x1=x1, x2=x2, x3=x3)
# View the first few rows of the dataset
head(data)
# Fit the multiple linear regression model
model <- lm(y ~ x1 + x2 + x3, data=data)
# Display the summary of the model
summary(model)
# Visualize the four standard diagnostic plots:
# residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage
par(mfrow=c(2,2))
plot(model)
# Predict using the model
new_data <- data.frame(x1=c(0.5, 1), x2=c(-0.5, 0), x3=c(1, 0.5))
predictions <- predict(model, newdata=new_data)
print(predictions)
The output of the summary function provides important information about the model:
- Coefficients: The estimated values of β₀, β₁, β₂, and β₃
- Standard errors: The standard errors of the coefficient estimates
- t-values and p-values: Used to test the significance of each coefficient
- R-squared: The proportion of variance in the dependent variable explained by the independent variables
- F-statistic: Tests the overall significance of the regression model
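For example, each t-value is simply the coefficient estimate divided by its standard error:

tⱼ = β̂ⱼ / SE(β̂ⱼ)

and its p-value is the probability of observing a t-value at least this extreme if the true coefficient were zero. Since the data above were simulated with known coefficients, the estimates should land close to 2, 3, -0.5, and 1.5.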
Implementation in Python
Python, with libraries like scikit-learn and statsmodels, offers powerful tools for implementing multiple linear regression. Let's see how to do it:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
# Create a sample dataset
np.random.seed(123)
n = 100
x1 = np.random.normal(0, 1, n)
x2 = np.random.normal(0, 1, n)
x3 = np.random.normal(0, 1, n)
y = 2 + 3*x1 - 0.5*x2 + 1.5*x3 + np.random.normal(0, 1, n)
# Create a DataFrame
data = pd.DataFrame({
    'y': y,
    'x1': x1,
    'x2': x2,
    'x3': x3
})
# Display the first few rows
print(data.head())
# Split the data into training and testing sets
X = data[['x1', 'x2', 'x3']]
y = data['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Using scikit-learn
model_sklearn = LinearRegression()
model_sklearn.fit(X_train, y_train)
# Print the coefficients
print("Intercept:", model_sklearn.intercept_)
print("Coefficients:", model_sklearn.coef_)
# Make predictions
y_pred = model_sklearn.predict(X_test)
# Evaluate the model
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R² Score:", r2_score(y_test, y_pred))
# Using statsmodels for more detailed statistics
# (fit on the full dataset here, since the goal is inference rather than prediction)
X_with_const = sm.add_constant(X)
model_statsmodels = sm.OLS(y, X_with_const).fit()
print(model_statsmodels.summary())
# Visualize the results
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted Values')
plt.show()
# Residual plot
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='k', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
Comparing R and Python for Multiple Linear Regression
| Feature | R | Python |
|---|---|---|
| Syntax | Simple and concise with the lm() function | More verbose, requires multiple libraries |
| Statistical Output | Comprehensive by default | Requires statsmodels for detailed statistics |
| Visualization | Built-in diagnostic plots | More customizable with matplotlib/seaborn |
| Integration | Geared primarily toward statistical analysis | Better for integrating with broader data science workflows |
| Learning Curve | Steeper for general-purpose programming | Gentler for those with a programming background |
Assumptions of Multiple Linear Regression
For multiple linear regression to be valid, several assumptions must be met:
- Linearity: The relationship between the independent and dependent variables should be linear.
- Independence: The observations should be independent of each other.
- Homoscedasticity: The residuals should have constant variance at every level of the independent variables.
- Normality: The residuals should be normally distributed.
- No multicollinearity: The independent variables should not be highly correlated with each other (a quick check is sketched below).
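These assumptions can be checked directly in code. As one example, the variance inflation factor (VIF) quantifies multicollinearity, with values above roughly 5-10 commonly treated as problematic. Here is a minimal sketch using statsmodels, assuming the data DataFrame built in the Python section above:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Predictor matrix with an intercept column, as statsmodels expects
X = sm.add_constant(data[['x1', 'x2', 'x3']])

# One VIF per column; the 'const' row can be ignored
vif = pd.DataFrame({
    'variable': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
})
print(vif)  # VIFs near 1 here, since x1, x2, and x3 were generated independently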
Conclusion
Multiple linear regression is a powerful technique for predicting a continuous dependent variable based on multiple independent variables. Both R and Python offer robust tools for implementing and analyzing multiple linear regression models, each with its own strengths and weaknesses.
R excels in statistical analysis with its concise syntax and comprehensive output, while Python offers better integration with other data science tasks and more customizable visualizations. The choice between them depends on your specific needs and preferences.
Whether you're using R or Python, understanding the underlying assumptions and interpreting the results correctly is crucial for making valid inferences from your multiple linear regression models.