
Mastering the Art of Multiple Linear Regression in Python


Ever wondered how to make predictions based on several variables at once? This guide covers everything from deciphering intricate relationships in your data to writing practical Python code. This step-by-step lesson is ideal for anybody interested in data science, regardless of experience level or coding knowledge. Prepare to open up new possibilities for data interpretation and analysis. Let’s begin this exciting journey together!

What is multiple linear regression?

In simple linear regression, we deal with a single independent variable, whereas in multiple linear regression, we deal with two or more independent variables.

For example, rainfall depends on many parameters, including pressure, temperature, wind speed, and humidity.

The mathematical equation that represents MLR is

y = b0 + b1x1 + b2x2 + … + bnxn

Here y is the dependent variable, b0, b1, b2, …, bn are the coefficients of the regression model, and x1, x2, …, xn represent the independent variables.
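
As a quick illustration of how this equation produces a prediction, here is a minimal sketch with made-up numbers (the coefficient and x values below are purely illustrative, not taken from any real dataset):

# Hypothetical coefficients (illustrative values only, not from any real dataset)
b0, b1, b2, b3 = 10.0, 2.5, -1.2, 0.7
x1, x2, x3 = 4.0, 3.0, 8.0   # one observation of the independent variables

# y = b0 + b1*x1 + b2*x2 + b3*x3
y_pred = b0 + b1 * x1 + b2 * x2 + b3 * x3
print(y_pred)   # 10 + 10.0 - 3.6 + 5.6 = 22.0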

Implementation of multiple linear regression in Python

The dataset is taken from Kaggle. It contains fish market sales records for 7 different fish species. The columns are Species, Weight, three length measurements (Length1, Length2, and Length3), Height, and Width.

Step 1– The first step is to load all the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
  • NumPy: It is used for mathematical operations
  • pandas: It is used to load and manipulate the dataset
  • Matplotlib and Seaborn: They are visualization tools used to plot graphs
  • statsmodels: It provides the OLS model used later for feature selection
  • train_test_split: It is used to separate our data into training and testing subsets
  • r2_score and mean_squared_error: They are used to evaluate the model

Step 2– The second step is to load the dataset using the pandas read_csv function.

# Load the dataset
data = pd.read_csv('Fish.csv')
data.head()

Step 3- The third step is to check the dataset shape and data types, and to count any missing values. In our case, there are no missing values in the dataset.

# Display the number of rows and columns
print(data.shape)

# Print the datatype of each column
print(data.dtypes)

# Check for missing values
print(data.isnull().sum())

The dataset has 159 rows and 7 columns. All the columns are of float type except the Species column. The output shows that the data does not contain any missing values.

Step 4- The next step is to count all the unique species in the column and then make a bar plot showing the number of fish in each class.

# Count the number of fish of each unique species
data['Species'].value_counts()

# Plot the counts
sns.countplot(x=data['Species'], palette='Set1')
plt.title("Types of Species")
plt.xlabel('Species')
plt.ylabel('Counts of Species')
plt.show()
Bar plot of Species

Step 5– Now we encode all the categorical columns of the dataset and make a correlation table. After that, we use a heatmap to find the relationship between features.

# Encoding categorical columns
for col_name in data.columns:
    if(data[col_name].dtype == 'object'):
        data[col_name]= data[col_name].astype('category')
        data[col_name] = data[col_name].cat.codes

Next, the heatmap is used to find the correlation between the variables.

# Correlation of the variables
corr = data.corr()
sns.heatmap(corr, annot=True)
plt.show()

The heatmap is shown below.

Heatmap

Here is a brief interpretation of the correlation coefficients.

  • Species (as an encoded code) is negatively correlated with all the other features, meaning the other measurements tend to be smaller as the species code increases.
  • Weight, Length1, Length2, Length3, and Width all have a strong positive association, which shows that heavier fish tend to be longer and broader.
  • Length1, Length2, and Length3 are strongly positively correlated with one another. In other words, these three variables, which reflect different measures of fish length, tend to rise or fall together (a sign of multicollinearity; see the check sketched after this list).
  • Height and Width are positively associated with Weight, Length1, Length2, and Length3, indicating that taller and broader fish also tend to be heavier and longer.
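
Because the three length columns move almost in lockstep, it can be worth quantifying this multicollinearity before fitting the model. Below is a minimal sketch using the variance inflation factor (VIF) from statsmodels; it is an optional check, not part of the original walkthrough, and it assumes the categorical encoding above has already been applied.

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute a VIF for each predictor (Weight is the target, so it is excluded)
predictors = sm.add_constant(data.drop(columns=['Weight']))
vif = pd.DataFrame({
    'feature': predictors.columns,
    'VIF': [variance_inflation_factor(predictors.values, i)
            for i in range(predictors.shape[1])]
})
print(vif)   # VIF values far above ~10 usually signal strong multicollinearity
             # (the VIF reported for the constant term itself can be ignored)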

Step 6– Now we check for any outliers (an outlier is a point that lies far away from the rest of the data) in the dataset and remove them. Removing outliers usually yields a more accurate model, so it is a good idea to deal with them.

We are going to use the interquartile range (IQR) to detect outliers and then visualize them using a box plot. As you can see in the figure, some points lie outside the box; these are termed outliers.

# Box plot of the Weight column
sns.boxplot(x=data['Weight'])
plt.show()

# Detect outliers in Weight using the IQR rule
dfw = data['Weight']
dfw_Q1 = dfw.quantile(0.25)
dfw_Q3 = dfw.quantile(0.75)
dfw_IQR = dfw_Q3 - dfw_Q1
dfw_lowerend = dfw_Q1 - (1.5 * dfw_IQR)
dfw_upperend = dfw_Q3 + (1.5 * dfw_IQR)

dfw_outliers = dfw[(dfw < dfw_lowerend) | (dfw > dfw_upperend)]
dfw_outliers
Box plot for checking outliers

Similarly, check for outliers in the other columns using the same technique; a sketch of how the detected rows can be removed is shown below.
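
The code above only detects the outliers in Weight; it does not drop them. Here is a minimal sketch of how one might filter them out by applying the same IQR rule to every measurement column (the 1.5 multiplier is the conventional choice, and the encoded Species column is skipped because it is a label, not a measurement):

# Keep only rows that fall inside the IQR fences for every measurement column
measure_cols = [c for c in data.columns if c != 'Species']
mask = pd.Series(True, index=data.index)
for col in measure_cols:
    q1, q3 = data[col].quantile(0.25), data[col].quantile(0.75)
    iqr = q3 - q1
    mask &= data[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

data = data[mask]
print(data.shape)   # fewer rows than before if any outliers were dropped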

Step 7- Now we define our input and target variables, and then we split the dataset into training and testing sets.

#defining input and target variables
X= data.loc[:, data.columns != 'Weight']
y= data['Weight']

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X,y,test_size = 0.2,random_state = 0)

After that, we fit the multiple linear regression model on the training set and predict the test set results.

# Fitting the Multiple Linear Regression in the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_Train, Y_Train)

# Predicting the Test set results
Y_Pred = regressor.predict(X_Test)
# Evaluation
print("R² Score:", r2_score(Y_Test, Y_Pred))
print("RMSE:", np.sqrt(mean_squared_error(Y_Test, Y_Pred)))
  • R² Score: 0.8734952284180402
  • RMSE: 155.17817227812904
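
As an optional check, one can also inspect the intercept and coefficients that the model has learned; this is simply the fitted version of the equation introduced earlier.

# Inspect the fitted equation: intercept (b0) and one coefficient per feature
coefficients = pd.DataFrame({'feature': X.columns, 'coefficient': regressor.coef_})
print("Intercept (b0):", regressor.intercept_)
print(coefficients)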

How to select the best features in the dataset

Suppose a dataset has hundreds of features. The question then arises: how do we select the features that have the strongest impact on the target variable?

To solve this problem, we use feature selection.

Feature selection is a technique used to find the variables that have the most impact on the target variable.

Backward selection

This method starts with all the variables and removes them from the model one by one. Think of it like an entrance exam:

  • Initially, 1000 students appear in the exam.
  • After the first round, 900 students are eliminated (let us say the threshold is 85%).
  • In the second round of the exam, 50 students are selected and the other 50 are eliminated.
  • This process is repeated until the best candidates are found.

Steps to perform backward selection

  1. The first step is to fit the model with all independent variables.
  2. The second step is to choose a threshold value, say p = 5% (0.05).
  3. The next step is to remove the independent variable with the highest p-value if it is greater than the threshold; otherwise, stop.
  4. Refit the model with the remaining variables and repeat step 3.

Implementation of backward selection

Next, add a constant (intercept) column for statsmodels.

# Add constant column for statsmodels
X = sm.add_constant(X)

Step 8- Now we perform feature selection. We start with backward elimination. After that, we fit the multiple linear regression model on the optimal training set and predict the test set results.

# Building the optimal model using Backward Elimination

def backward_elimination(X, y, significance_level=0.05):
    features = list(X.columns)
    while True:
        X_ols = sm.OLS(y, X[features]).fit()
        p_values = X_ols.pvalues
        max_p_value = p_values.max()
        if max_p_value > significance_level:
            excluded_feature = p_values.idxmax()
            print(f"Dropping '{excluded_feature}' with p-value {max_p_value:.4f}")
            features.remove(excluded_feature)
        else:
            break
    return X[features], X_ols

X_optimal, model_summary = backward_elimination(X, y)
print(model_summary.summary())

When backward elimination is applied, the Width and Length1 columns are dropped from the dataset because their p-values are greater than 0.05.

Now fit the multiple linear regression model and make predictions on the test set.

# Train-test split using optimal features
X_train, X_test, y_train, y_test = train_test_split(X_optimal, y, test_size=0.2, random_state=0)

# Fit and predict with optimal features
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

After that, evaluate the optimal model using the R² score and the root mean squared error (RMSE).

# Evaluation
print("R² Score:", r2_score(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
  • R² Score: 0.8748358349768541
  • RMSE: 154.35374898393303

Forward Selection

This method selects features from scratch. Imagine you are building a cricket team from scratch, i.e., you start with zero players.

  • Let’s say you have a pool of 100 players to choose from.
  • First, you pick the best player (say the best batsman), based on a threshold.
  • Then, from the remaining players, you add one player at a time based on performance.
  • Repeat this process until your team is complete.

Implementation of Forward selection

The first step is to implement the forward selection algorithm.

def forward_selection(X, y, significance_level=0.05):
    initial_features = []
    remaining_features = list(X.columns)
    best_features = []

    while remaining_features:
        pvals = {}
        for feature in remaining_features:
            try_features = initial_features + [feature]
            model = sm.OLS(y, X[try_features]).fit()
            pvals[feature] = model.pvalues[feature]
        
        min_p_value = min(pvals.values())
        best_feature = min(pvals, key=pvals.get)

        if min_p_value < significance_level:
            print(f"Adding '{best_feature}' with p-value {min_p_value:.4f}")
            initial_features.append(best_feature)
            remaining_features.remove(best_feature)
            best_features = initial_features.copy()
        else:
            break
    
    final_model = sm.OLS(y, X[best_features]).fit()
    return X[best_features], final_model

Then call this function.

X_forward, forward_model = forward_selection(X, y)
print(forward_model.summary())

In the above output, the Length1 feature is added first, followed by const, Height, Species, and Length3. The features are selected one at a time based on the threshold, i.e., p-values less than 0.05.

Now split the data into training and test sets based on the optimal features, then fit the multiple linear regression model and make predictions on the test set.

# Train-test split using optimal features
X_train, X_test, y_train, y_test = train_test_split(X_forward, y, test_size=0.2, random_state=0)

# Fit and predict with optimal features
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

After that, evaluate the optimal model using the R² score and the root mean squared error (RMSE).

# Evaluation
print("R² Score:", r2_score(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
  • R² Score: 0.8717117970079958
  • RMSE: 156.26817486237906

Stepwise Selection

This is basically a combination of the forward and backward selection algorithms. The algorithm works as follows:

  • Starts with no predictors.
  • Adds predictors one-by-one like forward selection.
  • At each step, checks if any included predictors should be removed like backward elimination.
  • This continues until no more predictors can be added or removed based on a significance level.

Implementation of Stepwise selection

The first step is to implement the stepwise selection algorithm.

def stepwise_selection(X, y, entry_threshold=0.05, exit_threshold=0.05):
    included = []
    while True:
        changed = False

        # Forward Step
        excluded = list(set(X.columns) - set(included))
        new_pvals = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [new_column]])).fit()
            new_pvals[new_column] = model.pvalues[new_column]
        
        if not new_pvals.empty:
            best_pval = new_pvals.min()
            if best_pval < entry_threshold:
                best_feature = new_pvals.idxmin()
                included.append(best_feature)
                changed = True
                print(f"Add  '{best_feature}' with p-value {best_pval:.4f}")

        # Backward Step
        if included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvalues = model.pvalues.iloc[1:]  # skip intercept
            worst_pval = pvalues.max()
            if worst_pval > exit_threshold:
                worst_feature = pvalues.idxmax()
                included.remove(worst_feature)
                changed = True
                print(f"Drop '{worst_feature}' with p-value {worst_pval:.4f}")
        
        if not changed:
            break

    model = sm.OLS(y, sm.add_constant(X[included])).fit()
    return X[included], model

Then call the stepwise selection function.

X_stepwise, stepwise_model = stepwise_selection(X.drop(columns=['const']), y)
print(stepwise_model.summary())

Now split the data into training and test sets based on the optimal features, then fit the multiple linear regression model and make predictions on the test set.

# Train-test split using optimal features
X_train, X_test, y_train, y_test = train_test_split(X_stepwise, y, test_size=0.2, random_state=0)

# Fit and predict with optimal features
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

After that, evaluate the optimal model using the R² score and the root mean squared error (RMSE).

# Evaluation
print("R² Score:", r2_score(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
  • R² Score: 0.8706799060410972
  • RMSE: 156.89539058960813

Applications

Here are five major applications of the multiple linear regression model:

  • Economics: Multiple linear regression is useful in modelling economic patterns, such as how income, wealth, and interest rates affect consumer spending.
  • Healthcare: It’s utilized to research how therapies and lifestyle variables affect patient outcomes. For example, estimating life expectancy depends on illnesses, food, and activity.
  • Real estate: Based on factors including location, size, and age, it helps forecast property values.
  • Marketing: It helps to comprehend how spending on advertising affects sales across various channels.
  • Finance: By taking into account factors like earnings, GDP growth, and other market indicators, MLR aids in the prediction of stock values.

Pros and cons

Next are the pros and cons of using the multiple linear regression model.

Pros:

  • It can help identify the relative influence of predictors.
  • It can handle overfitting when combined with regularization techniques.
  • It can analyze multiple predictors simultaneously.
  • It provides a global model of the dataset, which can be advantageous for understanding relationships between variables.
  • It is easy to implement, and its output is easy to interpret.
  • It can be used to infer relationships between variables (and, under strong assumptions, causal ones).

Cons:

  • It assumes a linear relationship between the dependent and independent variables.
  • It is sensitive to outliers, which can lead to a poor model.
  • Multicollinearity can be a serious issue, requiring careful correlation analysis and potentially variable removal.
  • It does not handle non-numerical (categorical) variables well without conversion.
  • It assumes no autocorrelation (i.e., the residuals are independent).
  • It assumes the residuals are normally distributed and homoscedastic.

Conclusion

In conclusion, multiple linear regression is a powerful statistical technique that enables us to analyze the relationship between several independent variables and a dependent variable. Thanks to this lesson, you should now have a solid understanding of the concepts behind multiple linear regression and how to use it in Python.

Although it has applications in a wide range of domains, the assumptions of linearity, normality, and homoscedasticity must be satisfied for the findings to be reliable. We also covered feature selection and elimination methods that help in building more precise models. Keep in mind that mastering multiple linear regression takes practice and a deeper understanding of its statistical foundations. So keep exploring and have fun analyzing!
