Abhishek Singh

Everything you should know about Linear Regression in Python


Linear regression in python

Data is the most powerful weapon in today's world: more than 2.5 quintillion bytes of data are produced every single day, and more than 90% of the world's data has been generated in the last two years alone.


Every sector uses data as its most important tool to grow its business, and every industry wants to integrate artificial intelligence into its operations. Machine learning and data science skills are in huge demand, with more than a million new jobs expected over the next five years.


Linear regression is one of the basic statistical algorithms in machine learning.

In this tutorial, you will learn about linear regression and its various implementations in Python.


After reading this blog post, you will be able to answer all of the following questions.



Some other blog posts that you may want to read:


What is linear regression?

Linear regression is a statistical model that examines the linear relationship between two (simple linear regression) or more (multiple linear regression) variables: a dependent variable and one or more independent variables.


The term linear relationship means that when one variable changes, the other changes at a constant rate: the two may move in the same direction (a positive relationship) or in opposite directions (a negative relationship).


In the given figure we can see that a linear relationship can be positive or negative.


Let us understand it with an example. In a company XYZ, John's salary is directly proportional to the number of hours he works. This is a positive linear relationship between John's salary and the number of hours he works.


The price of laptops decreases over time, which shows a negative linear relationship between laptop price and time.



Let us understand a little bit of the math behind linear regression. The equation of linear regression is

                           Y = mX + b

where

  • Y is an output variable,

  • X is the input variable- the variables we are using to make predictions,

  • m is the slope, which determines the effect of X on Y,

  • and b is the intercept (also called the bias), which is the value of Y when X is zero.
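For example, a quick numerical sketch of this equation in Python (the values of m and b below are made up just for illustration):

m = 2.5          # slope: every extra unit of X adds 2.5 to Y
b = 10.0         # intercept/bias: the value of Y when X is 0
X = 4.0
Y = m * X + b
print(Y)         # prints 20.0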


As we have seen in the previous blog post, one of the assumptions of regression is that the output variable must be continuous. In regression we are always trying to minimize our error by finding the "line of best fit": the line that gives the minimum error between our predictions and the actual outputs.

In the given plot, we are trying to make the black lines (the distances between the line and the data points) as short as possible. To measure this error we use the mean squared error, which is the residual sum of squares divided by the number of data points.


You can check out the following article written by Patrick and his team, which clearly explains the math behind linear regression.


Now let us move on to the implementation of linear regression in Python.


Where do we use Linear regression?

1- It is used to model customer behaviour so that, based on it, we can offer special plans to customers who are likely to churn out of our subscription-based service.


2- It is used to predict house prices based on features like the number of rooms, the location of the house, the number of floors, home size, the condition of the house, how old the house is, and many more.


3- Suppose that two campaigns are running on social media platforms such as Facebook and Google. Using linear regression we can measure which platform has more impact on our product or service.




4- Suppose that a logistics company wants to estimate how much money it needs to transport a courier package to its destination, using features like distance, fuel consumption, and package weight. In this case we can use multiple linear regression.



What are the four main ways to implement linear regression in Python?


Let us start with the first one.


Linear regression in python from scratch

Before diving into the code, let us first understand our dataset. We will use the Boston housing dataset, in which we have to predict the price of a house based on several factors such as area, number of rooms, and many others.


To download this dataset, go to the following link.


Let us see the step-by-step approach to solving this problem using linear regression in Python.


1- The first step is to import the libraries and load the dataset.
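A minimal sketch of this step (the file name housing.csv is an assumption; use whatever name you saved the downloaded dataset under):

from csv import reader

def load_csv(filename):
    # Load a CSV file into a list of rows (each row is a list of strings)
    dataset = list()
    with open(filename, 'r') as file:
        for row in reader(file):
            if row:
                dataset.append(row)
    return dataset

dataset = load_csv('housing.csv')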



2- The second step is to convert the string columns to float and then split the dataset into training and testing sets.
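One possible sketch of this step, assuming every column in the loaded rows is numeric text:

from random import randrange, seed

def str_column_to_float(dataset, column):
    # Convert one column from string to float in every row
    for row in dataset:
        row[column] = float(row[column].strip())

def train_test_split(dataset, split=0.8):
    # Randomly move rows into the training set until it holds `split` of the data
    train = list()
    dataset_copy = list(dataset)
    while len(train) < split * len(dataset):
        train.append(dataset_copy.pop(randrange(len(dataset_copy))))
    return train, dataset_copy   # the leftover rows form the test set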



3- Find the mean and variance of both the input and the output variables from the training data.
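A sketch of the two helper functions (here the variance is left as the sum of squared deviations; the 1/n factor is dropped because it cancels out later when we divide the covariance by the variance):

def mean(values):
    return sum(values) / float(len(values))

def variance(values, mean_value):
    # Sum of squared deviations from the mean
    return sum((x - mean_value) ** 2 for x in values)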



4- Now calculate the covariance, which is used to determine the relationship between two variables.


The formula to calculate covariance is


cov(X, Y) = (1/n) * Σ (xi − x̄) * (yi − ȳ), where x̄ and ȳ are the means of X and Y.
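A sketch of the covariance helper (the 1/n factor is again dropped, to stay consistent with the variance helper above):

def covariance(x, mean_x, y, mean_y):
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar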

5- The fifth step is to estimate the values of both coefficients (the slope m and the intercept b) of the linear regression.
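A sketch of the coefficient estimates, assuming the predictor sits in the first column of each row and the house price in the last column:

def coefficients(train):
    x = [row[0] for row in train]    # input variable (assumed first column)
    y = [row[-1] for row in train]   # house price (assumed last column)
    x_mean, y_mean = mean(x), mean(y)
    m = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)   # slope
    b = y_mean - m * x_mean                                      # intercept/bias
    return m, b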


6- Now the second-to-last step is to define the simple linear regression model that makes the predictions, i.e., the prices of the houses.
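The model itself then reduces to applying Y = mX + b to every test row, as in this sketch:

def simple_linear_regression(train, test):
    predictions = list()
    m, b = coefficients(train)
    for row in test:
        predictions.append(m * row[0] + b)   # predicted house price
    return predictions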



7- The final step is to evaluate the performance of the algorithm, i.e., how well it performs on unseen data. Three common evaluation metrics are:


1) Mean absolute error: the mean of the absolute differences between the predicted and the actual values. The formula is MAE = (1/n) * Σ |yi − ŷi|, where yi is the actual value and ŷi the predicted value.

2) Mean squared error: the average of the squared differences between the predicted and the actual values. The formula is MSE = (1/n) * Σ (yi − ŷi)².

3) Root mean squared error: the square root of the mean of the squared errors. The formula is RMSE = √MSE = √((1/n) * Σ (yi − ŷi)²).



In our case, we are using the root mean squared error.
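A small sketch of the root mean squared error:

from math import sqrt

def rmse_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        sum_error += (predicted[i] - actual[i]) ** 2
    return sqrt(sum_error / float(len(actual)))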



And then we call the main function and see the output result.
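Tying the pieces above together might look roughly like this (assuming the CSV has no header row; drop the header first if it does):

seed(1)
dataset = load_csv('housing.csv')
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)

train, test = train_test_split(dataset, split=0.8)
predicted = simple_linear_regression(train, test)
actual = [row[-1] for row in test]
print('RMSE: %.3f' % rmse_metric(actual, predicted))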





Linear regression in Statsmodels

Statsmodels is a Python module used for estimating many different statistical models, as well as for conducting statistical tests and statistical data exploration.



Let us see the approach to solving this linear regression problem using the statsmodels package.


1- Import the required libraries and load the Boston dataset.

2- Now define the input variables and the target variable.
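A minimal sketch of these two steps (the file name housing.csv and the target column name MEDV are assumptions; adjust them to your copy of the data):

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('housing.csv')
X = df.drop('MEDV', axis=1)   # input variables
y = df['MEDV']                # target variable (house price)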

3- The third step is to add the intercept to our input variables.


X = sm.add_constant(X)

4- Now the last step is to fit the OLS (ordinary least squares) model and make the predictions.


model = sm.OLS(y, X).fit() 
predictions = model.predict(X)

Linear regression using scikit-learn

Now comes one of the most well-known packages in machine learning, "sklearn", which is known by almost everyone who studies machine learning. The sklearn package provides many algorithms for regression, classification, clustering, and dimensionality reduction.


Now let us see how we solve the linear regression problem using sklearn.


1- The first step is to import the linear_model module from sklearn and load the dataset.



2- The second step is to define the input and output variables.



3- And the last step is to create an instance of LinearRegression, which represents the regression model. You can pass several parameters to LinearRegression, such as:


fit_intercept: a boolean. Set it to True if we want the model to calculate the intercept, otherwise False. The default value is True.


normalize: a boolean. Set it to True if we want to normalize the input variables, otherwise False. The default value is False. (Recent versions of scikit-learn have removed this parameter, so if it is unavailable, scale the data beforehand instead.)


copy_X: a boolean. Set it to True if we want to work on a copy of the input variables instead of overwriting them, otherwise False. The default value is True.


n_jobs: an integer. If you set n_jobs to -1, it will use all CPU cores; if it is set to 1 or 2, it will use only one or two cores.


In our case, we use default values. And then we fit the model.


lm = linear_model.LinearRegression()
model = lm.fit(X,y)
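As a small follow-up, once the model is fitted you can inspect the coefficients and make predictions (a sketch, assuming X and y have been defined as before):

predictions = model.predict(X)               # predicted house prices
print(model.intercept_, model.coef_)         # fitted intercept and slopes
print('R squared:', model.score(X, y))       # R-squared on the training data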


Linear regression using Scipy

Steps that are involved to perform linear regression using scipy:


1-The first step is to import the stats library from the Scipy package.


2- The second step is to define our input variables and the output variables.


3- Now we perform the linear regression using the linregress function, passing the input and output variables, i.e., X and Y.


4- Now we print the R-squared value (coefficient of determination). After that, we plot the data along with the fitted line.
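Putting the four steps together, a minimal sketch could look like this (the x and y arrays are placeholder data; in practice use one input column and the price column from the dataset):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

result = stats.linregress(x, y)
print(f"R-squared: {result.rvalue ** 2:.4f}")

plt.plot(x, y, 'o', label='data')
plt.plot(x, result.intercept + result.slope * x, 'r', label='fitted line')
plt.legend()
plt.show()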


plot of linear regression using scipy

What is multiple linear regression?

In simple linear regression we deal with two variables, while in multiple linear regression we deal with more than two variables.


For example- Rainfall depends upon many parameters including pressure, temperature, wind speed, humidity, and many more.


The mathematical equation which represents the MLR is


Y = b0 + b1*X1 + b2*X2 + … + bn*Xn, where b0 is the intercept and b1 … bn are the coefficients of the input variables X1 … Xn.

How to select the best features in the dataset

Suppose a dataset has hundreds of features. The question that arises is: how do we select the features that have the strongest impact on the target variable?


So to solve this problem we use the feature selection method.


Feature selection is a technique used to find the variables that have the most impact on the target variable.


Some of the main techniques are


1) Chi-squared method

We calculate the chi-squared value between each feature and the target, and then select the desired number of features with the best chi-square (χ²) scores.


The formula to calculate χ² is

χ² = Σ (Oi − Ei)² / Ei

where Oi represents the number of observations in class i and Ei represents the number of expected observations in class i.
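A hedged sketch using scikit-learn's SelectKBest (the value of k is illustrative; note that the chi-squared test expects non-negative feature values and is normally used with a categorical target):

from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=5)   # keep the 5 best-scoring features
X_selected = selector.fit_transform(X, y)
print(selector.scores_)                        # chi-squared score of each feature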



2) Backward elimination

This method involves the removal of variables from the model.


Steps to perform backward elimination (a code sketch follows the list):

  1. The first step is to fit the model with all independent variables.

  2. The second step is to choose a significance threshold, let us say p = 5%.

  3. The next step is to remove the independent variable with the highest p-value if that p-value is greater than 5%; if no variable exceeds the threshold, finish.

  4. Fit the model again with the remaining variables and repeat from step 3.
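A rough sketch of these steps using statsmodels p-values (X is assumed to be a pandas DataFrame of inputs and y the target; the 5% threshold matches the steps above):

import statsmodels.api as sm

def backward_elimination(X, y, threshold=0.05):
    features = list(X.columns)
    while features:
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        pvalues = model.pvalues.drop('const')
        worst = pvalues.idxmax()               # variable with the highest p-value
        if pvalues[worst] > threshold:
            features.remove(worst)             # drop it and refit
        else:
            break
    return features, model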


Implementation of multiple linear regression in python

The dataset is taken from Kaggle. It contains sales records for 7 different fish species in a fish market. The columns are fish species, weight, length, height, and width.



Step 1- The first step is to load all the required libraries (an import sketch follows the list below).


  • numpy: used for mathematical operations

  • pandas: used to load the dataset

  • Matplotlib and Seaborn: visualization tools used to plot graphs

  • train_test_split: used to split our data into training and testing subsets
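For example, the imports for this step could look like:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split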


Step 2- The second step is to load the dataset using pandas read_csv function.


Step 3- The third step is to check the dataset shape, data types, and count any missing values in the dataset. In our case, there are no missing values in the dataset.
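A sketch covering steps 2 and 3 (the file name Fish.csv is how the Kaggle file is commonly saved; adjust it if yours differs):

df = pd.read_csv('Fish.csv')
print(df.shape)            # number of rows and columns
print(df.dtypes)           # data type of each column
print(df.isnull().sum())   # count of missing values per column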


Step 4- The next step is to count all the unique species in the species column and then make a bar plot showing the number of fishes in each class.
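A sketch of this step (the column name Species is taken from the Kaggle file):

print(df['Species'].value_counts())            # number of fishes in each species
df['Species'].value_counts().plot(kind='bar')
plt.xlabel('Species')
plt.ylabel('Count')
plt.show()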


Step 5- Now we encode all the categorical columns of the dataset and build a correlation table. After that, we use a heatmap to visualize the relationships between features.


From the heat map below we can see that the weight is highly correlated with Length1, Length2, and Length3.
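A sketch of the encoding and the heatmap (label encoding is one possible choice here):

from sklearn.preprocessing import LabelEncoder

df['Species'] = LabelEncoder().fit_transform(df['Species'])
corr = df.corr()                 # correlation table
sns.heatmap(corr, annot=True)    # heatmap of feature relationships
plt.show()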




Step 6- Now we check for outliers (an outlier is a point, or group of points, that is markedly different from the other points) in the dataset and remove them. Removing outliers usually gives us a more accurate model, so it is a good idea to remove them.


We are going to use the inter-quartile range (IQR) to detect outliers, and we can visualize them using a box plot. As you can see in the figure, some points lie outside the box; these are termed outliers.



Using Boxplot to detect outliers

Similarly, check for outliers in the other columns using the same technique.
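A sketch of the box plot and the IQR rule applied to all numeric columns at once (1.5 × IQR is the usual rule of thumb):

sns.boxplot(x=df['Weight'])   # points outside the whiskers are outliers
plt.show()

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]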

Step 7- Now we define our input and target variables. And then we split the dataset into training and testing.
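A sketch of this step (treating Weight as the target is an assumption consistent with the correlation discussion above):

X = df.drop('Weight', axis=1)
y = df['Weight']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)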


Step 8- Now we fit the multiple linear regression model on the training set and predict the test set results.
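A sketch of fitting and predicting with scikit-learn:

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)   # predictions on the test set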



Step 9- Now we perform feature selection. We are going to use backward elimination.


Step 10- Now we fit the multiple linear regression on the optimal training set (only the selected features) and predict the test set results.
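A rough sketch tying steps 9 and 10 together, reusing the backward_elimination helper sketched earlier (the variable names are illustrative):

import statsmodels.api as sm

selected_features, _ = backward_elimination(X_train, y_train, threshold=0.05)

X_opt = sm.add_constant(X_train[selected_features])
ols_model = sm.OLS(y_train, X_opt).fit()
print(ols_model.summary())           # the OLS summary shown below

test_pred = ols_model.predict(sm.add_constant(X_test[selected_features]))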



OLS Summary


Wrapping up this session

In this tutorial, we have learned about the most basic machine learning algorithm, linear regression, and the various ways to implement it.


Then we learned how to implement linear regression using 4 different ways:

  1. Linear regression in python from scratch

  2. Linear regression in python using statsmodel

  3. Linear regression in python using Scikit-learn

  4. Linear regression in python using Scipy

We have also learned where to use linear regression, what multiple linear regression is, and how to implement it in Python using sklearn.



Tell me in the comments which method you like the most. And if you have any problem regarding the implementation, feel free to drop a comment; I will reply within 24 hrs.


So if you like this blog post, please like it and subscribe to our DataSpoof community to get real-time updates. You can also follow our Facebook page to get a notification whenever we upload a post, so you never miss an update from us.




