
Everything that you should know about Linear Regression in Python

Data is the most powerful weapon in today’s world; more than 2.5 quintillion bytes of data are produced every single day, and more than 90% of the world’s data was generated in just the last two years.

Every sector uses data as its most important tool to grow its business, and every industry wants to integrate artificial intelligence into its operations. Machine learning and data science skills are in high demand, and more than 1 million new jobs are expected to be created over the next 5 years.

Linear regression is one of the basic statistical algorithms in machine learning. In this tutorial, you will learn about linear regression and its various implementations in Python.

After reading this blog post, you will be able to answer the questions covered in the sections below.

What is linear regression?

Linear regression is a statistical model that examines the linear relationship between two variables (simple linear regression) or more (multiple linear regression): a dependent variable and one or more independent variables.

The term “linear relationship” means that when one variable changes, the other changes at a constant rate: in a positive relationship both variables increase together, while in a negative relationship one increases as the other decreases. In the given figure, we can see that a linear relationship can be positive or negative.

Figure: positive and negative correlation in a linear regression model (source: slideplayer.com)

Let us take an example: in company XYZ, John’s salary is directly proportional to the number of hours he works, so John’s salary has a positive linear relationship with the number of hours worked.

The price of a laptop decreases over time, which shows a negative linear relationship between laptop price and time.

Let us understand a little bit of the math behind linear regression.

The equation of linear regression is

                           Y = mX + b

where

  • Y is the output variable, the value we want to predict,
  • X is the input variable, the variable we use to make predictions,
  • m is the slope, which determines how much Y changes for a unit change in X,
  • and b is the intercept (often called the bias term), the value of Y when X is 0 (a small numeric example follows this list).
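
As a small numeric example, here is a minimal sketch with made-up slope and intercept values (m = 2.5 and b = 10.0 are purely illustrative, not fitted values):

# Hypothetical slope and intercept, chosen only for illustration
m = 2.5   # slope: each extra unit of X adds 2.5 to Y
b = 10.0  # intercept (bias): the value of Y when X is 0

def predict(x):
    return m * x + b

print(predict(4))  # 2.5 * 4 + 10.0 = 20.0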

As we saw in the previous blog post, one of the assumptions of regression is that the output variable must be continuous in order to make a prediction.

In regression, we always try to minimize the error by finding the “line of best fit”: the line that gives the smallest overall difference between our predictions and the actual outputs.

Figure: the line of best fit

In the plot above, we are trying to make the black lines (the distances between the data points and the line) as short as possible. To measure this error, we use the residual sum of squares, i.e. the sum of the squared differences; dividing it by the number of points gives the mean squared error.
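
To make the distinction concrete, here is a minimal sketch, on made-up numbers, of how the residual sum of squares and the mean squared error are computed:

actual    = [3.0, 5.0, 7.0, 9.0]
predicted = [2.8, 5.3, 6.9, 9.4]

# Residual sum of squares: add up the squared differences
rss = sum((p - a) ** 2 for p, a in zip(predicted, actual))

# Mean squared error: the average of those squared differences
mse = rss / len(actual)

print(rss, mse)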

You can check out the following article, written by Patrick and his team, which clearly explains the math behind linear regression.

Now let us go through the implementation of linear regression in Python.

Where do we use linear regression?

1- It is used to model customer behavior so that, based on it, we can offer special plans to customers who are likely to churn out of our subscription-based service.

2- It is also used to predict house prices based on features like the number of rooms, the location of the house, the number of floors, the home size, the condition of the house, how old the house is, and many more.

3- Suppose two campaigns are running on platforms like Facebook and Google. By using linear regression, we can determine which channel has a greater impact on our product or service.

4- Suppose a logistics company wants to estimate how much it costs to transport one courier package to the sender’s location, using features such as distance, fuel consumption, and package weight. In this case, multiple linear regression can be used.

What are the four main ways to implement linear regression in Python?

4 ways to implement Linear Regression in Python

Let us start with the first one.

Linear regression in Python from scratch

Before diving into the code, let us first understand our dataset. We will use the Boston housing dataset, in which we have to predict the price of a house based on several factors, such as the area, the number of rooms, and many others.

To download this dataset, go to the following link.

Let us see the step-by-step approach to solving this problem using linear regression in Python.

1- The first step is to import the required libraries and define a function that loads the dataset from a CSV file.

from random import seed
from random import randrange
from csv import reader
from math import sqrt

# Load a CSV file into a list of rows
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

2- The second step is to convert the string columns to floats and then split the dataset into training and test sets.

# Convert a string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Split a dataset into a train set and a test set
def train_test_split(dataset, split):
    train = list()
    train_size = split * len(dataset)
    dataset_copy = list(dataset)
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy
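
As a quick sanity check, here is a usage sketch on a small made-up dataset (calling seed makes the random split reproducible):

seed(1)
toy = [[1.0, 2.1], [2.0, 4.0], [3.0, 6.2], [4.0, 7.9], [5.0, 10.1]]
train, test = train_test_split(toy, 0.6)
print(len(train), len(test))  # 3 rows for training, 2 for testing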

3- Find the mean and variance of both the input and the output variables from the training data.

# Calculate the mean of a list of values
def mean(values):
    return sum(values) / float(len(values))

# Calculate the variance of a list of values
# (kept as the sum of squared deviations; the missing 1/n factor
# cancels out with the covariance when estimating the slope)
def variance(values, mean):
    return sum([(x - mean) ** 2 for x in values])
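
A quick sketch of these helpers on the first column of the toy data introduced above:

x_values = [row[0] for row in toy]
x_mean = mean(x_values)
# mean is 3.0; variance here is the sum of squared deviations, 10.0
print(x_mean, variance(x_values, x_mean))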

4- Now calculate the covariance, which measures how two variables vary together. The formula to calculate covariance is:

covariance(x, y) = sum over all data points of (x_i - mean(x)) * (y_i - mean(y))

# Calculate the covariance between x and y
def covariance(x, mean_x, y, mean_y):
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar

5- The fifth step is to estimate the two coefficients of the regression line.

# Estimate the intercept (b0) and slope (b1) from the training data
def coefficients(dataset):
    x = [row[0] for row in dataset]
    y = [row[1] for row in dataset]
    x_mean, y_mean = mean(x), mean(y)
    b1 = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)
    b0 = y_mean - b1 * x_mean
    return [b0, b1]
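
Continuing with the toy data from earlier, a quick sketch of what the estimated coefficients look like:

b0, b1 = coefficients(toy)
# for this toy data the slope comes out close to 2
print('intercept = {:.3f}, slope = {:.3f}'.format(b0, b1))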

6- The second-to-last step is to define the simple linear regression model that makes the predictions, i.e., the house prices.

# Fit a simple linear regression on the train set and predict on the test set
def simple_linear_regression(train, test):
    predictions = list()
    b0, b1 = coefficients(train)
    for row in test:
        yhat = b0 + b1 * row[0]
        predictions.append(yhat)
    return predictions

7- Now the final step is to evaluate the performance of the algorithm, which means checking how well it performs on unseen data. There are three common evaluation metrics:

1) Mean absolute error: it is the mean of the absolute differences between the predicted values and the actual values. The mathematical formula to calculate the mean absolute error is:

MAE = (1 / n) * sum(|predicted_i - actual_i|)

2) Mean squared error: It is the average squared difference between the predicted value and the actual value. The mathematical formula is:

MSE = (1 / n) * sum((predicted_i - actual_i)^2)

3) Root mean squared error: it is the square root of the mean of the squared errors. The mathematical formula is:

RMSE = sqrt((1 / n) * sum((predicted_i - actual_i)^2))
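
For completeness, here is a minimal sketch of the first two metrics written in the same style as the rest of this from-scratch code; these helpers are not part of the pipeline below, where only the RMSE is used:

# Mean absolute error between actual and predicted values
def mae_metric(actual, predicted):
    total_error = 0.0
    for i in range(len(actual)):
        total_error += abs(predicted[i] - actual[i])
    return total_error / float(len(actual))

# Mean squared error between actual and predicted values
def mse_metric(actual, predicted):
    total_error = 0.0
    for i in range(len(actual)):
        total_error += (predicted[i] - actual[i]) ** 2
    return total_error / float(len(actual))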

In our case, we use the root mean squared error:

# Calculate the root mean squared error between actual and predicted values
def rmse_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        prediction_error = predicted[i] - actual[i]
        sum_error += (prediction_error ** 2)
    mean_error = sum_error / float(len(actual))
    return sqrt(mean_error)

And then we tie everything together: load the data, convert the columns to floats, split the data, make predictions, and report the error.

seed(1)
filename = 'boston.csv'
dataset = load_csv(filename)
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)

# 60/40 train/test split, then fit, predict and evaluate
split = 0.6
train, test = train_test_split(dataset, split)
predictions = simple_linear_regression(train, test)
actual = [row[1] for row in test]
rmse = rmse_metric(actual, predictions)
print('RMSE: %.3f' % rmse)

Linear regression in Statsmodels

Statsmodels is a Python module used for estimating many different statistical models, as well as for conducting statistical tests and statistical data exploration.

Let us see the approach to solving this linear regression problem using the statsmodels package.

1- Import the required libraries and load the Boston dataset.

import statsmodels.api as sm
from sklearn import datasets 
import numpy as np
import pandas as pd

data = datasets.load_boston()

2- Now define the input variables and the target variable.

df = pd.DataFrame(data.data, columns=data.feature_names)

target = pd.DataFrame(data.target, columns=["MEDV"])


X = df["RM"]
y = target["MEDV"]

3- The third step is to add the intercept to our input variables.

X = sm.add_constant(X)

4- Now the last step is to fit the OLS (ordinary least squares) model and make the predictions.

model = sm.OLS(y, X).fit() 
predictions = model.predict(X)
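
Once the model is fitted, statsmodels can also report the usual OLS diagnostics; a short sketch of how you might inspect them:

print(model.params)     # estimated intercept and slope for RM
print(model.rsquared)   # coefficient of determination
print(model.summary())  # full regression table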

Linear regression using scikit-learn

Now comes the most well-known package in machine learning, scikit-learn (imported as “sklearn”), which almost everyone studying machine learning has come across. It provides implementations of many algorithms for regression, classification, clustering, and dimensionality reduction.

Now let us see how we solve the linear regression problem using sklearn.

1- The first step is to import the linear_model and datasets modules from sklearn and load the Boston dataset.

from sklearn import linear_model
from sklearn import datasets
import numpy as np
import pandas as pd

data = datasets.load_boston()

2- The second step is to define the input and output variables.

df = pd.DataFrame(data.data, columns=data.feature_names)

target = pd.DataFrame(data.target, columns=["MEDV"])

X = df
y = target["MEDV"]

3- And the last step is to create an instance of LinearRegression, which represents the regression model. You can pass several parameters to LinearRegression, such as the following (a short sketch with these parameters set explicitly comes after the list).

  • fit_intercept: a boolean. Set it to True to calculate the intercept, or False to skip it. The default value is True.
  • normalize: a boolean. Set it to True to normalize the input variables, or False to leave them as they are. The default value is False.
  • copy_X: a boolean. Set it to True to work on a copy of the input variables, or False to allow them to be overwritten. The default value is True.
  • n_jobs: an integer. If you set n_jobs to -1, all CPU cores are used; setting it to 1 or 2 uses only one or two cores.
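
As a sketch, this is how those parameters might be passed explicitly; the values below simply mirror the defaults described above, except for n_jobs (normalize is left out because it has been deprecated in recent scikit-learn releases):

# Explicitly passing the parameters discussed above
lm_explicit = linear_model.LinearRegression(
    fit_intercept=True,  # calculate the intercept (default)
    copy_X=True,         # work on a copy of the input data (default)
    n_jobs=-1            # use all available CPU cores
)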

In our case, we use default values. And then we fit the model.

lm = linear_model.LinearRegression()
model = lm.fit(X,y)
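
After fitting, a brief sketch of how you might inspect the model and generate predictions:

predictions = lm.predict(X)   # predicted MEDV for every row in X
print(lm.intercept_)          # fitted intercept
print(lm.coef_)               # one coefficient per input feature
print(lm.score(X, y))         # R squared on the training data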

Linear regression using Scipy

The steps involved in performing linear regression using SciPy are:

1- The first step is to import the stats module from the SciPy package.

2- The second step is to define our input variable and output variable.

3- Now we perform the linear regression using the linregress function, passing the input and output variables, i.e., X and Y.

4- Finally, we print the R squared value (the coefficient of determination) and plot the data along with the fitted line.

from scipy import stats
import matplotlib.pyplot as plt

# Use the average number of rooms (RM) as the input and MEDV as the output
xs = df["RM"]
ys = target["MEDV"]

slope, intercept, r_value, p_value, std_err = stats.linregress(xs, ys)

print('Slope = {} and Intercept = {}'.format(slope, intercept))
print('y = x({}) + {}'.format(slope, intercept))
print('R squared = {}'.format(r_value ** 2))

# Plot the data and the fitted line using the slope and intercept from scipy
plt.figure(figsize=(10, 6))
plt.title('$\\theta_0$ = {} , $\\theta_1$ = {}'.format(intercept, slope))
plt.scatter(xs, ys, marker='o', color='green')
plt.plot(xs, intercept + slope * xs, 'r')
plt.show()
Figure: the linear regression fit plotted using SciPy

Wrapping up this session

In this tutorial, we have learned about linear regression, one of the most basic machine learning algorithms, and the various ways to implement it.

Then we learned how to implement linear regression in 4 different ways:

  1. Linear regression in Python from scratch
  2. Linear regression in Python using statsmodels
  3. Linear regression in Python using scikit-learn
  4. Linear regression in Python using SciPy

We have also seen where linear regression is used in real-world problems.

Tell me in the comments which method you like the most. If you have any problems with the implementation, feel free to drop a comment and I will reply within 24 hours.
