Chapter 3: Logistic Regression in PyTorch, Step by Step

Classification is one of the most widely used families of algorithms in machine learning. It is often said that more than 70% of the problems in data science are classification problems.

There are two categories of classification:

Binary classification (two-class problem)
Multinomial classification (more than two classes are present in the target variable)

Logistic regression is the most basic algorithm for solving two-class classification problems. Common examples include churn prediction, spam detection, and many more. Earlier in this 42-day PyTorch series, we covered how to perform linear regression in PyTorch. In this post, we will learn how to implement logistic regression in PyTorch.

Definition of Logistic regression

Logistic Regression is a supervised algorithm in machine learning that is used to predict the probability of a categorical response variable. In logistic regression, the predicted variable is a binary variable that contains data encoded as 1 (True) or 0 (False). In other words, the logistic regression model predicts P(Y=1) as a function of X.

Properties of Logistic Regression:

The predicted variable in logistic regression follows a Bernoulli distribution.
Logistic regression uses the maximum likelihood method for parameter estimation.
R-squared does not need to be computed; model fit is assessed through measures such as concordance and the KS statistic.
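The first two properties can be illustrated with a small sketch. Under a Bernoulli model, maximum likelihood estimation amounts to minimizing the average negative log-likelihood of the observed 0/1 labels. The function name bernoulli_nll and the label and prediction lists below are made up for illustration.

```python
import math

def bernoulli_nll(y_true, p_pred):
    """Average negative log-likelihood of 0/1 labels given predicted P(Y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Likelihood of one label is p if y == 1 and (1 - p) if y == 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 1]
good = [0.9, 0.2, 0.8, 0.7]  # hypothetical predictions close to the labels
bad = [0.4, 0.6, 0.5, 0.3]   # hypothetical predictions far from the labels

print("good predictions:", bernoulli_nll(labels, good))
print("bad predictions: ", bernoulli_nll(labels, bad))
```

Predictions that agree with the labels give a lower value, which is why fitting the model means driving this quantity down.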

Linear Regression Vs. Logistic Regression

How logistic regression works

Logistic regression is based on the logistic function, 1/(1+e^(-x)), which outputs a value between 0 and 1. Plotted on a graph, it traces an S-shaped curve.

Whatever the input, the result of the logistic function always lies between 0 and 1, so it can be read as a probability.
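As a quick numerical check of that S-shape, here is a minimal sketch of the logistic function in plain Python. The helper name sigmoid and the sample inputs are chosen for illustration only.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + e^(-x)); fine for moderately sized inputs."""
    return 1.0 / (1.0 + math.exp(-x))

# Evaluate at a few points: values approach 0 on the left and 1 on the right
for x in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"sigmoid({x:+.1f}) = {sigmoid(x):.4f}")
```

The value at 0 is exactly 0.5, and the outputs are symmetric around it: sigmoid(-x) equals 1 - sigmoid(x).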

Let’s take an example to make this concrete.

Suppose that the x-axis denotes the number of goals scored by Lionel Messi and the y-axis denotes the probability of Barcelona winning the match. Let’s also assume that the x-axis values range from 0 to 50. According to the S-curve, there is a greater probability of Barcelona winning the match if Lionel Messi scores more than 3 goals. Similarly, there is a greater probability of Barcelona losing the match if Lionel Messi scores fewer than 2 goals.

Types of Logistic Regression

Binary Logistic Regression: The predicted variable has only two possible outcomes such as Cat or Dog, Positive or Negative.
Multinomial Logistic Regression: The predicted variable has three or more nominal categories, such as the breed of a dog.
Ordinal Logistic Regression: The predicted variable has three or more ordinal categories, such as education level (“high school”, “graduation”, “post-graduation”, “Ph.D.”).

Where to use logistic regression

We have a binary or dichotomous target variable.
We have predictor X-variables that we think are related to the Y-variable.

Use Cases of logistic regression

Financial sector

In the financial industry, this algorithm is used to predict loan defaults and to support credit scoring, loan distribution, and many more tasks. Many large companies, such as Morgan Stanley, use these methods.

Medical sector

In the medical industry, this algorithm is used to predict if a patient has diabetes or not. There are many other applications like breast cancer prediction, tumor prediction, and many more.

Telecommunication sector

In this sector, the algorithm is used to predict customer churn, so that providers can offer better plans to at-risk customers and keep them from leaving.

Network security

In network security, logistic regression is used to predict whether a network packet has been delivered successfully.

Implementation of logistic regression in PyTorch

The dataset comes from the UCI Machine Learning Repository, and it is related to economics. The classification goal is to predict whether a person's income is at most 50K or above 50K (<=50K or >50K). You can download the dataset from the repository.

Importing required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

from sklearn.model_selection import train_test_split

The dataset provides information about individuals. It includes 48,842 records and 15 columns.
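The code in the following sections expects the dataset in a pandas DataFrame named data. A minimal loading sketch is shown here, assuming the file has been saved locally as a CSV with a header row. The helper name load_income_data and the file name adult.csv are placeholders for illustration, not names given in this tutorial.

```python
import pandas as pd

def load_income_data(csv_path: str) -> pd.DataFrame:
    """Read the income dataset from a local CSV file that has a header row."""
    # skipinitialspace strips blanks that follow the comma separators
    return pd.read_csv(csv_path, skipinitialspace=True)

# Hypothetical usage; replace the placeholder path with your own download location:
# data = load_income_data("adult.csv")
# print(data.shape)
```

If the load worked as expected, data.shape should report the record and column counts quoted above.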

Input variables

Variable	Type	Description
age	Numeric	Age of the individual
workclass	Categorical	Type of workclass: “Private”, “Self-emp-not-inc”, “Local-gov”, “?”, “State-gov”, “Self-emp-inc”, “Federal-gov”, “Without-pay”, “Never-worked”
fnlwgt	Numeric	Final weight of the individual in sampling
education	Categorical	Educational level: “HS-grad”, “Some-college”, “Bachelors”, “Masters”, “Assoc-voc”, “11th”, “Assoc-acdm”, “10th”, “7th-8th”, “Prof-school”, “9th”, “12th”, “Doctorate”, “5th-6th”, “1st-4th”, “Preschool”
educational-num	Numeric	Numeric representation of education
marital-status	Categorical	Marital status: “Married-civ-spouse”, “Never-married”, “Divorced”, “Separated”, “Widowed”, “Married-spouse-absent”, “Married-AF-spouse”
occupation	Categorical	Occupation type: “Prof-specialty”, “Craft-repair”, “Exec-managerial”, “Adm-clerical”, “Sales”, “Other-service”, “Machine-op-inspct”, “?”, “Transport-moving”, “Handlers-cleaners”, “Farming-fishing”, “Tech-support”, “Protective-serv”, “Priv-house-serv”, “Armed-Forces”
relationship	Categorical	Relationship type: “Husband”, “Not-in-family”, “Own-child”, “Unmarried”, “Wife”, “Other-relative”
race	Categorical	Race of the individual: “White”, “Black”, “Asian-Pac-Islander”, “Amer-Indian-Eskimo”, “Other”
gender	Categorical	Gender: “Male”, “Female”
capital-gain	Numeric	Capital gains
capital-loss	Numeric	Capital losses
hours-per-week	Numeric	Number of hours worked per week
native-country	Categorical	Native country: “United-States”, “Mexico”, “?”, “Philippines”, “Germany”, “Puerto-Rico”, “Canada”, “El-Salvador”, “India”, “Cuba”, “England”, “China”, “South”, “Jamaica”, “Italy”, “Dominican-Republic”, “Japan”, “Guatemala”, “Poland”, “Vietnam”, “Columbia”, “Haiti”, “Portugal”, “Taiwan”, “Iran”, “Greece”, “Nicaragua”, “Peru”, “Ecuador”, “France”, “Ireland”, “Hong”, “Thailand”, “Cambodia”, “Trinadad&Tobago”, “Outlying-US(Guam-USVI-etc)”, “Laos”, “Yugoslavia”, “Scotland”, “Honduras”, “Hungary”, “Holand-Netherlands”
income	Categorical	Income level: “<=50K”, “>50K”

Data exploration

Now we define a function named plotPerColumnDistribution, which plots the distribution of each column that has a manageable number of distinct values.

# Distribution graphs (histogram/bar graph) of column data
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    # For displaying purposes, keep only columns with more than 1 and fewer than 50 unique values
    df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]]
    nRow, nCol = df.shape
    columnNames = list(df)
    # Ceiling division with //, because plt.subplot needs an integer row count
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow
    plt.figure(num = None, figsize = (6 * nGraphPerRow, 8 * nGraphRow), dpi = 80, facecolor = 'w', edgecolor = 'k')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        # Bar chart of value counts for non-numeric columns, histogram otherwise
        if (not np.issubdtype(type(columnDf.iloc[0]), np.number)):
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            columnDf.hist()
        plt.ylabel('counts')
        plt.xticks(rotation = 90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad = 1.0, w_pad = 1.0, h_pad = 1.0)
    plt.show()

# Call the function at module level: up to 10 columns, 3 graphs per row
plotPerColumnDistribution(data, 10, 3)

Data Cleaning

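First, we check for missing values in the columns. Note that isnull() counts only true nulls; missing entries in this dataset are coded as “?” (see the table above), so they are not counted.
Next, we encode all the categorical columns in the dataset as integer codes.
After that, we define the features and the target variable, and split the dataset into training and test sets.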

# isnull() reports 0 missing values in the dataset (entries coded as "?" are not nulls)
print(data.isnull().sum())

# Encode all the categorical columns as integer category codes
for col_name in data.columns:
    if data[col_name].dtype == 'object':
        data[col_name] = data[col_name].astype('category').cat.codes

# Define features and target variables in the dataset
features = data.loc[:, data.columns != 'income']
target = data['income']

# Number of feature columns; 'income' is the last column of the frame
nc = len(data.columns) - 1

# Hold out 20% of the rows for testing
traindf, testdf = train_test_split(data, test_size=0.2)
x_data = torch.tensor(traindf.iloc[:, 0:nc].values, dtype=torch.float32)   # training inputs
y_data = torch.tensor(traindf.iloc[:, nc:].values, dtype=torch.float32)    # training labels, shape (n, 1)
xt_data = torch.tensor(testdf.iloc[:, 0:nc].values, dtype=torch.float32)   # test inputs
yt_data = torch.tensor(testdf.iloc[:, nc:].values, dtype=torch.float32)    # test labels, shape (n, 1)
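To see what the encoding step does to the values, here is a small sketch on a made-up two-column frame. The frame toy and its rows are invented for illustration; only the column names and category labels are borrowed from the table above.

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})

# Convert each column to a categorical and replace labels with integer codes.
# Codes follow the sorted order of the distinct labels in each column.
encoded = toy.apply(lambda col: col.astype("category").cat.codes)

print(encoded)
```

Because the codes follow sorted label order, “<=50K” becomes 0 and “>50K” becomes 1, which is what lets the income column serve directly as a binary target.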

Implement logistic regression

Next, we define the model class, which implements logistic regression.

class Model(torch.nn.Module):

    def __init__(self, input_size, num_classes):
        super(Model, self).__init__()
        # Single linear layer: logistic regression has no hidden layer
        self.linear = torch.nn.Linear(input_size, num_classes)

    def forward(self, x):
        # Sigmoid maps the linear output to a probability between 0 and 1
        y_pred = torch.sigmoid(self.linear(x))
        return y_pred

model = Model(14, 1)  # 14 input features, 1 output probability
criterion = torch.nn.BCELoss(reduction='mean')  # binary cross-entropy averaged over the batch
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

In the constructor we instantiate a single nn.Linear module. In the forward function, we accept a tensor of input data and must return a tensor of output data; here that is the sigmoid of the linear layer's output. We can use modules defined in the constructor as well as arbitrary operators on tensors.

After defining the class, we instantiate the model and construct our loss function and an optimizer.
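To make the structure of the model concrete, here is a small sketch that builds the same kind of layer outside the class and inspects it. The batch of five random rows is made up for illustration; only the sizes 14 and 1 come from the code above.

```python
import torch

torch.manual_seed(0)  # make the random example repeatable

# Same shape as the layer inside Model: 14 inputs, 1 output
linear = torch.nn.Linear(14, 1)

# One weight per input feature plus one bias term
n_params = sum(p.numel() for p in linear.parameters())

# A made-up batch of 5 rows with 14 features each
x = torch.randn(5, 14)
probs = torch.sigmoid(linear(x))

print(n_params)      # 15
print(probs.shape)   # torch.Size([5, 1])
```

Each row of the batch yields one probability, which is the shape BCELoss expects when the labels are stored as a column of 0s and 1s.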

Training and prediction

Let us start the training loop

# Training loop (every epoch uses the full training set)
for epoch in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x_data)
    # Compute and print loss
    loss = criterion(y_pred, y_data)
    print(f'epoch {epoch}, loss {loss.item()}')

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

#### Test the Model  ####
with torch.no_grad():  # gradients are not needed for evaluation
    probs = model(xt_data)

# Cut-off used here: 1.25 times the mean predicted probability
# (a heuristic; 0.5 is the more common default threshold)
meanz = 1.25 * probs.mean()
predicted = (probs > meanz).float()

total = yt_data.size(0)
correct = (predicted == yt_data).sum().item()
# Rows correctly predicted as class 0 (<=50K)
TPcorrect = ((predicted == yt_data) & (predicted == 0)).sum().item()

print('Accuracy of the model  %d %%' % (100 * correct // total))

We run 500 iterations. In each one we compute the predicted y by passing x to the model, calculate the loss, zero the gradients, perform a backward pass, and update the weights.
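To show what the backward pass and the weight update amount to for plain SGD, here is a sketch of a single update done by hand on a tiny made-up dataset. The tensors x, y, and w, the helper bce_loss, and the learning rate of 0.1 are illustrative choices, not values from the tutorial; the bias term is left out to keep it short.

```python
import torch
import torch.nn.functional as F

# Two made-up rows with two features each, and their 0/1 labels
x = torch.tensor([[1.0, 2.0], [3.0, -1.0]])
y = torch.tensor([[1.0], [0.0]])
w = torch.zeros(2, 1, requires_grad=True)  # weights start at zero
lr = 0.1

def bce_loss(weights):
    """Binary cross-entropy of sigmoid(x @ weights) against y."""
    return F.binary_cross_entropy(torch.sigmoid(x @ weights), y)

loss_before = bce_loss(w)
loss_before.backward()        # backward pass: fills w.grad

with torch.no_grad():
    w -= lr * w.grad          # the update that optimizer.step() performs for plain SGD
w.grad.zero_()                # clear the gradient, the role of optimizer.zero_grad()

loss_after = bce_loss(w).item()
print(loss_before.item(), loss_after)
```

With zero weights every prediction is 0.5, so the starting loss is ln 2; after one step the loss is lower, which is the behaviour the training loop repeats 500 times.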

Finally, we test the model on the held-out data. The accuracy we get is 76%; the exact figure can vary between runs because the train/test split and the initial weights are random.
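An accuracy figure is easier to judge next to a baseline. The sketch below compares a set of predictions with the simplest possible rule, always predicting the most frequent class. The label tensors and both helper functions are made up for illustration and are not part of the tutorial's code.

```python
import torch

def accuracy(predicted: torch.Tensor, actual: torch.Tensor) -> float:
    """Fraction of positions where predicted and actual labels agree."""
    return (predicted == actual).float().mean().item()

def majority_baseline_accuracy(actual: torch.Tensor) -> float:
    """Accuracy of always predicting the most frequent class in actual."""
    majority = 1.0 if actual.mean().item() >= 0.5 else 0.0
    return accuracy(torch.full_like(actual, majority), actual)

# Made-up labels: 6 of the 8 rows belong to class 0
y_true = torch.tensor([[0.], [0.], [0.], [1.], [0.], [1.], [0.], [0.]])
y_hat = torch.tensor([[0.], [0.], [1.], [1.], [0.], [0.], [0.], [0.]])

print(accuracy(y_hat, y_true))
print(majority_baseline_accuracy(y_true))
```

In this toy case both numbers are 0.75, so the predictions are no better than the baseline. Running the same comparison on yt_data shows how much the trained model adds on an imbalanced target.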

Advantages and disadvantages of logistic regression

Advantages

A logistic regression model is fast to train.
It works well on simple datasets.
It can be extended to predict multiple classes, so it can handle polytomous data (more than two distinct categories).
It does not depend on the assumptions of ordinary least squares.

Disadvantages

Because the model is non-linear in the probability, the effect of a predictor on the predicted probability is not constant.
Sometimes it fails to capture complex relationships between variables.
It requires large datasets for stable results.
It runs into problems when the groups have little overlap (near-perfect separation), because the parameter estimates can become unstable.

Wrap up the Session

Finally, we have made it to the end of the tutorial. You should now know:

what logistic regression is,
the properties of logistic regression,
the differences between linear regression and logistic regression,
the types of logistic regression,
where to use logistic regression,
the assumptions of logistic regression,
use cases of logistic regression,
how to implement logistic regression in PyTorch.

You’ve come a long way in understanding one of the most important areas of machine learning! If you have questions or comments, then please put them in the comments section below.
