
Explaining the concepts of dimensionality reduction

Dimensionality reduction is a technique used to reduce the number of input variables in a dataset. This eases the task of modeling: the fewer the input variables, the simpler the modeling becomes. The number of features or input variables present in the dataset is called the dimensionality of the dataset.

In this article, we are going to learn about dimensionality reduction, techniques used for dimensionality reduction, applications, advantages, and disadvantages of implementing dimensionality reduction. After reading this article, you will be familiar with the concepts of dimensionality reduction and will be able to decide when to apply and when not to apply dimensionality reduction.

Let us now begin with the tutorial.

Introduction  

A dataset may consist of a large number of input features, which complicates the task of predictive modeling. As this complexity grows, making predictions or visualizing the training dataset becomes difficult, so some of the less important input features need to be removed. The number of input features, columns, or variables present in the dataset is known as its “dimensionality,” and the process through which we remove these input features is called “dimensionality reduction.”

The dimensionality reduction technique is widely used in machine learning, as it helps data scientists obtain a predictive model that fits better and solves classification and regression problems with ease. In other words, dimensionality reduction is a way to obtain a better-fitting predictive model by converting a higher-dimensional dataset into a lower-dimensional one while keeping the information the dataset provides largely intact. With its help, the number of columns can be brought down to a manageable number, much like flattening a 3D sphere or cube into a 2D circle or square.

Of course, we can model our dataset even without applying dimensionality reduction techniques, by feeding the original dataset directly to the machine learning algorithms and letting everything proceed naturally. But the curse of dimensionality often makes dimensionality reduction a necessity.

Curse of Dimensionality

The curse of dimensionality is a phenomenon that occurs when we work with a dataset that lives in a high-dimensional space rather than a low-dimensional one. Handling such higher-dimensional datasets is difficult, and this difficulty is what the “curse of dimensionality” refers to. As the number of features in a machine learning model increases, the number of samples required to cover the feature space grows, and so does the likelihood of overfitting. The dimensionality reduction technique saves us from these problems, which is why applying it becomes practically mandatory.
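One facet of this curse is that distances between points become less and less informative as the number of dimensions grows. Below is a minimal sketch using randomly generated data to illustrate the effect; the exact numbers will vary from run to run.

import numpy as np

rng = np.random.default_rng(0)

# Compare how pairwise distances behave as the number of dimensions grows:
# in high dimensions the nearest and farthest points end up almost equally far away
for dims in (2, 10, 100, 1000):
    points = rng.random((500, dims))
    # distances from the first point to every other point
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"{dims:>4} dimensions: relative contrast between distances = {contrast:.2f}")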

What are the advantages and disadvantages of using the dimensionality reduction technique?

The dimensionality reduction technique is beneficial because decreasing the number of features or input variables decreases the complexity of the machine learning model. Thus, there are many advantages to applying dimensionality reduction techniques.

The advantages of implementing dimensionality reduction techniques are as follows:

  1. It reduces a high-dimensional dataset to a lower-dimensional one, so the space needed to store the dataset shrinks as the number of dimensions decreases.
  2. With the reduced number of input features and reduced dimensions of the dataset, the computation time required by the model also decreases.
  3. It also helps in reducing the time required to visualize the dataset.
  4. Dimensionality reduction takes care of multicollinearity as it removes redundant features present in the dataset.
  5. It helps in improving the accuracy and performance of the model.
  6. It eliminates redundant features and the noise present in the dataset.

Some of the disadvantages of implementing the dimensionality reduction method are:

  1. Since the dimensionality reduction method reduces the number of features, some information may be lost in the process.
  2. When the PCA dimensionality reduction technique is used, the number of principal components to retain is sometimes not known in advance.

Dimensionality reduction techniques

The dimensionality reduction techniques are broadly categorized into two categories:

  1. Feature selection: The feature selection method works by finding a subset of relevant input variables from the dataset and includes three strategies for doing so (a minimal filter-method sketch follows this list):
     • Filter method: The dataset is filtered, and a subset containing only the relevant features is kept.
     • Wrapper method: The features are fed to a machine learning model that evaluates performance, and features are added or removed depending on whether they improve the model’s accuracy.
     • Embedded method: It examines the various training iterations of the machine learning model and evaluates the importance of each feature of the dataset.
  2. Feature extraction: The feature extraction technique converts the data from a higher-dimensional space to a lower-dimensional space and is also called feature projection. The data transformation through feature extraction can be linear or non-linear, depending on the dataset.
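
As a quick illustration of the filter strategy, the sketch below scores each feature against the target and keeps only the best-scoring ones. It assumes a feature DataFrame X and a target y like the ones defined later in the PCA walkthrough, and the choice of k = 10 is an arbitrary assumption to tune.

from sklearn.feature_selection import SelectKBest, f_classif

# Filter method: score every feature against the target with an ANOVA F-test
# and keep the k highest-scoring features (k is a tunable assumption)
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Names of the columns that survived the filter
selected_cols = X.columns[selector.get_support()]
print(selected_cols)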

Let us now discuss some of the commonly used dimensionality reduction techniques that we can apply in real-life machine learning projects.

Principal Component Analysis (PCA):

It is a dimensionality reduction method that transforms a large set of variables into a smaller one while retaining most of the information. It is used to reduce a higher-dimensional dataset to a lower-dimensional one.

It is one of the leading dimensionality reduction techniques. It follows a statistical approach and orthogonally transforms the coordinates of the original dataset into a new set of coordinates known as principal components. The transformation is constructed so that the first principal component captures the maximum variance in the data. Principal component analysis can be explained in five steps.

The first step is standardization, which scales the continuous initial variables so that each of them contributes in equal proportion to the analysis. Mathematically, standardization subtracts the mean of each variable and divides by its standard deviation:

z = (x − mean) / standard deviation

The second step is the computation of the covariance matrix, where the covariance matrix is a P x P symmetric matrix with P as the number of dimensions.

The third step is to compute the eigenvectors and eigenvalues of the covariance matrix, which determines the principal components of the data.

The fourth step is to form a matrix of the selected eigenvectors, called the feature vector, by keeping the components with high eigenvalues and discarding those with low eigenvalues.

The last step is recasting the data along the principal component axes, which is done using the feature vector. This produces the final, reduced dataset through the multiplication below:

FinalDataset = FeatureVectorᵀ × StandardizedOriginalDatasetᵀ

Thus, we finally obtain the reduced dataset using the principal component analysis method.
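
To make these five steps concrete, here is a minimal NumPy sketch that mirrors them on a small randomly generated matrix; it is an illustration of the procedure rather than a replacement for a library implementation.

import numpy as np

# A small random data matrix standing in for any (n_samples, n_features) dataset
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))

# Step 1: standardize each feature to zero mean and unit variance
data_std = (data - data.mean(axis=0)) / data.std(axis=0)

# Step 2: compute the covariance matrix (P x P, where P is the number of features)
cov_matrix = np.cov(data_std, rowvar=False)

# Step 3: compute the eigenvalues and eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Step 4: keep the eigenvectors with the largest eigenvalues as the feature vector
k = 2
order = np.argsort(eigenvalues)[::-1]
feature_vector = eigenvectors[:, order[:k]]

# Step 5: recast the standardized data along the principal component axes
data_reduced = data_std @ feature_vector
print(data_reduced.shape)  # (100, 2)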

Principal component analysis in Python

The first step is to import all the required libraries. NumPy is used for numerical computations and array manipulation, Pandas is used for data manipulation and analysis, and Seaborn and Matplotlib are used for data visualization.

# numpy is used for numerical computations
import numpy as np

# used to load the CSV files
import pandas as pd

#used to plot visualization like corr plot, bar chart, etc.
import seaborn as sns
import matplotlib.pyplot as plt

The second step is to load the CSV file with “;” as the delimiter. After that, the head function displays the top five rows of the dataset.

df = pd.read_csv('Dyt-desktop.csv', sep=';')
df.head()

The third step is to encode all the categorical columns. The cat.codes attribute assigns a unique integer value to each category in the column, which is useful for applying machine learning algorithms that require numerical input values.

for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].astype('category')
        df[col] = df[col].cat.codes

The fourth step is to define the input and the target variables.

y = df['Dyslexia']
X = df.drop(labels='Dyslexia', axis=1)

The fifth step is to split the dataset into training and testing with the help of the train_test_split function. The size of testing data is 20% and the stratify parameter is used to deal with the data imbalance problem.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0,
                                                    stratify=df['Dyslexia'])

Next, feature scaling is performed on the numerical columns. The StandardScaler class from the sklearn.preprocessing module is used for this purpose.

It’s worth noting that the num_cols variable is used to select only the numerical columns from X, as scaling only the numerical columns is appropriate in this case.

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

# selecting the continuous column
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
num_cols = X.select_dtypes(include=numerics)

# scaling the continuous column
X_train[num_cols.columns] = sc.fit_transform(X_train[num_cols.columns])
X_test[num_cols.columns] = sc.transform(X_test[num_cols.columns])

Next Principal Component Analysis (PCA) is applied to the training and testing sets. PCA is a dimensionality reduction technique that is used to reduce the number of features in a dataset while maintaining the information that explains the most variance.

from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
explained_variance
array([0.14807762, 0.11132381])
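
The two components above explain only about 26% of the variance. A common way to decide how many components to keep (the PCA disadvantage noted earlier) is to look at the cumulative explained variance. Below is a minimal sketch; X_train_scaled is a hypothetical name for the scaled training features before the 2-component reduction above, and the 95% target is an assumption, not a fixed rule.

import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with all components on the scaled (not yet reduced) training data;
# X_train_scaled is assumed to hold those features
pca_full = PCA().fit(X_train_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components that retains at least 95% of the variance
n_keep = int(np.argmax(cumulative >= 0.95)) + 1
print(n_keep, cumulative[:5])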

The next step is to plot the principal component.

It creates a scatter plot of the first two principal components, colored by the class label. It uses the y_train data to identify which points belong to which class. The red points belong to class 1 and the blue points belong to class 2.

# plotting the principal component

import matplotlib.pyplot as plt

# Create a scatter plot of the first two principal components, colored by class label
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], color='red', label='Class 1')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], color='blue', label='Class 2')

# Add labels to the axes
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')

# Add a legend
plt.legend()

# Show the plot
plt.show()

Linear Discriminant Analysis (LDA):

Linear discriminant analysis, also called normal discriminant analysis or discriminant function analysis, is a commonly used dimensionality reduction technique for supervised classification problems. It is used for separating two or more classes and projects the features of a higher-dimensional dataset into a lower-dimensional space. LDA uses both the X and Y axes to create a new axis and then projects the data onto this new axis so as to maximize the separation between the categories.

To create the new axis, two criteria need to be fulfilled: minimizing the variation within each class and maximizing the distance between the class means. A 2D dataset is thus mapped onto a 1D dataset, reducing its dimensionality. LDA has three related extensions: quadratic discriminant analysis, flexible discriminant analysis, and regularized discriminant analysis. LDA is used in computer vision for face recognition, in medicine to classify patient diseases, and for customer identification purposes.

Linear discriminant analysis in Python

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Create an LDA object; with a binary target such as Dyslexia in this example,
# LDA can produce at most n_classes - 1 = 1 discriminant component
lda = LinearDiscriminantAnalysis(n_components=1)

# Fit the LDA model on the training data
lda.fit(X_train, y_train)

# Transform the training and test data using the fitted LDA model
X_train_lda = lda.transform(X_train)
X_test_lda = lda.transform(X_test)

Non-negative matrix factorization (NMF):

It is a dimensionality reduction method that is widely used for various NLP tasks and for feature extraction in facial recognition. NMF factorizes a matrix with m x n dimensions into two matrices having dimensions m x k and k x n, respectively. The goal of NMF is to find two matrices by setting a lower dimension as k and having only non-negative elements in both of these matrices.

Through the implementation of NMF, we obtain factorized matrices that have lower dimensions than the original matrix. NMF assumes that the original data is composed of hidden (latent) features, which are represented by the weights in the two newly formed matrices.

Non-negative matrix factorization in Python

from sklearn.decomposition import NMF

# Create an NMF object
nmf = NMF(n_components=2, init='random', random_state=0)

# Fit the NMF model on the data matrix
W = nmf.fit_transform(X)
H = nmf.components_

# Reconstruct the original matrix
X_reconstructed = W.dot(H)

Generalized discriminant analysis (GDA):

Generalized discriminant analysis is a dimensionality reduction technique that performs non-linear discriminant analysis using the kernel function operator. The GDA method works by mapping the input vectors into a high-dimensional feature space, relying on underlying theory similar to that of the support vector machine (SVM). The objective of GDA is similar to that of LDA: it finds a projection of the features into a lower-dimensional space by maximizing the ratio of between-class scatter to within-class scatter. The main idea behind GDA is to map the input space into a convenient feature space in which the variables are non-linearly related to the input space.

Generalized discriminant analysis in Python

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# scikit-learn has no dedicated GDA class; as a stand-in, this uses
# regularized (shrinkage) LDA, which is linear rather than kernel-based
gda = LinearDiscriminantAnalysis(solver='eigen', shrinkage='auto')

# Fit the model on the training data
gda.fit(X_train, y_train)

# Predict the class labels of the test data
y_pred = gda.predict(X_test)
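
Since scikit-learn does not ship a dedicated GDA implementation, one way to approximate the kernel idea described above is to map the data into a non-linear feature space with KernelPCA and then run LDA in that space. This is only a rough sketch of the concept, not an official GDA implementation; the kernel choice, gamma, and number of components are assumptions to be tuned.

from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

# Map the inputs into a non-linear (RBF kernel) feature space, then discriminate
# linearly in that space; gamma and n_components are assumed values to tune
kernel_gda = make_pipeline(
    KernelPCA(n_components=10, kernel='rbf', gamma=0.1),
    LinearDiscriminantAnalysis()
)

# Fit on the training data and predict on the test data
kernel_gda.fit(X_train, y_train)
y_pred_kernel = kernel_gda.predict(X_test)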

Missing values ratio:

The missing values ratio is a dimensionality reduction technique that works by calculating the proportion of missing values in each variable of the dataset. The ratio of missing values can be computed using the formula below:

Missing value ratio = (number of missing values in a column / total number of observations) × 100

Once this ratio is calculated, we need to decide on a threshold that will be used to drop all the variables whose missing value ratio is above it. The threshold varies from problem to problem, and there is no hard and fast rule for choosing it.

Once the variables with missing value ratios above the threshold are dropped, we still need to deal with the remaining variables, which is done by trying to find out the reason for the missing values.
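
Below is a minimal pandas sketch of this procedure on the dataframe df loaded earlier; the 20% threshold is an assumption and should be chosen per problem.

# Ratio of missing values per column, expressed as a percentage
missing_ratio = df.isnull().sum() / len(df) * 100

# Keep only the columns whose missing-value ratio is below the threshold
threshold = 20
cols_to_keep = missing_ratio[missing_ratio < threshold].index
df_reduced = df[cols_to_keep]

# Inspect the columns with the most missing values
print(missing_ratio.sort_values(ascending=False).head())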

Application of dimensionality reduction

Dimensionality reduction is one of the most popular techniques in the field of machine learning, as it has many applications in solving real-world problems. The dimensionality of real-world data is often very high, and therefore dimensionality reduction is applied to decrease the complexity and increase the accuracy of the models. Some of the applications of dimensionality reduction methods are discussed here:

  1. Dimensionality reduction can be used for compressing neural network architectures, which can be achieved using autoencoders.
  2. Dimensionality reduction can be implemented to improve the accuracy of models with the help of noise removal techniques.
  3. It transforms non-linear data into a linearly separable form using the dimensionality reduction method called Kernel PCA.
  4. The dimensionality reduction method reduces the complexity of the models. It also reduces the time required by the models for training and saves a lot of computational resources when the models are being trained.
  5. The dimensionality reduction technique can be used for the compression of images and is used in factor analysis through the principal component analysis (PCA) method.
  6. The dimensionality reduction method can be implemented to mitigate the problems of overfitting and automatically removes multicollinearity, which negatively affects the performance of classification and regression models.
  7. It is difficult to visualize the high-dimensional dataset, but with the help of dimensionality reduction techniques, visualization of high-dimensional data can be performed by lowering the dimensions of the dataset.

Conclusion

The dimensionality reduction technique finds many applications in machine learning, as real-world datasets are often too complex to model without it. There are many dimensionality reduction methods, and we have discussed a few of the more popular ones. In this article, we covered the dimensionality reduction technique, its advantages and disadvantages, the techniques through which it can be implemented, and finally, its applications. This article provides a detailed overview of dimensionality reduction through which a beginner can advance their knowledge and implement dimensionality reduction techniques in real life.
