Blending is an ensemble machine learning technique that uses a machine learning model to learn how to best combine the predictions from multiple contributing ensemble member models.
It is applicable to both regression and classification problems.
Steps to implement Blending classifier in Python
Step 1- Define your base models
Step 2- Define your blending model
Step 3- Train the base models on the training data
Step 4- Get predictions from the base models
Step 5- Create an empty list and append each model's predictions to it
Step 6- Convert the list into a NumPy array and transpose it
Step 7- Train the blending model on the transposed array
Step 8- Make the final predictions and compute your evaluation metrics
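The steps above can be sketched end to end. The snippet below is a minimal illustration on synthetic data (not the heart-disease dataset used later), with two arbitrary base models chosen for brevity:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic binary classification data (stand-in for a real dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 1: base models / Step 2: blending (meta) model
base_models = [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=42)]
blender = LogisticRegression()

# Steps 3-5: train each base model and collect its predicted probabilities
preds = []
for model in base_models:
    model.fit(X_train, y_train)
    preds.append(model.predict_proba(X_test)[:, 1])

# Step 6: convert to a NumPy array and transpose -> shape (n_samples, n_models)
meta_features = np.array(preds).T

# Step 7: train the blender on the stacked base predictions
blender.fit(meta_features, y_test)

# Step 8: final predictions and evaluation
final = blender.predict(meta_features)
print("Accuracy:", accuracy_score(y_test, final))
```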
Dataset description
The dataset contains information about heart patients and whether or not they developed coronary heart disease within 10 years, making it a binary classification problem. You can download the dataset from Kaggle.
| Column Name | Description | Type |
|---|---|---|
| male | Gender of the individual | Binary (int) |
| age | Age of the individual (in years) | Numeric (int) |
| education | Education level of the individual | Categorical (int) |
| currentSmoker | Whether the individual currently smokes | Binary (int) |
| cigsPerDay | Average number of cigarettes smoked per day | Numeric (float) |
| BPMeds | Whether the individual is on blood pressure medication | Binary (int) |
| prevalentStroke | Whether the individual has had a stroke | Binary (int) |
| prevalentHyp | Whether the individual has hypertension | Binary (int) |
| diabetes | Whether the individual has diabetes | Binary (int) |
| totChol | Total cholesterol level (mg/dL) | Numeric (float) |
| sysBP | Systolic blood pressure (mmHg) | Numeric (float) |
| diaBP | Diastolic blood pressure (mmHg) | Numeric (float) |
| BMI | Body Mass Index (kg/m²) | Numeric (float) |
| heartRate | Heart rate (beats per minute) | Numeric (float) |
| glucose | Glucose level (mg/dL) | Numeric (float) |
| TenYearCHD | Whether the individual developed CHD within 10 years | Binary (int) |
Implementation of Blending Classifier in Python
The first step is to import all the necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
Here,
- the pandas library is used for data loading and preprocessing,
- NumPy is used for numerical operations, and
- scikit-learn is used for data splitting, the machine learning models, and evaluation.
The second step is to read the Framingham dataset using the pandas library.
df = pd.read_csv("framingham.csv")  # TenYearCHD: ten-year coronary heart disease label
df.head()

Looking at the data, the following insights emerge:
- All columns are already numeric (no label encoding required).
- There are no spelling errors in the column names.
The next step is to handle missing values by dropping the rows that contain them, using the dropna function.
df=df.dropna()
The next step is to analyze the target column using the value_counts function.
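Assuming the target column is named TenYearCHD as in the table above, the class distribution can be inspected like this (the small frame here is a purely illustrative stand-in; the real df comes from framingham.csv):

```python
import pandas as pd

# Illustrative stand-in for the heart-disease frame with made-up counts
df = pd.DataFrame({'TenYearCHD': [0] * 8 + [1] * 2})

# Absolute counts per class
print(df['TenYearCHD'].value_counts())
# Relative frequencies make the imbalance easier to read
print(df['TenYearCHD'].value_counts(normalize=True))
```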

Here you can see there is a clear class-imbalance problem, which will be resolved with an undersampling technique.

# Separate majority and minority classes
df_majority = df[df.TenYearCHD == 0]
df_minority = df[df.TenYearCHD == 1]
# Randomly undersample the majority class
df_majority_undersampled = df_majority.sample(n=len(df_minority), random_state=42)
# Combine minority and undersampled majority classes
df_balanced = pd.concat([df_majority_undersampled, df_minority])
# Shuffle the dataset
df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)
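After undersampling, it is worth confirming that both classes now have equal counts. The sketch below reuses the same undersampling code on a toy frame standing in for df:

```python
import pandas as pd

# Toy stand-in for the heart-disease frame (6 negatives, 2 positives)
df = pd.DataFrame({'TenYearCHD': [0] * 6 + [1] * 2, 'age': range(8)})

df_majority = df[df.TenYearCHD == 0]
df_minority = df[df.TenYearCHD == 1]

# Randomly undersample the majority class down to the minority size
df_majority_undersampled = df_majority.sample(n=len(df_minority), random_state=42)
df_balanced = pd.concat([df_majority_undersampled, df_minority])
df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

# Both classes should now appear the same number of times
print(df_balanced['TenYearCHD'].value_counts())
```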
The next step is to define the input features and the target column. Here TenYearCHD is the target column.
features = df_balanced.loc[:, df_balanced.columns != 'TenYearCHD']
target = df_balanced['TenYearCHD']
The next step is to split the dataset into training and testing sets in an 80/20 ratio. The random state is set to 42 for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
The next step is to define the base models: logistic regression, a random forest classifier, and a support vector machine.
base_models = [
    ('lr', LogisticRegression()),
    ('rf', RandomForestClassifier()),
    ('svm', SVC(probability=True))
]
The next step is to define the blending model which is Logistic Regression.
# Initialize blending model
blending_model = LogisticRegression()
The next step is to train the base models on the training data.
# Train base models
for name, model in base_models:
    model.fit(X_train, y_train)
The next step is to make predictions with the base models.
# Make predictions with base models
base_predictions = []
for name, model in base_models:
    base_predictions.append(model.predict_proba(X_test)[:, 1])
Now reshape the base predictions (i.e., transpose them so each row corresponds to a sample and each column to a model), then train the blending model on them.
# Reshape predictions for blending
base_predictions = np.array(base_predictions).T
# Train blending model
blending_model.fit(base_predictions, y_test)
Now you will make the final prediction using the blending model.
# Make final predictions using blending
blending_predictions = blending_model.predict(base_predictions)
After that, evaluate the performance of the blending model using the accuracy score and the classification report.
# Evaluate blending model
blending_accuracy = accuracy_score(y_test, blending_predictions)
print("Blending Accuracy:", blending_accuracy)
# Blending Accuracy: 0.6547085201793722
print("Classification Report:")
print(classification_report(y_test, blending_predictions))
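One caveat worth noting: above, the blending model is both trained and evaluated on predictions for X_test, which leaks the test labels into the blender and inflates the reported accuracy. A more rigorous variant holds out a separate validation set for fitting the blender. The sketch below shows this on synthetic stand-in data (the real features and target come from framingham.csv):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data
X, y = make_classification(n_samples=600, n_features=10, random_state=42)

# 60% train / 20% validation / 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

base_models = [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=42)]
for m in base_models:
    m.fit(X_train, y_train)

# The blender is fit on validation-set predictions only...
val_meta = np.array([m.predict_proba(X_val)[:, 1] for m in base_models]).T
blender = LogisticRegression().fit(val_meta, y_val)

# ...and evaluated on test-set predictions it has never seen
test_meta = np.array([m.predict_proba(X_test)[:, 1] for m in base_models]).T
print("Leak-free blending accuracy:", accuracy_score(y_test, blender.predict(test_meta)))
```

This three-way split is the standard way to keep the blender's training signal separate from the final evaluation.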

Conclusion
In conclusion, blending combines predictions from multiple machine learning models to improve accuracy and robustness, reducing individual models' biases and variance and achieving better generalization on unseen data.
If you have any problems with the implementation, feel free to drop a comment and I will reply within 24 hours.
If you like the article and would like to support me, make sure to: