You are currently viewing K-Means Clustering in Python using Scikit-learn

K-Means Clustering in Python using Scikit-learn

Introduction

Suppose that you are working in the Walmart store and your task is to make a cluster of similar items. And you end up making four clusters of items like food items, beauty products, electronics items, and toys. So the definition of clustering is

“When you bunch similar items together that is called clustering”.

Its main aim is to keep similar points in a group and dissimilar points in different groups. Clustering can be broadly classified into two subgroups:

1- Hard clustering: In this type, each data point either belongs to a cluster completely or not

2- Soft clustering: In these data points can belong to more than one cluster with some likelihood value.

For making this cluster we use machine learning algorithms like k-means clustering and hierarchical clusters. So how do we identify which algorithms to use for making clusters?

Types of clustering algorithm are:

1- centroid based clustering like our K-means

It organizes the data into non-hierarchical clusters like the clustering of large files. These types of methods requires a parameter k (which is no of cluster) which is defined by the user.

2- connectivity-based clustering like hierarchy

It forms tree-like architecture and making Children of every node. Here the “electronic items” is the root node, it has two Childs which is mobiles and laptops and the last node is called as leaf node.

Hierarchical clustering
Connectivity based clustering

3- density-based clustering like DBSCAN

In this type of clustering, we make clusters based on the dense clusters of points. The main benefits of using this it does not require the no of clusters from users instead of that there is distance parameter that he have to tune. Another benefits is that it is resilient to anomalies.

In this tutorial we are going to learn about k-means clustering and its implementation using python.

What is K-means Clustering

K-means clustering is an unsupervised machine learning algorithm that attempts to minimize the distance of the points in a cluster with their centroid.

Note: Clusters are represented by a central vector.

It is a centroid based algorithm which means we calculate the minimum sum of distances between the points and the cluster centroid. It is also used to solve optimization problems.

Algorithm for k-means clustering

The step of k-means clustering algorithm are given as follows:-

Step 1- Pick a random no k where k is no the cluster that we have to make.

Step 2- Now we have to randomly initialize k centroids. Whereas centroids addresses the center of a cluster.

Step 3- Allocate each data points to their nearest centroid, which will frame the predefined K cluster.

Step 4- Now compute the variance and spot another centroid of each cluster

Step 5- Repeat steps 3 and 4 if any reassignment occurs else go to Step 6.

Step 6- Finish. Our model is ready

Repeat first and second steps until we get the same mean.

How to select the Right Number of Clusters in K-Means Clustering?

Truly, we can have quite a few clusters in a dataset. One things that we can do is to check the no of observations in the dataset. It will give you an idea how many clusters that you want to use. But to determine the optimum numbers of clusters there are two possible ways which are given down below

  1. By using the elbow curve
  2. Silhouette coefficients

To get the elbow curve we have to plot no the clusters on the x-axis and Sum of squared error (SSE) on the y-axis. After a certain time the SSE starts to bend. This points that define the optimal no of clusters is called as “elbow point”, and can be used as a measure to pick the best k value. We will see the implementation in the coding section.

In the silhouette coefficients it measures how well an data point fits into its allocated cluster based on how closely or far away data points in to further data points in the cluster. On the x-axis we plot the no of clusters and on the y-axis we plot the silhouette coefficients. The average silhouette score give us the value of k.

How to solve k-means mathematically

Let us implement this procedure mathematically

Problem Statements: Given a random set of no= {2,3,4,10,11,12,20,25,30 } and we have to make 2 possible cluster using k means

Solution– We have to pick 2 mean value because k=2 from the given set of random no. Let us say we pick m1 = 4 and m2 = 12

Step 2: Now we cluster all the points which are nearest to that value. For the value 4 the nearest points are {2,3,4} and for value 12 the nearest points are {10,11,12,20,25,30}

K1={2,3,4} & K2={10,11,12,20,25,30}

Step 3: Now from K1 and K2 we select the mean of those values which is m1=3 and m2= 18 (108/6)

Step 4: Now we repeat steps 2 and 3 and we update the value of k1={2,3,4,10} and k2={11,12,20,25,30} and then we again calculate the mean of those updated values

m1= 5 and m2= 20 (roundoff value)

k1={2,3,4,10,11,12} and k2={20,25,30}

m1= 7 and m2= 25

k1={2,3,4,10,11,12} and k2={20,25,30} and their mean is m1=7 and m2=25

Now we are getting the same mean so we have to stop repeating. The 2 clusters are k1={2,3,4,10,11,12} and k2={20,25,30}.

Real-world applications of K-Means

There are a lot of applications for K-Means. Some of them are

1- Optical character recognition: OCR is a technology that is used to extract text from images.

2- Cell nuclei detection: It is the most basic step for analyzing the cells. We use the K-means algorithm for the segmentation of cells.

3- Anomaly detection: Anomaly is equal to points that are far from the cluster. So in that case we use the k-means algorithm.

4- Customer Segmentation: Segmenting the customer based on their sales pattern.

5- Recommendation engine: It is also used in google search for recommending the most similar keywords.

Advantages and disadvantages of using k-means clustering

Advantages

  • It is relatively efficient. Its complexity is O(nkt) where t is the number of iterations.
  •  It works well when the data is clearly separated.

Disadvantages

  • The most common problem in clustering is the clusters are of unequal size or density.
  • When the clusters are in non-spherical shapes.
  • K means algorithm can be affected when we have outliers in the dataset. It will change the expected results. So it is very necessary to remove the outliers before applying k-means.

To overcome this problem we use the k-means++ algorithm.

K-means++ solves the problem by selecting the cluster centroid in the early stages and all the remaining steps are the same.

Steps:

  • Randomly choose the first cluster
  • next, we compute the distance between each point and choose centroids.
  • choose the next centroid from the data with the probability of x being proportional to all distances
  • repeat process 2 and 3 until we choose k centroids

Implementing k-means clustering using Scikit-learn in Python

We are going to code a real-life problem which is clustering documents based on their similarity. You can download this dataset directly from this link.

Let us explore the data we have two columns first is Idea and the second is Topic. The topic contains NA values. And we have to cluster the data and fill all these NA values with the corresponding topic. 

So there are the following steps that you have to take to solve this problem.

Step 1- Load the required libraries

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.cluster import KMeans

Step 2- Load the data and select the column that we have to use

data=pd.read_excel('data.xlsx') 
idea=data.iloc[:,0:1] 

Step 3- Converting the column of data from an excel sheet into a list of documents, where each document corresponds to a group of sentences.

corpus=[]
for index,row in idea.iterrows():
    corpus.append(row['Idea'])

Step 4- After that, we use a count vectorizer for tokenizing the collection of text documents and build a vocabulary of known words. Then we transform the data.

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['Idea'].values.astype('U'))

Step 5- In this step, we use TfidfTransformer to convert the collection of raw documents into a matrix of TF-IDF features. After that, we transform it.

To learn more about TF_IDF you can check this link.

transformer = TfidfTransformer(smooth_idf=False)
tfidf = transformer.fit_transform(X)
print(tfidf.shape)

Step 6- Now the second last step is to define the no of clusters in our case it is 5. And then we call our KMeans model and fit it.

num_clusters = 5#Change it according to your data.
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf)
clusters = km.labels_.tolist()

Step 7- Now the last step is to create a dictionary containing the value of the document with the corresponding cluster number. And then we convert it as a data- frame.

idea={'Idea':corpus, 'Cluster':clusters} #Creating dict having doc with the corresponding cluster number.
frame=pd.DataFrame(idea,index=[clusters], columns=['Idea','Cluster']) # Converting it into a dataframe.

print("\n")
print(frame) #Print the doc with the labeled cluster number.
print("\n")
print(frame['Cluster'].value_counts()) #Print the counts of doc belonging to each cluster.
Final output of k-means clustering

Wrap up the Session

In this blog we have learned about what is clustering, types of clustering, what is k-means clustering and how it works, then we understand the algorithm mathematically as well as programmatically, after that we learn where to use the k-means clustering and then we learn about what are pros and cons of k-means clustering and how to overcome it and then we code this algorithm for a real-life problem which is document clustering.

If you want to learn about machine learning, deep learning, natural language processing, computer vision. You can subscribe to my blog for getting notified when I release a new blog.

If you liked this tutorial please like it, share it, and if you have any problem regarding the implementation or any topic feel free to leave a comment.