Abhishek Singh

K-Means Clustering in Python using scikit-learn










Introduction

Suppose you are working in a Walmart store and your task is to group similar items together. You end up making four clusters of items: food items, beauty products, electronics, and toys. So the definition of clustering is


"When you bunch similar items together that is called clustering".


Its main aim is to keep similar points in a group and dissimilar points in different groups. Clustering can be broadly classified into two subgroups:


1- Hard clustering: In this type, each data point either belongs to a cluster completely or does not belong to it at all.

2- Soft clustering: In this type, a data point can belong to more than one cluster with some likelihood value.


To make these clusters we use machine learning algorithms like K-Means and hierarchical clustering. So how do we identify which algorithm to use for making clusters?


The types of clustering algorithms are:


1- Centroid-based clustering, like K-Means


It organizes the data into non-hierarchical clusters and is efficient enough to handle large datasets.



2- Connectivity-based clustering, like hierarchical clustering


It forms a tree-like structure in which every node is split into child clusters.



Hierarchical clustering

3- density-based clustering like DBSCAN


In this type of clustering, we form clusters based on dense regions of points.


DBSCAN

In this tutorial, we are going to learn about K-Means clustering and its implementation in Python.


What is K-means Clustering

K-Means clustering is an unsupervised algorithm that attempts to minimize the distance between the points in a cluster and their centroid.


Note: Clusters are represented by a central vector.


It is a centroid-based algorithm, which means it minimizes the sum of distances between the points and their cluster centroid. In other words, K-Means solves an optimization problem.
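Formally, for k clusters C_1, ..., C_k with centroids mu_1, ..., mu_k, the quantity K-Means minimizes (reported as "inertia" in scikit-learn) is the within-cluster sum of squared distances:

J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2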




How does it work

Suppose you are working in a company and your boss assigns you a set of numbers. Your task is to group these numbers into two clusters using K-Means. So how do you solve this?


The following steps are required to solve the problem:


Step 1- Randomly pick k mean values from the set of points, where k is the number of clusters we have to make.


Step 2- Assign each point to the cluster of its closest mean.


Step 3- Recompute the mean of each cluster, then repeat steps 2 and 3 until the means stop changing.


k-means python


Now the question arises: what should we do when we don't know how many clusters there are in the given data?


The most common approach is to try different k values and calculate the intra-cluster variance for each. Increasing k decreases the intra-cluster variance, while decreasing it sharply increases the error sum; beyond a certain k, however, the improvement becomes marginal. The point that defines the optimal number of clusters is called the "elbow point", and it can be used as a measure to pick the best k value.
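Here is a minimal sketch of the elbow method with scikit-learn. The toy data from make_blobs and the range of k values are only for illustration:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with a few well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-Means for several values of k and record the intra-cluster variance (inertia)
inertias = []
k_values = range(1, 10)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

# The "elbow" of this curve suggests a good value for k
plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Intra-cluster variance (inertia)")
plt.show()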



Elbow point


How to solve k-means mathematically

Let us implement this procedure mathematically


Problem Statement: Given a random set of numbers {2, 3, 4, 10, 11, 12, 20, 25, 30}, we have to make 2 clusters using K-Means.




Solution- We have to pick 2 mean values because k = 2:

m1 = 4 and m2 = 12

Apply Step 2:     k1={2,3,4} & k2={10,11,12,20,25,30}

Apply Step 3:  m1=3  and m2= 18 (108/6)

Repeat steps 2 and 3

k1={2,3,4,10} and k2={11,12,20,25,30}

m1= 5   and m2= 20 (rounded off from 4.75 and 19.6)
k1={2,3,4,10,11,12} and k2={20,25,30}

m1= 7  and m2= 25

k1={2,3,4,10,11,12} and k2={20,25,30} and their mean is m1=7 and m2=25

Now we are getting the same means, so we stop repeating.

The 2 clusters are  k1={2,3,4,10,11,12} and k2={20,25,30}. 
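As a quick sanity check, here is a minimal sketch that reproduces this small 1-D example with scikit-learn (the reshape is needed because KMeans expects a 2-D array of samples):

import numpy as np
from sklearn.cluster import KMeans

numbers = np.array([2, 3, 4, 10, 11, 12, 20, 25, 30]).reshape(-1, 1)

# Two clusters; n_init repeats the algorithm with different starts and keeps the best run
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(numbers)

for label in range(2):
    print("cluster", label + 1, ":", numbers[kmeans.labels_ == label].ravel())
print("centroids:", kmeans.cluster_centers_.ravel())

This should recover the same two clusters as the manual calculation, with centroids 7 and 25.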


Real-world applications of K-Means

There are a lot of applications for K-Means. Some of them are


1- Optical character recognition: OCR is a technology that is used to extract text from images.


2- Cell nuclei detection: It is the most basic step for analyzing the cells. We use the K-means algorithm for the segmentation of cells.


3- Anomaly detection: Anomalies are points that lie far from any cluster, so the K-Means algorithm can be used to flag them.


4- Customer segmentation: Segmenting customers based on their sales patterns.


5- Recommendation engines: It is also used in Google search for recommending the most similar keywords.


Pros And Cons of using K-means

Pros

  • It is relatively efficient. Its complexity is O(nkt), where n is the number of points, k is the number of clusters, and t is the number of iterations.

  •  It works well when the data is clearly separated.

Cons

  • It struggles when the clusters are of unequal size or density.

  • It performs poorly when the clusters have non-spherical shapes.

  • The K-Means algorithm can be affected by outliers in the dataset, which will change the expected results, so it is necessary to remove outliers before applying K-Means.


K-Means is also sensitive to the choice of the initial centroids. To overcome this problem we use the K-Means++ algorithm.


K-Means++ solves the problem by choosing the initial cluster centroids more carefully; all the remaining steps are the same (a short scikit-learn note follows the steps below).


Steps:

  • Randomly choose the first centroid from the data points

  • Next, compute the distance between each data point and the nearest centroid chosen so far

  • Choose the next centroid from the data points, with the probability of picking a point x proportional to its squared distance from the nearest chosen centroid

  • Repeat steps 2 and 3 until k centroids have been chosen
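In scikit-learn you do not have to implement these steps yourself: KMeans uses k-means++ initialization by default. A minimal sketch making that explicit:

from sklearn.cluster import KMeans

# init="k-means++" is the default; shown here only to make the choice explicit
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)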


Implementing K-Means Clustering using scikit-learn in Python

We are going to code a real-life problem: clustering documents based on their similarity.


You can download this dataset directly from this link.


Let us explore the data: we have two columns, the first is Idea and the second is Topic. The Topic column contains NA values, and we have to cluster the data and fill these NA values with the corresponding topic.

So the following are the steps you have to take to solve this problem.


Step 1- Load the required libraries
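A minimal sketch of the imports used in the remaining steps (pandas for reading the sheet, plus the scikit-learn classes mentioned below):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer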



Step 2- Load the data and select the column that we have to use
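A sketch of loading the data, assuming the downloaded file is saved as an Excel sheet; the file name data.xlsx is only a placeholder, and Idea is the column described above:

# Load the spreadsheet and keep the column we will cluster on
df = pd.read_excel("data.xlsx")
ideas = df["Idea"]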


Step 3- Converting the column of data from the Excel sheet into a list of documents, where each document corresponds to a group of sentences.
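Continuing the sketch, each row of the selected column becomes one document:

# One document (string) per row of the "Idea" column
documents = ideas.astype(str).tolist()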



Step 4- After that, we use a count vectorizer for tokenizing the collection of text documents and build a vocabulary of known words. Then we transform the data.
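A sketch of this step with CountVectorizer:

# Tokenize the documents, build the vocabulary, and transform them into token counts
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(documents)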



Step 5- In this step, we use TfidfTransformer to convert the matrix of token counts into a matrix of TF-IDF features. After that, we transform the data.

To learn more about TF-IDF you can check this link.
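A sketch of this step with TfidfTransformer, continuing from the count matrix above:

# Re-weight the raw counts into TF-IDF features
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(counts)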



Step 6- Now the second-to-last step is to define the number of clusters; in our case it is 5.

And then we call our KMeans model and fit it.
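A sketch of this step, with 5 clusters as stated and a fixed random_state purely for reproducibility:

true_k = 5  # number of clusters
model = KMeans(n_clusters=true_k, init="k-means++", n_init=10, random_state=42)
model.fit(tfidf)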



Step 7- Now the last step is to create a dictionary mapping each document to its corresponding cluster number. Then we convert it into a data frame.
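A sketch of the final step, mapping each document to its cluster and turning the mapping into a data frame:

# Map each document to the cluster number K-Means assigned to it
clusters = {doc: int(label) for doc, label in zip(documents, model.labels_)}

# Convert the mapping into a data frame for inspection
result = pd.DataFrame(list(clusters.items()), columns=["Idea", "Cluster"])
print(result.head())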





Final output of k-means clustering


Wrap up the Session

In this blog we learned what clustering is and its types, what K-Means clustering is and how it works. We then worked through the algorithm mathematically as well as programmatically, looked at where the K-Means algorithm is used, discussed the pros and cons of K-Means clustering and how to overcome them, and finally applied the algorithm to a real-life problem: document clustering.


If you want to learn about machine learning, deep learning, natural language processing, or computer vision, you can subscribe to my blog to get notified when I release a new post.


If you liked this tutorial, please like and share it, and if you have any problem regarding the implementation or any topic, feel free to leave a comment.
