
K-Nearest Neighbors from Scratch in Python

K-Nearest Neighbors (KNN) is a straightforward algorithm that we can implement very easily.

It can be used to solve both classification and regression problems. The KNN algorithm is used in a variety of applications, such as medicine, banking, agriculture, and genomics.

One cool application is customer churn prediction, where we predict whether a customer is going to cancel their subscription plan or stick with it.

In the medical sector, we can use KNN to distinguish healthy patients from diabetic patients.

In agriculture, we use it to separate damaged crops from healthy ones. KNN works based on feature similarity.

In this blog, we are going to cover the following topics:

  • What is K-Nearest Neighbors?
  • How does the KNN algorithm work?
  • How do we choose the value of K?
  • Difference between eager learners and lazy learners
  • Curse of Dimensionality
  • Coding KNN on a real-life example
  • Pros and cons of KNN

What is K-Nearest Neighbors?

KNN is a non-parametric and instance-based learning algorithm. 

Non-parametric means the algorithm makes no assumptions about the underlying data distribution; the model structure is determined from the dataset itself. This is very useful in practice, since the vast majority of real-world datasets do not follow theoretical assumptions.

KNN is a type of instance-based or lazy learning, which means the classifier adapts immediately as we collect new training data: there is no separate training phase, so the algorithm can quickly respond when the input changes in real time.

Along with these benefits there are some disadvantages: for example, KNN does not perform well on imbalanced data, i.e., data with an uneven distribution of classes.

For example, consider an anomaly-detection dataset for some server. We have two classes, anomaly and not-anomaly, but their distribution is heavily skewed, with something like a 50:1 ratio of non-anomaly to anomaly examples.

How does the KNN algorithm work?

Let us take an example and understand the algorithm.

In the figure below you can see that we have two classes: the first class is denoted by blue squares and the second by red triangles. Now we have to find out which class the green circle belongs to.

To solve this problem we use KNN, where k represents the total number of neighbors.

We use KNN to find the k nearest neighbors of the green circle, take a majority vote among those neighbors' classes, and assign the winning class to the green circle.

[Figure: two classes plotted as blue squares and red triangles, with an unclassified green circle between them]

Let us make this concrete with an example. Take the k value as 3: the three closest neighbors to the green circle are 2 red triangles and 1 blue square, so by majority vote we assign the green circle to the red-triangle class.

When we take the k value as 5, blue squares form the majority of the five nearest neighbors, so the green circle belongs to the blue-square class.

So the question arises: how do we find the nearest neighbors?

To find the nearest neighbors we use a distance function, i.e., a function that gives us the distance between elements of a set.

Some common distance functions are Euclidean, Manhattan, and Minkowski. For two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the three formulas are:

Euclidean: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Manhattan: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

Minkowski: $d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
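As a quick illustration, these three functions can be written in a few lines of plain Python (the function names here are our own, not part of the implementation later in this post):

import math

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences ("city block" distance)
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # Generalizes both: p=1 gives Manhattan, p=2 gives Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

print(euclidean([0, 0], [3, 4]))     # 5.0
print(manhattan([0, 0], [3, 4]))     # 7
print(minkowski([0, 0], [3, 4], 2))  # 5.0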

Now that we have understood how the KNN algorithm works, let us write out all the steps in pseudo-code format:

  • Choose the value of K.
  • Calculate the distance between the unknown point and all the available points.
  • Sort the calculated distances in ascending order.
  • Take the top k rows from the sorted array.
  • Count the most frequent class among these rows.
  • Return that class as the prediction for the unknown point.
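Put together, those steps fit in a handful of lines. Here is a bare-bones sketch (assuming Python 3.8+ for math.dist, and training rows whose last element is the class label); the full, commented implementation appears later in this post:

from collections import Counter
import math

def knn_predict(train, query, k):
    # Sort training rows by distance to the query point and keep the k closest
    nearest = sorted(train, key=lambda row: math.dist(row[:-1], query))[:k]
    # Majority vote among the labels of the k nearest neighbors
    labels = [row[-1] for row in nearest]
    return Counter(labels).most_common(1)[0][0]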

How do we choose the value of K?

There is no predefined value of K for a dataset; it depends on the requirements of the problem. Generally, larger k values decrease the effect of noise on the classification but make the boundaries between classes less distinct.

We can also use cross-validation to find the optimal value of k: we try different k values, check how the validation error rate varies, and choose the elbow point of the error curve as the best k.

[Figure: validation error rate plotted against k, with the elbow point marking the best value]
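A minimal sketch of that search, reusing the hypothetical knn_predict function from the earlier sketch (the candidate range and the data layout are assumptions, not fixed rules):

def find_best_k(train_set, val_set, candidates=range(1, 26, 2)):
    # Measure the validation error rate for each candidate k
    errors = {}
    for k in candidates:
        wrong = sum(1 for row in val_set
                    if knn_predict(train_set, row[:-1], k) != row[-1])
        errors[k] = wrong / len(val_set)
    # In practice, plot errors against k and pick the elbow point;
    # returning the minimum-error k is a simple stand-in
    return min(errors, key=errors.get), errors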

Difference between eager learners and lazy learners

Some of the major differences between eager learners and lazy learners are:

1- Training is fast for lazy learners and can be slow for eager learners.

2- Prediction requires work at query time, so it can be slow for lazy learners, whereas predictions are fast for eager learners.

3- Lazy learners wait for a query before generalizing, whereas eager learners generalize before seeing any query.

Curse of Dimensionality

The curse of dimensionality says that as the number of features increases, it becomes very hard to predict new data: with many dimensions all points tend to be nearly equidistant, so distance-based neighbors become less meaningful, and the problem becomes computationally expensive to solve.

To mitigate this we use dimensionality-reduction techniques such as principal component analysis (PCA), t-SNE, Sammon mapping, and Laplacian eigenmaps.
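For instance, with scikit-learn (not otherwise used in this post, so treat this as an optional illustration), PCA can compress the feature space before running KNN; X is assumed to be your own feature matrix:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)          # keep only the first two principal components
X_reduced = pca.fit_transform(X)   # project the data into the reduced space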

Coding KNN on a real-life example

# Import all functions necessary for the notebook
import csv
from random import randrange
import math
import operator

# Load a CSV file into a list of rows
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = csv.reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

## Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())
        
## Convert string column to integer        
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

##### Normalize Data ###########

# Find the min and max values for each column

def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        colvalues = [row[i] for row in dataset]
        min_value = min(colvalues) 
        max_value = max(colvalues)
        minmax.append([min_value, max_value])
    return minmax

# Normalize every column of the dataset except the last, which holds the class values
def Normalize_Dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)-1):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

# Split a dataset into a train and test set
def train_test_split(dataset, split):
    train = list()
    train_size = split * len(dataset)
    dataset_copy = list(dataset)
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy
    
# Split a dataset into k folds 

def cross_validation_split(dataset, folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / folds)
    for i in range(folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# Get accuracy of prediction #
def getAccuracy(actual,predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i][-1] == predicted[i]:
            correct += 1
    return (correct / float(len(actual))) * 100.00

# Calculate a Confusion Matrix #
def confusion_matrix(actual, predicted):
    unique = sorted(set(row[-1] for row in actual))
    matrix = [list() for x in range(len(unique))]
    for i in range(len(unique)):
        matrix[i] = [0 for x in range(len(unique))]
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for i in range(len(actual)):
        x = lookup[actual[i][-1]]
        y = lookup[predicted[i]]
        matrix[x][y] += 1
    return unique, matrix

# Printing a confusion matrix
def print_confusion_matrix(unique, matrix):
    print('Unique prediction values:')
    print('(P)' + ' '.join(str(x) for x in unique))
    print('(A)---')
    print("Confusion Matrix:")
    for i, label in enumerate(unique):
        print("%s| %s" % (label, ' '.join(str(count) for count in matrix[i])))

# Macro-averaged recall, precision, and F1 score from a confusion matrix
# (rows are actual classes, columns are predicted classes)
def recall_precision_calc(matrix):
    n_classes = len(matrix)
    recalls, precisions = [], []
    for i in range(n_classes):
        row_values = matrix[i]                   # actual class i
        col_values = [row[i] for row in matrix]  # predicted as class i
        tp = matrix[i][i]
        fn = sum(row_values) - tp  # actual i, predicted as another class
        fp = sum(col_values) - tp  # predicted i, actually another class
        recalls.append(tp / (tp + fn) if (tp + fn) else 0.0)
        precisions.append(tp / (tp + fp) if (tp + fp) else 0.0)

    recall = sum(recalls) / n_classes
    precision = sum(precisions) / n_classes

    F1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) else 0.0

    return recall, precision, F1_score

#Euclidean Distance
def EuclideanDistance(instance1, instance2, length):
    distance = 0
    for i in range(length):
        distance += pow(instance2[i]-instance1[i],2)
    return math.sqrt(distance)

# Find the num_neighbors closest training rows to a test instance
def getNeighbors(trainingSet, testInstance, num_neighbors, distancetype, *args):
    distances = []
    length = len(testInstance)-1
    for i in range(len(trainingSet)):
        if distancetype == "Euclidean":
            dist = EuclideanDistance(testInstance, trainingSet[i], length)
        else:
            raise ValueError("Unsupported distance type: %s" % distancetype)
        distances.append((trainingSet[i], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(num_neighbors):
        neighbors.append(distances[x][0])
    return neighbors

#Classification from neighbors (Classification problem)
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def main():
    filename = 'iris.csv'
    dataset = load_csv(filename)
    # convert string columns to float
    for i in range(4):
        str_column_to_float(dataset, i)
        
    # convert class column to int
    lookup = str_column_to_int(dataset, 4)
    
    # normalization of dataset
    minmax = dataset_minmax(dataset)
    Normalize_Dataset(dataset, minmax)

    # Splitting dataset between Training and Testing Set
    split = 0.6
    trainingSet, testSet = train_test_split(dataset, split)
    
    predictions = []
    num_neighbors = 3
    for i in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[i], num_neighbors, "Euclidean")
        classify = getResponse(neighbors)
        predictions.append(classify)
        print('> predicted=' + repr(classify) + ', actual=' + repr(testSet[i][-1]))
        
    #Accuracy Assessment
    accuracy = getAccuracy(testSet,predictions)
    print('Accuracy :' + repr(accuracy) + '%')
    unique, matrix = confusion_matrix(testSet, predictions)
    
    print_confusion_matrix(unique, matrix)
    
    Recall, Precision, F1_score = recall_precision_calc(matrix)
    print('Recall:', Recall)
    print('Precision:', Precision)
    print('F1 score:', F1_score)

if __name__ == '__main__':
    main()
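Note that this script expects an iris.csv file in the working directory with no header row: four numeric feature columns followed by the class label in the fifth column. Since the train/test split is random, the exact accuracy will vary a little between runs.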

Pros and Cons of KNN

Pros of KNN

  • KNN can be used to solve both classification and regression problems.
  • The KNN algorithm can easily respond to changes in the input in real time.
  • It can also be used to solve multi-class classification problems.
  • There are no special criteria that the data has to meet before we can apply KNN.

Cons of KNN

  • KNN works well with a small number of input variables, but as the number of variables grows it becomes very hard to predict the output for new data (the curse of dimensionality again).
  • The KNN algorithm needs normalized data.
  • It cannot deal with missing values.
  • The major issue with KNN is choosing the optimal number of neighbors.

Wrapping Up the Session

In this tutorial we learned what the KNN algorithm is and how it works, how to choose the optimal value of K, and the difference between lazy learners and eager learners. We then implemented the KNN algorithm on a real-life example using Python, and finally we covered the advantages and disadvantages of using KNN.

If you want to learn about machine learning, deep learning, natural language processing, or computer vision, you can subscribe to my blog to get notified when I release a new post. If you liked this tutorial, please like and share it, and if you have any problem with the implementation or any other topic, feel free to leave a comment.