Last Update: 05 Aug 2022

This sequence of articles will guide you through the k-Nearest Neighbours (k-NN) algorithm using two different approaches: (i) k-NN for classification and (ii) k-NN for regression.

By the end of this series, you will be able to use the k-NN algorithm to “predict” whether patients have developed a “malignant” or “benign” breast tumour using the Biopsy Data on Breast Cancer Patients dataset. It is essential to highlight that our approach only demonstrates the k-NN algorithm; we do not use it for actual medical diagnosis. Additionally, we will perform tuning procedures to enhance the accuracy of our predictions.

Suppose that our task is to find a variable Y such that Y is a qualitative outcome answering a possible inquiry: I know you have a perfect playlist on Spotify. However, if I recommended a Rage Against the Machine album, would you “like” it or not? “Yes” or “No”?

Top tech companies such as Spotify, Google, Amazon, Netflix, YouTube, Meta, etc., take advantage of classification algorithms to improve personalized product predictions and enhance customer engagement on their platforms. Recommendation systems are everywhere.

Although this question looks “basic” from a first-principles perspective, extracting consistent, robust, and meaningful features from messy datasets can be an incredibly daunting and complex task. Therefore, data curation and management, the treatment of outliers, and the correction of erroneous assumptions become daily procedures for successful product, development, and management teams.

Generally, handling qualitative or categorical problems by tuning classification algorithms can boost personalized recommendations and improve qualitative responses to specific questions. We can improve engagement and revenue by assigning clients’ observations and behaviour to classes or categories with better accuracy. Nevertheless, this is not so simple: generally speaking, it is hard to deploy these algorithms into production, since most infrastructures are built on top of legacy code that has not been updated for “centuries”. This is a topic for MLOps (Machine Learning Operations) and Machine Learning Pipelines discussions, and I will leave it open for another article.

… data curation, management, treatment of outliers, and erroneous assumptions are part of a successful application.

As a side note, in future articles we will explore multiple classifiers that can be used for recommendation purposes: their respective advantages and disadvantages, specific nuances when implementing them from scratch, and the use of Python libraries and packages that can help you backtest the algorithms. One can often use classification and regression techniques to determine the probability of an outcome category, a procedure that is critical for classifying a specific outcome variable. In our case, we will explore the k-Nearest Neighbours algorithm, which can sometimes be unstable but very often shows excellent performance on large datasets. Why?

The advantage of k-NN is that the data can be continuously updated with new queries for recommendation purposes. It makes no prior assumptions about a parametric function f(X) or specific decision boundaries. This approach is very different from the regression mechanism, where we need to fit a linear or non-linear function f(X) that minimizes the sum of squared errors.

k-NN is also known as a “lazy learner” because it does not need an explicit training phase; it computes the relevant distance metrics on the fly, which can be advantageous for production datasets. Furthermore, k-NN is non-parametric: instead of fitting a global parametric form f(X), the function is approximated locally. It is important to highlight that k-NN is classified as a non-probabilistic method, where the output is the assignment of objects to one specific class (or possibly multiple classes). This means that our predictions can be a binary category (two classes, K=2) such as “Yes” or “No,” or multiple classes for K>2, like “Yes,” “No,” and “Maybe” ;). We will deal with multiple classes in future posts by introducing the one-hot-encoded output vector.

Let’s illustrate the concepts of k-NN through a tiny, simple, and admittedly biased example.

Consider that someone has just launched an online store, and a few clients have already bought a computer mouse for their devices. The computational infrastructure records four pieces of information about each customer. Client 1 has just subscribed to the online store and added multiple items to their wishlist, scattered over different categories such as electronics, clothes, shoes, and food. Hence, we are presented with a tiny dataset composed of five feature vectors x_1,…,x_5, where each vector characterizes one client’s behaviour and has four features x_i=[x_{1,i},x_{2,i},x_{3,i},x_{4,i}]. Here, the features are age, gender (Male=0, Female=1), the frequency with which people from the same county bought the same device, and a broad category-based wishlist metric. The wishlist can be transformed into a metric that represents the customer’s interest in multiple items; we can also engineer features based on the rate of likes for a specific category (e.g., electronics) relative to all categories. Let us assume the following feature vectors for each customer:

Client 1: x_1=[x_{1,1},x_{2,1},x_{3,1},x_{4,1}]

Client 2: x_2=[x_{1,2},x_{2,2},x_{3,2},x_{4,2}]

Client 3:  x_3=[x_{1,3},x_{2,3},x_{3,3},x_{4,3}]

Client 4:  x_4=[x_{1,4},x_{2,4},x_{3,4},x_{4,4}]

Client 5:  x_5=[x_{1,5},x_{2,5},x_{3,5},x_{4,5}]

and we want to know if Client 1 would buy a computer mouse from the store. We have a binary choice, “Yes” or “No,” which can be represented as “Yes” = 1 and “No” = 0.

Since Clients 2 and 4 have bought the device and Clients 3 and 5 have not, we have the following target vector:

{\bf{y}}=[y_2^{C2},y_3^{C3},y_4^{C4},y_5^{C5}]=[1,0,1,0]


We can visualize our dataset as a matrix where the columns are age, gender, rate of item per county, and category-based wishlist, respectively:

{\bf{x}} = \begin{bmatrix} \bf{x}_1^{C1} \\ \bf{x}_2^{C2} \\ \bf{x}_3^{C3} \\ \bf{x}_4^{C4} \\ \bf{x}_5^{C5} \end{bmatrix}=\begin{bmatrix} x_{1,1} & x_{2,1} & x_{3,1} & x_{4,1} \\ x_{1,2} & x_{2,2} & x_{3,2} & x_{4,2} \\ x_{1,3} & x_{2,3} & x_{3,3} & x_{4,3} \\ x_{1,4} & x_{2,4} & x_{3,4} & x_{4,4} \\ x_{1,5} & x_{2,5} & x_{3,5} & x_{4,5} \end{bmatrix}

{\bf{x}} = \begin{bmatrix} 22 & 0 & 0.6 & 0.95 \\ 18 & 0 & 0.8 & 0.3 \\ 29 & 1 & 0.3 & 0.1 \\ 25 & 1 & 0.3 & 0.4 \\ 30 & 1 & 0.9 & 0.85 \end{bmatrix}

Now, we just need to choose a distance metric (the Euclidean distance) to estimate how close Client 1’s feature vector is to those of Clients 2, 3, 4, and 5. If we consider only the k=2 nearest neighbours, we can then select the two clients with the smallest distances.

The Euclidean distance is derived from the L2 norm and measures the length of the straight line connecting two vectors. In the n-dimensional Euclidean space \mathbb{R}^n, the distance between two vectors x = [x_1, x_2, …, x_n]

and

x' = [x'_1, x'_2, …, x'_n]

can be described as:

d(x, x') = \sqrt{(x_1-x'_1)^2+…+(x_n-x'_n)^2}

d(x, x') = \sqrt{\sum_{i=1}^{n} {(x_i-x'_i)}^2}
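As a quick sanity check, the formula above can be evaluated directly with NumPy. The two vectors below are made up purely for illustration:

import numpy as np

# Two illustrative vectors in R^4 (made-up values for this sketch)
x = np.array([1.0, 0.0, 0.5, 1.0])
x_prime = np.array([0.0, 1.0, 0.5, 0.0])

# Euclidean (L2) distance, written out exactly as in the formula above
dist = np.sqrt(np.sum((x - x_prime) ** 2))

# The same quantity via NumPy's built-in norm
dist_norm = np.linalg.norm(x - x_prime)

Both expressions return the same value (here, \sqrt{3}), so we can use whichever is more convenient.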

where n represents the total number of features in our dataset. Before we proceed to the next step, it is extremely important to highlight that we need to normalize our dataset. This procedure prevents features with large numeric ranges (such as age, in our example) from dominating the distance computation, which could lead to an incorrect classification. Hence, we will normalize our feature values based on the maximum and minimum values of each column, such that:

x_{i,j} = \dfrac{x_{i,j}-min(x_{j})}{(max(x_{j})-min(x_{j}))}

where max(x_{j}) and min(x_{j}) represent the maximum and minimum values of the j-th column.

We can do this quickly using Python and the NumPy library:

# Import the NumPy library
import numpy as np

# Feature matrix: one row per client (Clients 1-5); the columns are
# age, gender, rate of item per county, and category-based wishlist
feature_x = np.array([[22, 0, 0.6, 0.95],
                      [18, 0, 0.8, 0.3],
                      [29, 1, 0.3, 0.1],
                      [25, 1, 0.3, 0.4],
                      [30, 1, 0.9, 0.85]])

# Compute the normalized matrix (min-max normalization per column)
col_min = feature_x.min(axis=0)
col_max = feature_x.max(axis=0)
feature_x = (feature_x - col_min) / (col_max - col_min)

# Compute the Euclidean distances from Client 1 to Clients 2-5
euclidean_distances = []
for i in range(1, len(feature_x)):
    dist = np.sqrt(np.sum((feature_x[i] - feature_x[0]) ** 2))
    euclidean_distances.append(dist)

Using Client 1 as the reference, the distance to Client 2 is the smallest of the four, and the Euclidean distances between the normalized feature vectors of Client 1 and Clients 3, 4, and 5 are:

d^{C1,C3}=\sqrt{\sum_{i=1}^{4}(x_{i,3}-x_{i,1})^2} \approx 2.423

d^{C1,C4}=\sqrt{\sum_{i=1}^{4}(x_{i,4}-x_{i,1})^2} \approx 1.696

d^{C1,C5}=\sqrt{\sum_{i=1}^{4}(x_{i,5}-x_{i,1})^2} \approx 1.708

We have found that the k=2 nearest neighbours are Clients 2 and 4, since they present lower Euclidean distances than Clients 3 and 5. Our target vector indicates that Clients 2 and 4 bought a computer mouse, which means that y_{output}^{k=2} = [1,1]. Therefore, we can choose the most representative class for Client 1 based on the proportion of each class among the nearest neighbours. We found two “Yes” = 1 out of the two nearest neighbours, representing a 100% estimated probability that Client 1 will buy the computer mouse. It is important to highlight that this dataset was created purely to help you understand the fundamentals. Furthermore, multiple distance metrics can also be explored to classify or predict the output vectors.
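The voting step above can be sketched in NumPy. Note one assumption: the last three distances below are the values computed in the text, while the value 0.9 for Client 2 is a made-up placeholder that is simply taken to be the smallest of the four, as stated above:

import numpy as np

# Distances from Client 1 to Clients 2-5 (Client 2's value is assumed)
distances = np.array([0.9, 2.423, 1.696, 1.708])

# Target labels for Clients 2-5: "Yes" = 1 (bought), "No" = 0
y = np.array([1, 0, 1, 0])

k = 2
# Indices of the k nearest neighbours (here, Clients 2 and 4)
nearest = np.argsort(distances)[:k]
y_output = y[nearest]

# Estimated probability that Client 1 buys, and the majority-vote class
probability = y_output.mean()
prediction = int(probability > 0.5)

With these inputs, y_output is [1, 1], so the estimated probability is 1.0 and the majority vote predicts “Yes” for Client 1, matching the reasoning above.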

These are the fundamentals! Next, we will explore the application of k-NN to the Biopsy Data on Breast Cancer Patients dataset. The following Quick Article will be dedicated to building functions, exploring different distance metrics, and introducing broadcasting to speed up code execution by avoiding loops. To tune our model, we will also deal with “stochastic” cross-validation for k-NN, ROC curves, and the confusion matrix.

By the end of the k-NN series, I will also recommend some exciting resources such as books and articles to help you extend and dive deeper into the topic.

