Last Update: 01 Jan 2022

Welcome to Part II of the k-NN Algorithm from Scratch series. This sequence of quick articles will guide you through understanding and building the k-Nearest Neighbours (k-NN) algorithm from scratch in Python for classification and regression.

Please consider reading “k-NN Algorithm from Scratch: A Quick Intro, Part I” if you feel uncomfortable proceeding directly to Part II. These sequential posts are part of the topics discussed in my upcoming e-Book and Course at Aicavity Learn, which will be available by the end of March 2022.

If you are interested in Algorithms, Artificial Intelligence, Machine Learning, Data Science & Analytics, Management, and Engineering, then please make sure to Subscribe to Aicavity Connect Newsletter!

 

We are aiming to explore the Biopsy Data on Breast Cancer Patients and use k-NN to “predict” whether patients have developed a “malignant” or “benign” breast cancer. Although this application is not suitable for accurate medical diagnosis, it will help us demonstrate the development of the k-NN algorithm from scratch. We will also perform tuning procedures to enhance the prediction accuracy of these classifications.

First, we need to build the basic structure of the k-NN algorithm by defining some functions in Python. 

We can start with initialize_parameters() and generate_random_dataset(), which are responsible for generating our dataset with predefined starting parameters such as the number of first neighbors k, the number of columns of our matrix (n_col = number of features), the total number of rows in our random dataset (n_total), and the total number of rows in our test dataset (n_test). The function generate_random_dataset() creates matrices with n_col=2 columns, n_total=500 rows for x_train, and n_test=10 rows for x_test.

What is our goal?

The idea is to produce a random dataset, and the goal is to classify the feature vectors x_test (extracted from the dataset) based on the randomly generated classes for the x_train using the Euclidean and Hamming distance metrics. 


# Import Numpy and Matplotlib Libraries
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(3)

# Define some parameters to generate 
# our dataset randomly
def initialize_parameters():

    # Total number of rows in our dataset
    n_total = 500

    # Total number of features in our dataset
    n_col = 2 

    # Total number of rows for classification (test dataset)
    n_test = 10 

    # Number of first neighbors (k=5)
    k = 5 
    resol = 600 # dpi resolution to save images
    return n_total,n_col,n_test,k,resol
  
# Function to generate our dataset 
# randomly using np.random.rand()
def generate_random_dataset(n_total,n_col,n_test):

    # Generate matrices with n_total rows and n_col=2 columns
    x = np.random.rand(n_total*n_col) 

    # Reshape to n_col columns
    x = x.reshape((-1,n_col)) 
    x_t = np.random.rand(n_test*n_col) 
    x_t = x_t.reshape((-1,n_col))
    return x,x_t
  
# Initialize parameters
n_total,n_col,n_test,k,resol = initialize_parameters()

# Generate random datasets x_train and x_test
x_train,x_test= generate_random_dataset(n_total,n_col,n_test) 
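
As a quick sanity check (not part of the original listing), we can confirm the shapes of the generated arrays:

# x_train should have shape (500, 2) and x_test shape (10, 2)
print(x_train.shape, x_test.shape)
<Out>: (500, 2) (10, 2)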

Let us check the first 10 rows of our training dataset:

# First 10 values from the randomly generated dataset
print(x_train[:10])
<Out>: array([[0.5507979 , 0.70814782],
              [0.29090474, 0.51082761],
              [0.89294695, 0.89629309],
              [0.12558531, 0.20724288],
              [0.0514672 , 0.44080984],
              [0.02987621, 0.45683322],
              [0.64914405, 0.27848728],
              [0.6762549 , 0.59086282],
              [0.02398188, 0.55885409],
              [0.25925245, 0.4151012 ]])

But how can we visualize our dataset?

For this specific case, we have two features where the x-axis and y-axis represent the first and second components, respectively. These features are represented by the first and second columns of x_train. The next figure shows blue and red circles representing the distribution of x_train and x_test, respectively. We can plot the red circles (x_test) alongside the training dataset x_train (blue circles) before applying the k-NN algorithm.

 

# Plot the distribution of feature 
# vectors from x_test(red) and x_train (blue)
fig = plt.figure(figsize=(7,5))
# Training dataset
xtrain = plt.scatter(x_train[:,0],x_train[:,1],s=50,alpha=0.5,color='blue') 
# Test dataset
xtest = plt.scatter(x_test[:,0],x_test[:,1],s=70,alpha=0.5,color ='red')
plt.xlabel('First Column',size=17)
plt.ylabel('Second Column',size=17)
plt.legend((xtrain,xtest),
           ('x_train', 'x_test'),
           scatterpoints=1,
           ncol=2,
           fontsize=12,loc='best', bbox_to_anchor=(0.2, 1))
plt.show()
# Save Figure
fig.savefig('x_train_x_test.png',dpi=resol) 
… and we can generate the following distribution …

The blue circles represent the training dataset, and the red circles represent the testing dataset.

We now assign a random label in y_train to each feature vector of x_train, so y_train is a vector composed of 0’s and 1’s (binary classification). Our next step is to split x_train into x_label_0 and x_label_1, the subsets of x_train with labels 0 and 1, respectively. We can then plot these subsets in two dimensions using contrasting colors, where the first column represents feature 1 (x-axis) and the second column represents feature 2 (y-axis).

Furthermore, we can build a dataset whose third column holds our label vector (qualitative response “Yes”=1 and “No”=0). In the next plot, green and orange represent labels 0 and 1, respectively, while the red circles represent elements of x_test that still need to be classified. Let’s check the first five items of our dataset and split it into x_label_1 and x_label_0 using the split_x_binary() function:

# Generate a random vector with binary classification 

random_vector = np.random.rand(int(n_total))
random_vector = random_vector.reshape((-1,1))
y_train = np.where(random_vector>0.5,1,0)

# Concatenate the data:
dataset = np.concatenate([x_train,y_train],axis=1)

# Generate the boolean vector with True when y=1 and False when y=0.
# You could also try: idx = y_train>0; idx = idx.reshape(-1)

idx = np.where(y_train>0,True,False).reshape(-1) 

# Record the values with label 1 and label 0 in x_label_1 and x_label_0, respectively.
# The ~ operator negates the boolean vector, i.e. it is True where y=0.

x_label_1=x_train[idx]
x_label_0=x_train[~idx] 

# The splitting of x_train into x_label_1 and 
# x_label_0 can be wrapped in the function

def split_x_binary(x_train,y_train):
    y_train = np.array(y_train)
    idx = np.where(y_train>0,True,False).reshape(-1)
    x_label_1=x_train[idx]
    x_label_0=x_train[~idx]
    return x_label_0,x_label_1

We can double check the dataset …

print(dataset[:5])
<Out>: array([[0.5507979 , 0.70814782, 1.        ],
              [0.29090474, 0.51082761, 1.        ],
              [0.89294695, 0.89629309, 0.        ],
              [0.12558531, 0.20724288, 0.        ],
              [0.0514672 , 0.44080984, 0.        ]])
print( x_label_1[:5])
<Out>: array([[0.5507979 , 0.70814782], 
              [0.29090474, 0.51082761], 
              [0.02987621, 0.45683322], 
              [0.64914405, 0.27848728], 
              [0.6762549 , 0.59086282]])
print(x_label_0[:5])
<Out>: array([[0.89294695, 0.89629309],
              [0.12558531, 0.20724288],
              [0.0514672 , 0.44080984],
              [0.25925245, 0.4151012 ],
              [0.44045372, 0.15686774]])

… and use the function plot_scattering(x_label_0,x_label_1,x_test) to visualize the distributions of x_label_0, x_label_1 and x_test in different colors:

# We can plot the scattering distribution with 3 colors:
def plot_scattering(x_label_0,x_label_1,x_test):
    fig = plt.figure(figsize=(7,5))
    xtrain1=plt.scatter(x_label_1[:,0],x_label_1[:,1],s=50,alpha=0.5,color='orange')
    xtrain0=plt.scatter(x_label_0[:,0],x_label_0[:,1],s=70,alpha=0.5,color ='green')
    xtest=plt.scatter(x_test[:,0],x_test[:,1],s=100,alpha=0.5,color ='red') # Test
    plt.xlabel('First Column',size=17)
    plt.ylabel('Second Column',size=17)
    plt.legend((xtrain1,xtrain0,xtest),
               ('x_label_1','x_label_0', 'x_test'),
               scatterpoints=1,
               ncol=3,
               fontsize=12,
               loc='best', bbox_to_anchor=(0.2, 1))
    plt.show()
    fig.savefig('split_x_test.png',dpi=resol)
    
plot_scattering(x_label_0,x_label_1,x_test)
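
Since the labels were drawn uniformly at random, roughly half of them should be 1’s; we can verify this with a quick check (not part of the original code):

# Fraction of rows labelled 1 (expected to be close to 0.5)
print(y_train.mean())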

How should we proceed?

We have generated our dataset using np.random.rand(), so the features already lie in [0, 1] and do not strictly need normalization. However, we will build a function to normalize our data since we will need it for other applications.

Since you will need to analyse the following Python code, it is important to mention that we have used a trick in the list comprehension that builds min_max_vector. The idea is to append, for each column, the pair [min_i, max_i], producing an array whose first column holds the minima [min_1 min_2 … min_n] and whose second column holds the maxima [max_1 max_2 … max_n]; the number of pairs depends on data_input.shape[1], which in our case is equal to two.

Furthermore, the line that computes normalized_dataset uses another trick: it normalizes the input data by broadcasting operations over data_input[:]. As an example, one row would be transformed as ([1.1 2.3] - [min_1 min_2]) / ([max_1 max_2] - [min_1 min_2]), where min_1 is the minimum of column 1, min_2 is the minimum of column 2, max_1 is the maximum of column 1, and so forth. Please check the code carefully and try to understand the procedure. If you find a better way to compute or optimize any part of these articles, please do not hesitate to contact me.

def normalize_data(data_input):
    min_max_vector = [] 
    
    # Append ([min_1 min_2 ... min_n],[max_1 max_2 ... max_n]) depending on data_input.shape[1]=2
    [min_max_vector.append([np.min(data_input[:][:,i]),np.max(data_input[:][:,i])]) 
     for i in range(data_input.shape[1])]
    
    min_max_vector = np.asarray(min_max_vector) # Transform in array
    
    # Normalization of the data_input by Broadcasting operations over data_input[:]
    # i.e [1.1 2.3] - [min_1 min_2] / ([max_1 max_2]-[min_1 min_2]), min_1 = minimum of column 1 and so forth
    normalized_dataset = (data_input[:]-min_max_vector[:][:,0])/(min_max_vector[:][:,1]-min_max_vector[:][:,0])
        
    return normalized_dataset
  
 # Normalize the input
x_train = normalize_data(x_train) 
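
For comparison, here is a shorter, equivalent sketch that uses NumPy's axis argument instead of the list comprehension (the name normalize_data_vectorized is introduced here for illustration; it assumes every column has max > min):

def normalize_data_vectorized(data_input):
    # Column-wise minima and maxima (one value per feature)
    col_min = data_input.min(axis=0)
    col_max = data_input.max(axis=0)
    # Broadcasting scales each column to the [0, 1] range
    return (data_input - col_min) / (col_max - col_min)

# Should match normalize_data() up to floating-point error:
# print(np.allclose(normalize_data(x_train), normalize_data_vectorized(x_train)))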

After introducing the Euclidean distance metric in “k-NN Algorithm from Scratch: A Quick Intro, Part I”, we just need to rewrite it as the function euclidean_distance(x_row,y_row).

The next step is to build the Hamming distance metric. The Hamming metric measures the number of positions at which two feature vectors, represented in binary, differ. Therefore, each float number is transformed into its 32-bit binary representation, and computing the distance between two such 4-byte sequences amounts to finding the minimum number of bit substitutions required to convert one into the other. For example, if the input vector has ten features, we reshape it as a 40-byte (320-bit) sequence and compare it with the other 40-byte feature vectors.

As an illustration, the input x_row = [1, 1], y_row = [1, 1] is transformed into:

x_row = y_row = [[0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], [0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

For this specific case, the Hamming distance is 0: no bit needs to be substituted because both vectors are identical.
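
We can quickly verify this 32-bit pattern (a small check, not part of the original code) using the same float32/int32 view trick employed by hamming_distance() below:

# 32-bit IEEE-754 pattern of the float32 value 1.0
bits = '{:032b}'.format(np.asarray(1.0, dtype=np.float32).view(np.int32).item())
print(bits)
<Out>: 00111111100000000000000000000000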

 

# Euclidean distance metric
def euclidean_distance(x_row,y_row):
    output_distance = np.sqrt(np.sum((x_row-y_row)**2)) # Element-wise sum and square root
    return output_distance
  
# Hamming distance metric
def hamming_distance(x_row,y_row): 
    x_bin = []
    y_bin = []
    # sum_list = []
    for idx in range(len(x_row)):
        x_int32bits = np.asarray(x_row[idx], dtype=np.float32).view(np.int32).item()
        x_int32bits = list(map(int, '{:032b}'.format(x_int32bits)))
        y_int32bits = np.asarray(y_row[idx], dtype=np.float32).view(np.int32).item()
        y_int32bits = list(map(int, '{:032b}'.format(y_int32bits)))
        x_bin.append(x_int32bits)
        y_bin.append(y_int32bits)
    
    binx,biny = np.asarray(x_bin),np.asarray(y_bin)
    sum_=np.sum(np.reshape(binx,(-1))!=np.reshape(biny,(-1)))
    return sum_

   # You could also rewrite the previous
   # computation by replacing the line that
   # builds sum_ with the following, less efficient version:
   # add sum_list = [] before the for loop,
   # and add the next lines before return sum_:
   # [sum_list.append(np.sum(binx[:][idx]!=biny[:][idx])) for idx in range(len(binx))]
   # sum_ = np.sum(np.asarray(sum_list))
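
As another alternative, a shorter vectorized variant could rely on np.unpackbits to expand the raw float32 bytes into bits (the name hamming_distance_vectorized is introduced here for illustration; it counts the same number of differing bits, only traversed in a different byte order):

def hamming_distance_vectorized(x_row,y_row):
    # View the float32 values as raw bytes and expand each byte into 8 bits
    x_bits = np.unpackbits(np.asarray(x_row, dtype=np.float32).view(np.uint8))
    y_bits = np.unpackbits(np.asarray(y_row, dtype=np.float32).view(np.uint8))
    # Hamming distance = number of positions where the bits differ
    return int(np.sum(x_bits != y_bits))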

Suppose we have 1,000,000 data points, 200 features, and 40 bytes per feature. In that case, we end up with quite a large matrix, corresponding to 40*200*1,000,000 bytes = 8 Gigabytes of memory.

Now we can check the distance between x_train[1] and the first five feature vectors (n_of_idx=5) from x_train[:5] using both the Euclidean and Hamming distances:

#Euclidean distance: 

# Create an empty list to append 
# the Euclidean distances
dist_metrics = []

# Calculate this number of distances
n_of_idx = 5 
[dist_metrics.append(euclidean_distance(x_train[1],x_train[idx])) 
      for idx in range(n_of_idx)]
dist_metrics
<Out>: [0.3277255721523581, 
        0.0,
        0.7180612973191687,
        0.34692592744974166,
        0.2506733036349303]

# Hamming distance:
# Create an empty list to append 
# the Hamming distances
dist_metrics_hamming = []

# Calculate this number of distances
n_of_idx = 5
[dist_metrics_hamming.append(hamming_distance(x_train[1],x_train[idx])) 
      for idx in range(n_of_idx)]
print(dist_metrics_hamming)
<Out>: [27, 0, 29, 22, 32]

What is the next step? Let’s find the first neighbors …

We will create the function find_neighbors(), whose input parameters are x_test, x_train, the labels of the training dataset y_train, the parameter k=5 from initialize_parameters(), and the distance metric, which can be either euclidean_distance or hamming_distance. The for loop over data_train is responsible for computing the distances stored in distance_list. The argsort call then sorts the rows by distance in ascending order, and the last two slicing lines select the k first neighbors and their respective labels.

# Function to find k first neighbors.
def find_neighbors(data_test,data_train,y_train,number_of_neighbours,distance_metrics):
    
    distance_list = [] # List to append the distances
    new_distance_list = np.array([]) # Array that will hold the sorted distance_list 
    
    # Calculate the distances and save them in distance_list
    for idx,value in enumerate(data_train): # loop with index and value from data_train 
        distance = distance_metrics(data_test,value) # Distance between data_test and this training row
        distance_list.append((value,distance,y_train[idx])) # Append the row, its distance and its label

    distance_list = np.array(distance_list) # transform list to array
    new_distance_list = distance_list[distance_list[:,1].argsort(kind='mergesort')] # Sort rows based on distance
    
    # Select the first neighbors
    x_first_neighbours = new_distance_list[:number_of_neighbours,:new_distance_list.shape[1]-2]
    y_first_neighbours = new_distance_list[:number_of_neighbours,new_distance_list.shape[1]-1] 
 
    return x_first_neighbours,y_first_neighbours
We can check the first k=5 neighbors by considering x_train[0] as the reference:
#Euclidean distance:
find_neighbors(x_train[0],x_train,y_train,k,euclidean_distance)
<Out>: (array([[array([0.5527938 , 0.70827963])],
        [array([0.53004823, 0.70517651])],
        [array([0.58029565, 0.694854  ])],
        [array([0.57552994, 0.73191281])],
        [array([0.56473116, 0.67598995])]], dtype=object),
 
<Out>: array([array([1]), array([0]), array([1]), array([0]), array([0])],
       dtype=object))

#Hamming distance:
find_neighbors(x_train[0],x_train,y_train,k,hamming_distance)
<Out>: (array([[array([0.5527938 , 0.70827963])],
        [array([0.53526396, 0.17293526])],
        [array([0.58029565, 0.694854  ])],
        [array([0.56914353, 0.61141693])],
        [array([0.60368971, 0.61431822])]], dtype=object),
<Out>: array([array([1]), array([0]), array([1]), array([1]), array([1])],
       dtype=object))

Now we can predict the most representative label for each input vector. To do so, we build the function predict(), which calls find_neighbors() and returns the majority label among the k first neighbors.

def predict(data_test, data_train,y_train,number_of_neighbors,distance_metrics):
    
    x_neighbors,y_neighbors = find_neighbors(data_test,data_train,y_train,number_of_neighbors,distance_metrics)
    y_neighbors = list(y_neighbors)
    
    output = max(y_neighbors,key=y_neighbors.count) 
    return output
  
def knn(x_test,x_train,y_train,number_of_neighbors,distance_metrics):
    
    x_test_prediction = [] # Create a list to append predictions
   
    for idx,value in enumerate(x_test):
        out_predict = predict(value,x_train,y_train,number_of_neighbors,distance_metrics)
        x_test_prediction.append(out_predict)
        
    return x_test_prediction

The line output = max(y_neighbors,key=y_neighbors.count) counts the number of 1’s and 0’s among the first neighbors and returns the label with more repetitions. We have also built the function knn(), which is responsible for the overall prediction: it loops over the rows of x_test and calls predict() for each feature vector.
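
For reference, the same majority vote could be written with collections.Counter (a sketch; the helper name majority_label is introduced here for illustration):

from collections import Counter

def majority_label(y_neighbors):
    # Flatten the one-element label arrays to plain ints and pick the most common label
    labels = [int(np.asarray(label).item()) for label in y_neighbors]
    return Counter(labels).most_common(1)[0][0]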

Our results suggest that the Hamming distance is more accurate when running the function predict() with x_train[0] as the testing reference for k=5 (recall from dataset[:5] that the true label y_train[0] is 1):

predict(x_train[0],x_train,y_train,k,euclidean_distance)
<Out>: array([0])

predict(x_train[0],x_train,y_train,k,hamming_distance)
<Out>: array([1])

We can backtest the prediction with x_train[0] as the test data by varying the k from 1 to 40:

hamming_prediction = []
number_of_k = 40
for value in range(number_of_k):
    hamming = predict(x_train[0],x_train,y_train,value+1,hamming_distance)
    hamming_prediction.append(hamming)

print(hamming_prediction)

But how to estimate the accuracy? 

We have built the function find_accuracy() to estimate our prediction accuracy, as well as another function, plot(), to visualize the distribution of predictions obtained with both the Hamming and Euclidean metrics:

#Function to estimate the accuracy 
def find_accuracy(y_pred,y_test):
    acc = np.mean(y_pred == y_test)
    return acc
  
# Plot the scattering with 3 colors:
def plot(x_label_0,x_label_1,method):
    fig = plt.figure(figsize=(7,5))
    xtrain1=plt.scatter(x_label_1[:,0],x_label_1[:,1],s=100,alpha=0.5,color='orange')
    xtrain0=plt.scatter(x_label_0[:,0],x_label_0[:,1],s=100,alpha=0.5,color ='green')
    plt.xlim([0,1])
    plt.ylim([0,1])
    plt.title('Prediction of Red dots for k=5'+ method)
    plt.xlabel('First Column',size=17)
    plt.ylabel('Second Column',size=17)
    plt.legend((xtrain1,xtrain0),
               ('x_split_'+method+'_label_1','x_split_'+method+'_label_0'),
               scatterpoints=1,
               ncol=1,
               fontsize=12,
               loc='best')
    plt.show()
    fig.savefig(method+'prediction.png',dpi=resol)
We can now consider the prediction of the first ten feature vectors x_train[:10] using the Euclidean distance, and then run a loop over the same ten feature vectors with the Hamming distance for k varying from 1 to 40:
y_pred = knn(x_train[:10],x_train,y_train,k,euclidean_distance)
find_accuracy(y_pred,y_train[:10]) 
<Out>: 0.6

k_max = 40
max_lim = 10
accuracy = []
for idx in range(k_max):
    y_pred = knn(x_train[:max_lim],x_train,y_train,idx+1,hamming_distance)
    acc = find_accuracy(y_pred,y_train[:max_lim])
    accuracy.append(acc)

print(accuracy)
<Out>:[1.0, 1.0, 0.7, 1.0, 0.8, 0.9, 
       0.8, 0.9, 0.9, 0.9, 0.8, 0.9, 
       0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 
       0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 
       0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 
       0.7, 0.7, 0.7, 0.8, 0.7, 0.7, 
       0.7, 0.8, 0.7, 0.7]
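
To make this backtest easier to read, we could also plot the accuracy against k (a minimal sketch; the figure file name is illustrative):

fig = plt.figure(figsize=(7,5))
plt.plot(range(1,k_max+1),accuracy,marker='o',color='purple')
plt.xlabel('Number of first neighbors k',size=17)
plt.ylabel('Accuracy',size=17)
plt.show()
fig.savefig('accuracy_vs_k.png',dpi=resol)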

We conclude that, for the first ten elements, the Hamming distance reaches higher prediction rates across k=1,…,40 than the Euclidean distance when the whole dataset is used for training. Additionally, we classify x_test with both the Hamming and Euclidean distances to visualize our predictions.

y_pred_euclidean = knn(x_test,x_train,y_train,k,euclidean_distance)
y_pred_hamming = knn(x_test,x_train,y_train,k,hamming_distance)
x_test_prediction = np.concatenate((x_test,y_pred_euclidean),axis=1)
x_test_prediction # This is our prediction
<Out>: array([[0.28118562, 0.33542246, 1.        ],
              [0.31755432, 0.56551063, 0.        ],
              [0.65543402, 0.84481462, 0.        ],
              [0.37989835, 0.93799504, 1.        ],
              [0.38106703, 0.42334526, 1.        ],
              [0.41426164, 0.76466165, 1.        ],
              [0.73779091, 0.42234548, 0.        ],
              [0.34263992, 0.0824606 , 0.        ],
              [0.25380377, 0.07552576, 1.        ],
              [0.33259554, 0.18939138, 0.        ]])

But how can we visualize those predictions?

Simply by splitting the Euclidean and Hamming predictions into datasets with labels 0 and 1 to facilitate the plot of the distributions:

# split the data to plot (Euclidean distance prediction)
x_split_euclidean_label_0,x_split_euclidean_label_1 = \
      split_x_binary(x_test,y_pred_euclidean)

# split the data to plot (Hamming distance prediction)
x_split_hamming_label_0,x_split_hamming_label_1 = \
      split_x_binary(x_test,y_pred_hamming)

plot_scattering(x_label_0,x_label_1,x_test)
plot(x_split_euclidean_label_0,x_split_euclidean_label_1,' Euclidean ')
plot(x_split_hamming_label_0,x_split_hamming_label_1,' Hamming ')

Finally, we have classified our predictions for the random dataset x_test into two classes, 1 (orange) and 0 (green), by considering the k=5 first neighbors and two different distance metrics. When applied to our dataset, the Hamming distance metric showed higher accuracy for the predictions while backtesting over multiple values of k. In this article, you could visualize the configuration space with only two features, where the colors represent the labels. The functions we built will enable us to explore higher-dimensional datasets (multiple features).

The following articles will dive into two different datasets: the Iris flower dataset and the Biopsy Data on Breast Cancer Patients. We will use stochastic cross-validation, ROC curves, and confusion matrices to analyze them.

Did you enjoy this article? Subscribe to our Newsletter.

You will receive notifications with fresh content, new courses and other activities.

