Last Update: 01 Jan 2022
Welcome to Part III of the k-NN algorithm from Scratch: Application.
This session will explore the k-fold cross-validation procedure with an application of the k-NN algorithm using the IRIS flower dataset.
We will also use the built-in functions developed in my previous quick articles. Please consider reading both k-NN algorithm from Scratch: Part I and Part II, if you feel uncomfortable proceeding directly into Part III.
This sequence of posts will guide you towards understanding and building the k-Nearest Neighbours (k-NN) algorithm from scratch using Python for two different purposes: classification and regression. It is also part of the upcoming Machine Learning and Deep Learning course at Aicavity Learn Platform.
Subscribe to Aicavity Connect Newsletter, and you will receive notifications about fresh content, new courses, and other activities.
What is our goal?
We are aiming to explore the Biopsy Data on Breast Cancer Patients in the last article and use the k-NN to “predict” if some of the patients have developed a “malign” or “benign” breast cancer. We will also perform tuning procedures to enhance the prediction accuracy of these recommendations. However, in this article, we will explore the IRIS flowers dataset. First, download the dataset by clicking on the link below.
Download Iris Flower Database (GitHub)
The Iris Flowers is a well-known database in the pattern recognition community, and it first appeared in the seminal paper by Fisher [1]. Although we do not need this article throughout our experiment, if you are interested, I recommend you go through the references [1-5].
What do I need to know?
This dataset contains three classes with 50 plants (rows) each, where each class is assigned to one type of plant. Therefore, we are presented with 150 rows composed of four features (attributes) each.
What are those features?
1. sepal length in cm; 2. sepal width in cm; 3. petal length in cm; 4. petal width in cm; 5. class: (i) Iris Setosa, (ii) Iris Versicolour, and (iii) Iris Virginica
For more information, please download the dataset and read the file iris. names.
First, we will use the function open_csv() to load our dataset, split it into 150 rows with four feature vectors (x_train), along with classes that can be 0,1, or 2. It would be best to substitute the path assigned to the variable filename according to the folder (/iris.csv). Line 14 from open_csv(), [dataset.append(row) for row in file if row] will filter only the rows without variables holding the value None.
# K-nearest neighbors on the Iris Flowers Dataset
# Import Libraries and iris.csv file
import csv
import numpy as np
np.random.seed(2)
# Define a Function to load the csv files:
def open_csv():
filename = 'iris.csv'
dataset = [] # List for appending values in the loop
with open(filename, 'r') as csvfile:
file = csv.reader(csvfile) #(delimiter=' ', quotechar = '|')
[dataset.append(row) for row in file if row]
dataset = np.asarray(dataset)
return dataset
dataset = open_csv()
The following functions were partially introduced in Part II of this series. It is essential to highlight that the normalization of data_input is done by broadcasting operations over the vector data_input[:]. The x_train and y_train are defined by splitting the dataset into x_train and y_train (label vector). Line 19 will filter the dataset columns corresponding to the labels. Further, lines 23-28 create a set to transform the classes into numbers 0, 1, and 2.
Additionally, we have a loop to generate a dictionary with indexes and keys from the set. Subsequentially, we modify the column with the classes based on the values of this dictionary. In lines 35-37, we split the dataset into x_train and y_train, as follows:
# Find Maximum and Minimum values for each column and normalize the feature vectors
def normalize_data(data_input):
min_max_vector = []
[min_max_vector.append([np.min(data_input[:][:,i]),np.max(data_input[:][:,i])])
for i in range(data_input.shape[1])]
min_max_vector = np.asarray(min_max_vector)
# Normalization of the data_input by Broadcasting operations over data_input[:]
normalized_dataset = (data_input[:]-min_max_vector[:][:,0])/(min_max_vector[:][:,1]-min_max_vector[:][:,0])
return normalized_dataset
# Initialize Dataset by splitting into x_train and y_train (label vector)
def initialize(dataset):
# Define the Column with classification Strings or Integers
classes_column = (dataset.shape[1]-1)
class_values = dataset[:][:,classes_column]
find_classes = set(class_values) # Filter the classes and create a set.
dict_classes = dict() # generate a dictionary
# Create dictionary with classes 0,1,2 ..., n.
for idx, key in enumerate(find_classes): # Loop to fill the dictionary with index and key from the set
dict_classes[key] = idx
for row in dataset: # modify the dataset class column based on the dictionary value
row[classes_column] = dict_classes[row[classes_column]]
dataset = dataset.astype(np.float) # Transform array items (dataset) from string to float
# Splitting the dataset into x_train and y_train (Classification)
y_train = dataset[:][:,(dataset.shape[1]-1)].astype(np.int)
x_train = dataset[:][:,:(dataset.shape[1]-1)].astype(np.float)
return dataset,x_train,y_train
After loading these functions and assigning the variables data, x_train ( 4 feature vectors ), y_train (classes), we can compute the first five normalized feature vectors from dataset x_train:
# Extract data, x_train, and y_train through the inicialization of parameters.
data,x_train,y_train = initialize(dataset)
# Normalize the input
x_train = normalize_data(x_train)
print(x_train[:5])
<Out>: [[0.22222222 0.625 0.06779661 0.04166667]
[0.16666667 0.41666667 0.06779661 0.04166667]
[0.11111111 0.5 0.05084746 0.04166667]
[0.08333333 0.45833333 0.08474576 0.04166667]
[0.19444444 0.66666667 0.06779661 0.04166667]]
We will use a “stochastic” approach for the k-NN by generating random indexes in the format of a matrix with multiple folds (batches) to perform the k-fold cross-validation.
def k_fold_cross_validation(x_train,y_train,number_of_folds):
start = 0
size = len(x_train)
size_of_fold = int(len(x_train)/number_of_folds)
index_matrix = np.random.randint(start,size,size=(number_of_folds,size_of_fold))
x_split = x_train[index_matrix]
y_split = y_train[index_matrix]
return x_split,y_split
As a reminder, our dataset is composed of 150 rows, and if we define the number_of_folds = 10, then the size_of_fold = int(len(x_train)/number_of_folds) = 15. In this case we will have 15 batches with 10 rows and 4 features vectors.
x_split.shape,y_split.shape
<Out>: ((10,15,4), (10,15))
print(x_split[1])
<Out>: [[0.16666667 0.45833333 0.08474576 0. ]
[0.77777778 0.41666667 0.83050847 0.83333333]
[0.19444444 0.66666667 0.06779661 0.04166667]
[0.83333333 0.375 0.89830508 0.70833333]
[0.58333333 0.5 0.59322034 0.58333333]
[0.55555556 0.375 0.77966102 0.70833333]
[0.02777778 0.41666667 0.05084746 0.04166667]
[0.33333333 0.91666667 0.06779661 0.04166667]
[0.63888889 0.375 0.61016949 0.5 ]
[0.66666667 0.54166667 0.79661017 0.83333333]
[0.41666667 0.29166667 0.52542373 0.375 ]
[0.36111111 0.20833333 0.49152542 0.41666667]
[0.36111111 0.41666667 0.52542373 0.5 ]
[0.22222222 0.75 0.10169492 0.04166667]
[0.38888889 0.41666667 0.54237288 0.45833333]]
The next step is simply to introduce the set of built-in functions that we have previously developed in Part II. These functions are: (i) euclidean_distance(), (ii) find_accuracy(), (iii) find_neighbors(), (iv) predict() and (v) knn().
# Euclidean distance metrics
def euclidean_distance(x_row,y_row):
output_distance = np.sqrt(np.sum((x_row-y_row)**2)) # Element-wise sum and square root
return output_distance
# Accuracy between Forecasting and Testing classes - np.array vectors
def find_accuracy(y_pred,y_test):
acc = np.mean(y_pred == y_test)
return acc
### Approach using for loop
def find_neighbors(data_test,data_train,y_train,number_of_neighbours):
new_distance_list = np.array([])
distance_list = [] # List to append euclidian distances along with the row variables
for idx,value in enumerate(data_train):
distance = euclidean_distance(data_test,value) # Euclidian distances or #distance_list.append((data_test,row,distance))
distance_list.append((value,distance,y_train[idx])) # Append the distances and respective rows to a list
distance_list = np.array(distance_list)
new_distance_list = distance_list[distance_list[:,1].argsort(kind='mergesort')] # Sort rows based on distance
# Select the first neighbours
x_first_neighbours = new_distance_list[:number_of_neighbours,:new_distance_list.shape[1]-2] # Select only the first neighbours
y_first_neighbours = new_distance_list[:number_of_neighbours,new_distance_list.shape[1]-1]
return x_first_neighbours,y_first_neighbours
def predict(data_test, data_train,y_train,number_of_neighbors):
x_neighbors,y_neighbors = find_neighbors(data_test,data_train,y_train,number_of_neighbors)
#doutput = [row[-1][data_test.shape[0]-1] for row in x_neighbors] # in case y_train is attached to the vector
y_neighbors = list(y_neighbors)
output = max(y_neighbors,key=y_neighbors.count)
return output
# Algorithm to estimate the prediction
def knn(x_train,y_train,x_test,number_of_neighbors):
x_test_prediction = []
#print(x_train[:5],y_train[:5],x_test,number_of_neighbors) # Try to see what comes out from here!
if(len(x_test)==4):
out_predict = predict(x_test,x_train,y_train,number_of_neighbors)
return(out_predict)
else:
for idx,value in enumerate(x_test):
out_predict = predict(value,x_train,y_train,number_of_neighbors)
x_test_prediction.append(out_predict)
#List with value and out_predict
#x_test_prediction.append((value,out_predict))
return x_test_prediction
return None
It is high time to play with the k-fold cross-validation!
Before starting, I highly recommend checking Chapter 7 of Hastie et al. [7,11] for a more in-depth introduction.
The k-fold cross-validation is usually used with k-NN as a tuning procedure for the hyper-parameter k. For the k-NN, the latter can determine which value of k provides more accurate predictions. The main idea is to measure the error rates for different random batches of our dataset using different values of k. The optimized k corresponds to the lowest error rate among those different batches. In our case, we randomly split the 150 rows dataset into ten folds (batches) with 15 rows each. Hence, we first run Iteration 1 using the first fold (batch 1 in red color) as the validation dataset and use folds 2-10 as the training dataset. Iteration 2 corresponds to fold 2 (batch 2 in red color) as the validation dataset, while folds 1,3,4,..,10 represent the training dataset. This procedure continued until the validation dataset was assigned to all possible folds of our dataset, which means N=10 times. Therefore, we can estimate the error rate by holding a subset of the training dataset as the validation dataset at each iteration. The last step is to take an average over those ten predictions. The figure below shows a diagram of the k-fold cross-validation approach:

Therefore, we can evaluate the performance of the k-NN algorithm by exploring the k-fold cross-validation:
# Evaluate an algorithm using a cross validation split
def cross_validation(x_train,y_train, number_of_folds, number_of_neighbors):
accuracy = []
average_acc = np.array([])
x_cross,y_cross = k_fold_cross_validation(x_train,y_train,number_of_folds)
# Transform x_cross into shape (150,4) and y_cross into shape (150,)
# x_cross_train,y_cross_train = np.reshape(x_cross,((-1,x_cross.shape[2]))),np.reshape(y_cross,([-1]))
inv_matr_identity = np.where(np.identity(number_of_folds,dtype=int)==1, False, True)
for idx in range(number_of_folds):
x_fold_train,y_fold_train = x_cross[inv_matr_identity[idx]],y_cross[inv_matr_identity[idx]]
x_t,y_t = np.reshape(x_fold_train,(-1,x_cross.shape[2])),np.reshape(y_fold_train,((-1)))
forecast = knn(x_cross[idx],x_t,y_t,number_of_neighbors)
print('Prediction Fold:',idx,forecast)
acc = find_accuracy(forecast,y_cross[idx])
accuracy.append(acc)
average_acc = np.average(np.asarray(accuracy))
print('Average Prediction',average_acc)
return x_cross,y_cross,accuracy,average_acc
The cross_validation() first generates the x_cross and y_cross by loading the function k_fold_cross_validation(). Furthermore, we build an identity matrix with the size = number_of_folds filled with booleans True=0 and False=1. This allows us to extract each fold independently from the x_cross during the loop, and to compute the cross-validation through the Line:
inv_matr_identity =
np.where(np.identity(number_of_folds,dtype=int)==1, False, True)
The next step is to create a loop through the folds and use the output from knn() as an input to find_accuracy() to forecast each fold’s error rates.
ps: We have divided the dataset into five folds with 30 elements each. These folds are based on random indexes from generated from the dataset with 150 vectors. We can see five Folds with their respective predictions in the next cell. These vectors are compared to the original vector from the dataset, which gives ~97% of accuracy using k=5 first neighbors.
number_of_folds = 5
number_of_neighbors = 5
x_c,y_c,acc_list,ave_acc
= cross_validation(x_train,y_train,number_of_folds,number_of_neighbors)
<Out>: Prediction Fold: 0 [2, 2, 0, 2, 2, 0, 1, 0, 0, 1, 0, 0, 2, 2, 2, 1, 2, 0, 2, 0, 2, 1, 2, 2, 1, 1, 0, 0, 0, 1]
<Out>: Prediction Fold: 1 [2, 2, 1, 2, 2, 0, 2, 2, 1, 1, 0, 1, 1, 1, 0, 1, 1, 2, 2, 1, 1, 0, 0, 1, 0, 2, 2, 1, 2, 0]
<Out>: Prediction Fold: 2 [0, 0, 0, 2, 2, 1, 2, 1, 0, 0, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 2, 1, 0, 2, 2, 2, 0, 0, 1]
<Out>: Prediction Fold: 3 [2, 1, 1, 2, 2, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 1, 1, 0, 2, 2, 0, 0, 1, 0, 1, 0]
<Out>: Prediction Fold: 4 [1, 2, 1, 1, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 0, 0, 1, 2, 1, 0, 0, 1, 0, 0, 0, 0, 2, 2, 0, 2]
<Out>: Average Prediction 0.9666666666666668
How should we proceed?
We can reshape x_c and y_c for single datasets and verify if it is possible to predict the outcome of one feature vector x_train[43]:
x_cross_train,y_cross_train = np.reshape(x_c,((-1,x_c.shape[2]))),np.reshape(y_c,((-1)))
y_train[43]
<Out>: 1
x_test = x_train[43]
knn(x_test,x_train,y_train,number_of_neighbors=4)
<Out>: 1
Let’s also double the predictions for x_test =x_train[43] using the output from the last call of cross_validation():
batch_prob = []
for idx,batch in enumerate(x_c):
prob = knn(x_test,batch,y_c[idx],number_of_neighbors=4)
batch_prob.append(prob)
print(batch_prob,max(batch_prob ,key=batch_prob.count))
<Out>: [1, 1, 1, 1, 1] 1
How can we tune the hyper-parameter k?
Our last endeavor will be to tune the parameter k by considering multiple epochs, the k-fold cross-validation. We can generate random folds for each epoch and consider k in the interval [1,…,40].
number_of_folds = 5
epoch = 100
acc_k = []
acc_epoch = []
k_range = 40
for value in range(epoch):
for number_of_neighbors in range(k_range):
print("Epoch=",value,"k=",number_of_neighbors)
x_c,y_c,acc_list,ave_acc= cross_validation(x_train,y_train,number_of_folds,number_of_neighbors+1) #start from 0
acc_epoch.append(ave_acc)
<Out>: Epoch= 0 k= 0
<Out>: Prediction Fold: 0 [2, 0, 0, 0, 0, 2, 1, 0, 0, 2, 1, 0, 1, 1, 1, 2, 1, 0, 0, 2, 2, 1, 0, 0, 0, 1, 0, 2, 0, 1]
<Out>: Prediction Fold: 1 [0, 0, 1, 1, 0, 2, 0, 1, 0, 2, 1, 1, 1, 2, 0, 0, 0, 0, 2, 2, 2, 1, 0, 0, 2, 1, 1, 0, 1, 2]
<Out>: Prediction Fold: 2 [0, 0, 0, 2, 2, 0, 0, 1, 1, 1, 1, 1, 0, 2, 2, 0, 0, 0, 0, 2, 2, 0, 2, 2, 0, 1, 2, 0, 2, 2]
<Out>: Prediction Fold: 3 [2, 0, 2, 0, 0, 1, 0, 0, 0, 1, 0, 1, 2, 0, 1, 2, 0, 2, 1, 1, 0, 2, 2, 2, 0, 2, 1, 2, 2, 0]
<Out>: Prediction Fold: 4 [2, 1, 0, 0, 0, 2, 0, 0, 1, 2, 0, 2, 1, 0, 0, 1, 2, 0, 0, 0, 1, 1, 2, 0, 0, 1, 2, 2, 2, 1]
<Out>: Average Prediction 0.9800000000000001
..... until Epoch=100, k=40
How can we visualize the predictions?
As shown below, we can plot the results to visualize the accuracy vs. k (number of First Neighbors).
acc = np.asarray(acc_epoch)
acc=acc.reshape((-1,k_range))
print(acc)
array([[0.98 , 0.97333333, 0.97333333, ..., 0.94 , 0.94666667,
0.92666667],
[0.99333333, 0.96 , 0.98 , ..., 0.92666667, 0.94 ,
0.92666667],
[0.97333333, 0.98 , 0.97333333, ..., 0.93333333, 0.95333333,
0.9 ],
...,
[0.96 , 0.98 , 0.96666667, ..., 0.94666667, 0.94 ,
0.95333333],
[0.98666667, 0.98 , 0.97333333, ..., 0.92 , 0.89333333,
0.9 ],
[0.98 , 0.97333333, 0.96666667, ..., 0.94 , 0.92 ,
0.90666667]])
<In>: acc.shape # 5 epochs means 5 accuracies for each k, k=40
<Out>: (100, 40)
<In>: average = np.average(acc,axis=0)
<In>: np.std(acc,axis=0)
array([0.0110522 , 0.01171419, 0.01424921, 0.01294003, 0.01616842,
0.01398714, 0.01642911, 0.01565333, 0.01749463, 0.01541875,
0.01652527, 0.01483824, 0.01668785, 0.01601041, 0.01892335,
0.01634571, 0.01635658, 0.01689431, 0.01857118, 0.0149957 ,
0.01762561, 0.01872313, 0.01987472, 0.01785609, 0.01995317,
0.01648959, 0.02001988, 0.0215071 , 0.02129695, 0.01875444,
0.02148736, 0.01998889, 0.01848615, 0.01862794, 0.01952503,
0.02003819, 0.01947888, 0.0175741 , 0.02172106, 0.02007486])
# Generate a and aa for plotting
a=np.arange(40)
aa = [a]*epoch
aa.shape
<Out>: (100, 40)
# Import Matplotlib Library
import matplotlib.pyplot as plt
resol = 600 # Image resolution to save in dpi
fig = plt.figure(figsize=(9,7)) # definition of the size
plt1=plt.scatter(aa,acc,s=150,alpha=0.1,color='blue')
plt2=plt.scatter(a,average,s=150,alpha=.7,color='orange')
plt.plot(a,average,color='black')
plt.title('Average Prediction for 100 Epochs',fontsize=20) # Title and FontSize
plt.xlabel('k First Neighbors',size=20) # x-axis label and size
plt.ylabel('Accuracy',size=20) # y-axis label and size
plt.legend((plt1,plt2), #legend
('Distribution','Average'),
scatterpoints=1,
ncol=2,
fontsize=15,
loc='lower left')
plt.show(fig)
fig.savefig('prediction.png',dpi=resol)
This set of commands will generate the following figure:

Despite the stochasticity of our predictions, we have found a slight decrease in the average accuracy as the value of k increases. Future articles will discuss the bias-variance tradeoff, ROC curve, and the Confusion matrix for tuning the model’s parameters. In this session, you were able to run the k-fold cross-validation with the k-NN model to tune the hyper-parameter k. We have used an open-source dataset, the Iris Flower dataset, a high-dimensional dataset (multiple features) with three distinct classes. You have also built multiple functions using Python 3 and Numpy to perform predictions from scratch. The following article will dive into the Biopsy Data on Breast Cancer Patients. We will also explore the application add tuning procedures ROC curves and Confusion Matrix.
Resources:
[1] Fisher,R.A. “The use of multiple measurements in taxonomic problems” Annual Eugenics, 7, Part II, 179-188 (1936); also in “Contributions to Mathematical Statistics” (John Wiley, NY, 1950).
[Web Link]
[2] Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis. (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
[Web Link]
[3] Dasarathy, B.V. (1980) “Nosing Around the Neighborhood: A New System Structure and Classification Rule for Recognition in Partially Exposed Environments”. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-2, No. 1, 67-71.
[Web Link]
[4] Gates, G.W. (1972) “The Reduced Nearest Neighbor Rule”. IEEE Transactions on Information Theory, May 1972, 431-433.
[Web Link]
[5] See also: 1988 MLC Proceedings, 54-64.
[6] http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html;
[7] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer 2009
[8] L. Breiman, P. Spector Submodel selection and evaluation in regression: The X-random case, International Statistical Review 1992;
[9] R. Kohavi, A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, Intl. Jnt. Conf. AI
[10] R. Bharat Rao, G. Fung, R. Rosales, On the Dangers of Cross-Validation. An Experimental Evaluation, SIAM 2008;
[11] G. James, D. Witten, T. Hastie, R Tibshirani, An Introduction to Statistical Learning, Springer 2013.
IRIS Flower Dataset: http://archive.ics.uci.edu/ml/datasets/Iris (Extensive list of articles)
Did you enjoy this article? Subscribe to our Newsletter.
You will receive notifications with fresh content, new courses and further activities.