Quick Overview: Cross-Validation, Regression, Classification, Confusion Matrix and ROC.

Last Update: 05 Aug 2022

Welcome to the k-NN algorithm from Scratch: Biopsy Data on Breast Cancer Patients.

These sequential articles guide you through developing the k-Nearest Neighbours (k-NN) algorithm from scratch in Python. This session explores tuning procedures such as k-fold cross-validation, the application of k-NN to regression and classification, and the interpretation of the Confusion Matrix and ROC curve. We will also use the sklearn package for regression.

Please consider reading my previous posts entitled k-NN algorithm from Scratch: Part I, Part II, and Part III.

What is our goal?

The goal is to explore the Biopsy Data on Breast Cancer Patients and use the k-NN algorithm to “predict” whether patients have developed a “malignant” or “benign” breast cancer. We will also perform tuning procedures and analyze how to enhance the prediction accuracy of the model.

What is the Biopsy Data on Breast Cancer Patients?

Dr. William H. Wolberg obtained the breast cancer database at the University of Wisconsin Hospitals. It contains biopsies of breast tumors from 699 patients collected up to July 1992. The dataset includes nine features (attributes), each scored on a scale of 1 to 10, and an outcome class of malignant or benign breast cancer. Therefore, we have data from 699 patients (rows) and 11 columns. You can download the dataset below:

Biopsy Data on Breast Cancer Patients (Github)

This dataset contains the following columns:

ID: sample code number (not unique).

V1: clump thickness.

V2: uniformity of cell size.

V3: uniformity of cell shape.

V4: marginal adhesion.

V5: single epithelial cell size.

V6: bare nuclei (16 values are missing).

V7: bland chromatin.

V8: normal nucleoli.

V9: mitoses.

class: “benign” or “malignant”.

Our first step is to split the dataset into a training dataset with the variables x_train and y_train (binary labels, 0 or 1) and a validation dataset x_test and y_test. The dataset is loaded with pandas and normalized with the function normalize_data(), as follows:

# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv
np.random.seed(2) # Fix the seed so the random splits and cross-validation matrices are reproducible

folder = 'dataset-biopsy/'
filename = 'biopsy.csv'
def load_biopsy():
    
    # Import the Biopsy Dataset
    biopsy = pd.read_csv(folder+filename, na_values='?', 
                         dtype={'ID': str}).dropna().reset_index()
    
    # Split in training and test data
    trainI = np.random.choice(biopsy.shape[0], size=300, replace=False)    
    trainIndex=biopsy.index.isin(trainI)    
    train=biopsy.iloc[trainIndex] # This is the Training Dataset
    test=biopsy.iloc[~trainIndex] # Test Dataset
    
    # Extract relevant data features
    X_train = train[['V1','V2','V3','V4','V5','V6','V7','V8','V9']].values
    X_test = test[['V1','V2','V3','V4','V5','V6','V7','V8','V9']].values    
    Y_train=(train['class']=='malignant').astype(int).values.reshape((-1,1))
    Y_test=(test['class']=='malignant').astype(int).values.reshape((-1,1))
    
    return X_train, Y_train, X_test, Y_test

# This function finds the minimum and maximum values of each column
# and applies min-max normalization to the whole dataset.

def normalize_data(data_input):
    # Column-wise minimum and maximum values
    col_min = np.min(data_input, axis=0)
    col_max = np.max(data_input, axis=0)
    
    # Min-max normalization by broadcasting over the rows
    normalized_dataset = (data_input - col_min) / (col_max - col_min)
        
    return normalized_dataset

Let us check the function load_biopsy(). The call np.random.choice(biopsy.shape[0], size=300, replace=False) generates 300 distinct random indexes over the cleaned biopsy DataFrame (683 rows remain after dropping the 16 rows with missing values). Next, biopsy.index.isin(trainI) creates a boolean vector that is True at the positions listed in trainI. The two biopsy.iloc[...] lines then split the dataset biopsy into training and testing datasets, where the training set has 300 rows and the testing set has 383.

It is essential to highlight that the two lines using (train['class']=='malignant').astype(int) transform the classes into binary vectors, where malignant = 1 and benign = 0.

For the training dataset, we will have:

print((train['class']=='malignant').astype(int).values)
array([0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1])

Now we can load and normalize both datasets x_train and x_test. You could also shrink x_train from its 300 samples to only 20% of the whole dataset; it is interesting to play with this parameter and try to tune your model given a small training dataset (a minimal sketch follows the next code block).

# Load function and extract the training and testing dataset.
x_train, y_train, x_test, y_test = load_biopsy()
x_train,x_test = normalize_data(x_train),normalize_data(x_test)
x_train[:3],x_train.shape,x_test.shape

(array([[0.44444444, 0.33333333, 0.33333333, 0.44444444, 0.66666667,
         1.        , 0.22222222, 0.11111111, 0.        ],
        [0.22222222, 0.        , 0.        , 0.        , 0.11111111,
         0.11111111, 0.22222222, 0.        , 0.        ],
        [0.55555556, 0.77777778, 0.77777778, 0.        , 0.22222222,
         0.33333333, 0.22222222, 0.66666667, 0.        ]]), (300, 9), (383, 9))
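As suggested above, here is a minimal sketch of the small-training-set experiment; the names subset, x_train_small, and y_train_small are hypothetical and not used elsewhere in this article:

# Hypothetical experiment: shrink the training set to ~20% of the whole
# dataset (683 rows, so 136 samples) before re-running the tuning below.
subset_size = int(0.2 * 683)  # 136 samples
subset = np.random.choice(x_train.shape[0], size=subset_size, replace=False)
x_train_small, y_train_small = x_train[subset], y_train[subset]
print(x_train_small.shape, y_train_small.shape)  # (136, 9) (136, 1)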

The next step is to write the functions developed in Part III. For 50 folds and k=5 first neighbors, the cross-validation accuracy is ~63%.

# k-fold cross-validation: builds number_of_folds random folds.
# Note that np.random.randint samples indices WITH replacement,
# so folds may overlap and some rows may never be drawn.
def k_fold_cross_validation(x_train,y_train,number_of_folds):
    start = 0
    size = len(x_train)
    size_of_fold = int(len(x_train)/number_of_folds)
    index_matrix = np.random.randint(start,size,size=(number_of_folds,size_of_fold))
    x_split = x_train[index_matrix]
    y_split = y_train[index_matrix]
    return x_split,y_split
  
# Compute accuracy; flattening both inputs avoids an unintended
# (n,) vs (n,1) broadcast when their shapes differ.
def find_accuracy(y_pred,y_test):
    acc = np.mean(np.asarray(y_pred).reshape(-1) == np.asarray(y_test).reshape(-1))
    return acc

# Euclidean metric
def euclidean_distance(x_row,y_row):
    output_distance = np.sqrt(np.sum((x_row-y_row)**2)) # Square root of the sum of squared differences
    return output_distance
  
# Find the first neighbors
def find_neighbors(data_test,data_train,y_train,number_of_neighbours):
    distance_list = [] # List of (row, distance, label) tuples
    for idx,value in enumerate(data_train):
        distance = euclidean_distance(data_test,value) # Euclidean distance to the test point
        distance_list.append((value,distance,y_train[idx]))
    
    # dtype=object is needed because the tuples mix rows, scalars, and labels
    distance_list = np.array(distance_list, dtype=object)
    new_distance_list = distance_list[distance_list[:,1].argsort(kind='mergesort')] # Sort rows by distance
    
    # Select the first neighbours: column 0 holds the rows, column 2 the labels
    x_first_neighbours = new_distance_list[:number_of_neighbours,:new_distance_list.shape[1]-2]
    y_first_neighbours = new_distance_list[:number_of_neighbours,new_distance_list.shape[1]-1]
    
    return x_first_neighbours,y_first_neighbours
  
# predict() runs find_neighbors() and outputs the most common label
def predict(data_test, data_train,y_train,number_of_neighbors):
    x_neighbors,y_neighbors = find_neighbors(data_test,data_train,y_train,number_of_neighbors)
    y_neighbors = list(y_neighbors)
    output = max(y_neighbors,key=y_neighbors.count)
    return output
  
# knn() is responsible for making the predictions
def knn(x_test,x_train,y_train,number_of_neighbors):
    x_test_prediction = []
    if(len(x_test)==9): # A single sample (9 features) was passed
        out_predict = predict(x_test,x_train,y_train,number_of_neighbors)
        return(out_predict)
    else:
        for idx,value in enumerate(x_test):
            out_predict = predict(value,x_train,y_train,number_of_neighbors)
            x_test_prediction.append(out_predict)
        return x_test_prediction
     
# Evaluate an algorithm using a cross validation split
def cross_validation(x_train,y_train, number_of_folds, number_of_neighbors):
    accuracy = []
    average_acc = np.array([])
    x_cross,y_cross = k_fold_cross_validation(x_train,y_train,number_of_folds)
    inv_matr_identity = np.where(np.identity(number_of_folds,dtype=int)==1, False, True) # Boolean mask that excludes the current fold
    
    for idx in range(number_of_folds):
        x_fold_train = x_cross[inv_matr_identity[idx]]
        y_fold_train = y_cross[inv_matr_identity[idx]]
        x_t = np.reshape(x_fold_train,(-1,x_cross.shape[2]))
        y_t = np.reshape(y_fold_train,(-1))
        forecast = knn(x_cross[idx],x_t,y_t,number_of_neighbors)
        print('Prediction Fold:',idx,forecast)
        acc = find_accuracy(forecast,y_cross[idx])
        accuracy.append(acc)
    average_acc = np.average(np.asarray(accuracy))
    print('Average Prediction',average_acc)
    return x_cross,y_cross,accuracy,average_acc
number_of_folds = 50
number_of_neighbors = 5
x_c, y_c, acc_list, ave_acc = cross_validation(x_train, y_train,
                                               number_of_folds, number_of_neighbors)

<Out>: ... Prediction Fold: 49 [1, 1, 0, 1, 0, 1]
Average Prediction 0.633333333333333
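Part of the explanation is how the folds are built: k_fold_cross_validation() draws indices with np.random.randint, i.e. with replacement, so folds overlap and some rows are never sampled. A quick sketch, using the same parameters, to verify this:

# The index matrix below is built exactly like the one inside
# k_fold_cross_validation(). Drawing 300 indices with replacement is
# expected to cover only about 1 - (1 - 1/300)**300 ≈ 63% of the rows.
idx_matrix = np.random.randint(0, len(x_train), size=(50, len(x_train)//50))
print('distinct rows sampled:', len(np.unique(idx_matrix)), 'of', idx_matrix.size, 'draws')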

The low cross-validation accuracy suggests that small batches of random samples taken from x_train are not enough for our model to predict the outcomes of the validation folds. In contrast, we find ~97% accuracy with k=5 when testing the model on x_test using the whole x_train. It is crucial to notice that the model could be overfitting if the accuracy on the training dataset is much higher than the accuracy on the testing dataset. Let us try the following:

y_pred = knn(x_test,x_train,y_train,number_of_neighbors=5)

# Predict x_test with each fold's batch and keep the most common prediction vector
batch_prob = []
for idx,batch in enumerate(x_c):
    prob = knn(x_test,batch,y_c[idx],number_of_neighbors)
    batch_prob.append(prob)

max_batch = max(batch_prob,key=batch_prob.count)
find_accuracy(max_batch,y_test)
<Out>: 0.6501305483028721

accuracy = find_accuracy(y_test,y_pred)
print(accuracy)
<Out>: 0.9686684073107049
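To make the overfitting check concrete, here is a minimal sketch that also evaluates the model on its own training data; comparing that value with the ~97% test accuracy above tells us whether the gap points to overfitting:

# Caveat: with k-NN, each training point is its own nearest neighbour,
# so evaluating on the training data gives an optimistic estimate.
y_train_pred = knn(x_train, x_train, y_train.reshape(-1), number_of_neighbors=5)
print('Training accuracy:', find_accuracy(y_train_pred, y_train))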

Now, we will run the cross-validation five times (epochs) using 50 folds by varying k from 1 to 40:

number_of_folds = 50
epoch = 5
acc_epoch = []
k_range = 40

for value in range(epoch):
    for number_of_neighbors in range(k_range):
        print("Epoch=",value,"k=",number_of_neighbors)
        x_c,y_c,acc_list,ave_acc = cross_validation(x_train,y_train,number_of_folds,number_of_neighbors+1) # range starts at 0, so k runs from 1 to 40
        acc_epoch.append(ave_acc)
        
def acc_reshape(acc_epoch,k_range):
    # Reshape the flat accuracy list into (epoch, k_range) and compute
    # the per-k average and standard deviation across epochs
    acc = np.asarray(acc_epoch)
    acc = acc.reshape((-1,k_range))
    average = np.average(acc,axis=0)
    std = np.std(acc,axis=0)
    a = np.arange(k_range) # x-axis positions (actual k = position + 1)
    aa = np.array([a]*epoch) # x-coordinates repeated for every epoch
    return acc,aa,a,average,std
  
acc,aa,a,average,std = acc_reshape(acc_epoch,k_range)
  
# Therefore, we can plot the scattering with 3 colors:
resol = 600
fig = plt.figure(figsize=(9,7))
plt1=plt.scatter(aa,acc,s=150,alpha=0.4,color='blue')
plt2=plt.scatter(a,average,s=150,alpha=.7,color='orange')
plt.plot(a,average,color='black')
plt.title('Breast Cancer Biopsy - Prediction 5 Epochs',fontsize=20)
plt.xlabel('k First Neighbors',size=20)
plt.ylabel('Accuracy',size=20)
plt.legend((plt1,plt2),
           ('Distribution','Average'),
           scatterpoints=1,
           ncol=2,
           fontsize=15,
           loc='best')
plt.show()
fig.savefig('prediction_20_biopsy.png',dpi=resol)

After reshaping the outputs from the loops, we can finally plot the distribution.

The average accuracy does not vary extensively with k (orange points), indicating stability. The results could be made more robust by increasing the number of epochs, which expands the set of random samples per loop.
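To back this stability claim with numbers, we can print the per-k mean and standard deviation already computed by acc_reshape(); a small sketch:

# The per-k spread across epochs is already returned by acc_reshape();
# printing it quantifies the stability claim (k = index + 1, since the
# cross-validation loop above starts at k = 1).
for k, (m, s) in enumerate(zip(average, std), start=1):
    print(f'k={k:2d}  mean accuracy={m:.3f}  std={s:.3f}')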

How can we use the Confusion Matrix and ROC curves?

The confusion matrix is a 2×2 table whose elements correspond to the following outcomes. Note that in this example we treat benign as the positive class and malignant as the negative class, matching the encoding used in confusion_table() below:

  • True Positive (TP): the model correctly predicts benign (Prediction = Benign and Real = Benign)

  • False Positive (FP): the model incorrectly predicts benign instead of malignant (Prediction = Benign and Real = Malignant)

  • True Negative (TN): the model correctly predicts malignant (Prediction = Malignant and Real = Malignant)

  • False Negative (FN): the model incorrectly predicts malignant instead of benign (Prediction = Malignant and Real = Benign)

True Positive Rate: TPR = TP / P, where P = TP + FN is the total number of positives.

False Positive Rate: FPR = FP / N, where N = FP + TN is the total number of negatives.

The ROC curve is simply a plot of FPR (x-axis) versus TPR (y-axis). ROC curves are helpful for visualizing and comparing classifier performance; the idea is to choose a k for which the area under the ROC curve is maximized. The ROC (Receiver Operating Characteristic) curve is built from the predicted probabilities, and the AUC (Area Under the Curve) measures separability: how well does your model distinguish between classes 0 and 1? Maximizing the AUC improves our ability to separate patients with the disease from those without it. The best option is an AUC close to 1; if the AUC is 0.5, the classifier cannot distinguish the two classes better than chance.
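To demystify how a ROC curve is built, here is a minimal sketch that sweeps a decision threshold over predicted probabilities and collects the (FPR, TPR) points. It assumes labels y_true in {0, 1} with 1 as the positive class, matching the sklearn roc_curve usage later in this article; roc_points() is a hypothetical helper, not used elsewhere:

import numpy as np

def roc_points(y_true, scores):
    y_true = np.asarray(y_true).reshape(-1)
    scores = np.asarray(scores).reshape(-1)
    P = np.sum(y_true == 1)  # total positives
    N = np.sum(y_true == 0)  # total negatives
    points = []
    for t in np.sort(np.unique(scores))[::-1]:  # thresholds, high to low
        y_pred = (scores >= t).astype(int)
        tpr = np.sum((y_pred == 1) & (y_true == 1)) / P
        fpr = np.sum((y_pred == 1) & (y_true == 0)) / N
        points.append((fpr, tpr))
    return points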

You can also read more information about ROC curves in the following link:

https://people.inf.elte.hu/kiss/11dwhdm/roc.pdf

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html

We have also built the function confusion_table() which is used to compute TP, FP, TN, and FN:

# Brute-force binary confusion table ;)
# Here the positive class is benign (0) and the negative class is
# malignant (1), so pred == 1 and real == 1 counts as a True Negative.
def confusion_table(y_pred,y_test):
    
    TP = FN = FP = TN = 0
    for idx in range(y_test.shape[0]):
        if(y_pred[idx]==1 and y_test[idx]==1):
            TN += 1
        elif(y_pred[idx]==0 and y_test[idx]==1):
            FP += 1
        elif(y_pred[idx]==1 and y_test[idx]==0):
            FN += 1
        elif(y_pred[idx]==0 and y_test[idx]==0):
            TP += 1
    confusion = np.array([[TP,FN],[FP,TN]])
    return confusion
confusion_report = confusion_table(y_pred,y_test)
print(confusion_report)
<Out>: [[243   6]
        [  6 128]]

We can see that out of 134 patients affected by malignant breast cancer, 128 were correctly diagnosed: 128/134 ≈ 0.96 (the recall for the malignant class, matching the sklearn report below).

See ISL p. 148–149 for definitions of TPR and other classification diagnostics.
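As a sanity check, we can recompute these rates directly from the printed confusion table (a small sketch using the values above):

# Recomputing the rates from the table printed above, remembering that
# benign (0) is the positive class in confusion_table():
TP, FN = 243, 6
FP, TN = 6, 128
print('TPR =', TP / (TP + FN))                 # 243/249 ≈ 0.976 (benign correctly found)
print('FPR =', FP / (FP + TN))                 # 6/134   ≈ 0.045
print('Recall (malignant) =', TN / (TN + FP))  # 128/134 ≈ 0.955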

You can also generate a report by using the sklearn.metrics module from the sklearn package:

#Input:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

confusion_matrix_result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix Result:")
print(confusion_matrix_result)
classification_report_result = classification_report(y_test, y_pred)
print("Classification Report:")
print(classification_report_result)
print("Accuracy:", accuracy)

#Output:

Confusion Matrix Result:
[[243   6]
 [  6 128]]
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       249
           1       0.96      0.96      0.96       134

    accuracy                           0.97       383
   macro avg       0.97      0.97      0.97       383
weighted avg       0.97      0.97      0.97       383

Accuracy: 0.9686684073107049

How can we predict the misclassification for different k’s?

We define the misclassification rate as the average number of predictions that differ from the real output vector y_test (i.e., 1 − accuracy):

np.mean(knn_forecast != y_test)

Then, we will compute both accuracy and misclassification for k = 1, ..., 100, as follows:

maxiter = 100
misclass=[]
accuracy_knn = []
for i in range(maxiter):
    knn_forecast = knn(x_test,x_train,y_train,i+1)
    misclass.append(np.mean(knn_forecast != y_test))
    accuracy_knn.append(find_accuracy(y_test,knn_forecast))
    
K = np.linspace(1,maxiter,maxiter)
resol=600
fig = plt.figure(figsize=(15,12))
plt.suptitle('Biopsy Breast Cancer - k = 1, ..., 100')
plt.subplot(2, 2,1)
plt.title('Misclassification x k', fontsize=20)
plt.scatter(K,misclass,color = 'r',alpha=.4,s=50,label = 'Data')
plt.plot(K,misclass,color = 'r',label = 'Data')
plt.xlabel('number of neighbors k',fontsize=15)
plt.ylabel('Misclassification',fontsize=15)
plt.xlim(0,maxiter+1)
plt.subplot(2, 2,2)
plt.title('Accuracy x k', fontsize=20)
plt.scatter(K,accuracy_knn,color = 'b',alpha=.4,s=50,label = 'Data')
plt.plot(K,accuracy_knn,color = 'b',label = 'Data')
plt.xlabel('number of neighbors k',fontsize=15)
plt.ylabel('Accuracy',fontsize=15)
plt.xlim(0,maxiter+1)
plt.tight_layout()
plt.show()
fig.savefig('prediction_misclassification_biopsy.png',dpi=resol)

How would the same system perform if you chose to train on only 20% of the whole dataset? I will leave that question for you to experiment with.

Now we proceed to the k-NN regressor …

So far, we have explored classification by taking the most common label among the k first neighbors as the output. However, we can also solve a regression problem on the k first neighbors. For this part, we will use the sklearn package to estimate the error (RMSE) on x_test and x_train and to plot the ROC curve.
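Before switching to sklearn, note that turning our from-scratch classifier into a regressor only requires replacing the majority vote in predict() with an average of the neighbours' labels. A minimal sketch, where predict_regression() is a hypothetical helper built on the earlier find_neighbors():

def predict_regression(data_test, data_train, y_train, number_of_neighbors):
    # y_train is assumed to be a flat vector (e.g. y_train.reshape(-1))
    _, y_neighbors = find_neighbors(data_test, data_train, y_train, number_of_neighbors)
    return float(np.mean([float(y) for y in y_neighbors]))  # average instead of majority vote

# Usage sketch: for a binary target, the average of the neighbours'
# labels is a probability-like score between 0 and 1.
print(predict_regression(x_test[0], x_train, y_train.reshape(-1), 5))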

The following code estimates the error of the k-NN regression for k = 1 up to k = 100, saving the RMSE on the training and testing datasets at each iteration:

# Import required packages
from sklearn.neighbors import KNeighborsClassifier,KNeighborsRegressor
from sklearn.metrics import mean_squared_error 
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

# Calculate the root mean squared error (RMSE)
def error_(a,b):
    error = np.sqrt(np.average((a-b)**2))
    return error
  
rmse_val = [] #to store rmse values for different k
rmse_val_x_train = []
max_iter=100
for k_value in range(max_iter):
    model = KNeighborsRegressor(n_neighbors = k_value+1)
    model.fit(x_train, y_train)  
    pred_test=model.predict(x_test) 
    pred_train = model.predict(x_train)
    rmse = error_(y_test,pred_test)
    rmse_x_train = error_(y_train,pred_train)#calculate rmse
    rmse_val.append(rmse)
    rmse_val_x_train.append(rmse_x_train) #store rmse values
    print('rmse: k= ' , k_value , 'Prediction:', rmse, 'Training:', rmse_x_train)
    
fig = plt.figure(figsize=(8,5))
plt.plot(np.arange(len(rmse_val)),rmse_val,'orange')
plt.plot(np.arange(len(rmse_val_x_train)),rmse_val_x_train,'blue')
plt.scatter(np.arange(len(rmse_val)),rmse_val,color='orange',alpha=.3,s=150)
plt.scatter(np.arange(len(rmse_val_x_train)),rmse_val_x_train,color='blue',alpha=.3,s=150)
plt.xlabel('number of neighbors k',fontsize=15)
plt.ylabel('RMSE',fontsize=15)
plt.title('RMSE x k (Regression Model)',fontsize=15)
plt.legend(('x_test','x_train'),
           scatterpoints=1,
           ncol=2,
           fontsize=15,
           loc='lower right')
plt.xlim(0,100)
fig.savefig('regression_model_accuracy.png',dpi=resol)
# Our model (repr of the last fitted estimator, n_neighbors=100):
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=100, p=2,
                    weights='uniform')

Additionally, we can plot the RMSE on the training and testing datasets, as shown in the following figure:

This figure shows that the Root Mean Square Error (RMSE) is optimal (minimum) for k=5.
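We can confirm this minimum programmatically from the stored rmse_val list; a small sketch:

# A quick check of the minimum from the stored values
# (rmse_val[i] corresponds to k = i + 1, since the loop starts at k = 1).
best_k = int(np.argmin(rmse_val)) + 1
print('Best k (minimum test RMSE):', best_k, 'RMSE =', min(rmse_val))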

Scikit-learn Documentation:

https://scikit-learn.org/stable/modules/classes.html

But tell me something about the ROC curve …

We will use the sklearn classifier KNeighborsClassifier to create the model and predict the probabilities for x_test. After this procedure, we can plot the ROC curve:

knn_clf = KNeighborsClassifier(n_neighbors = 5) # Named knn_clf to avoid shadowing our knn() function
knn_clf.fit(x_train,np.reshape(y_train,(-1)))
scores = knn_clf.predict_proba(x_test)
fpr, tpr, thresholds = roc_curve(y_test,scores[:, 1])
roc_auc = auc(fpr, tpr)

fig = plt.figure(figsize=(8,5))
plt.ylabel('True Positive Rate (TPR)',fontsize=15)
plt.xlabel('False Positive Rate (FPR)',fontsize=15)
plt.title('ROC Curve using k-NN for k=5',fontsize=20)
plt.plot(fpr, tpr, 'orange', label = 'AUC = %0.2f' % roc_auc)
plt.scatter(fpr, tpr, color='orange',alpha=.5,s=150)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')

plt.show()
fig.savefig('regression_ROC_curve_knn.png',dpi=resol)

Our ROC curve is a plot of TPR versus FPR. We have found an Area Under the Curve (AUC) of 0.98, which summarizes the classifier’s performance over all possible thresholds for k=5 first neighbors. The red dashed diagonal marks AUC = 0.5 over the interval [0, 1].
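As a sanity check, the AUC reported by auc() can be reproduced by integrating the ROC points with the trapezoidal rule; a small sketch using the fpr and tpr arrays from above:

# The AUC is the area under the (fpr, tpr) points, which we can also
# compute by hand with the trapezoidal rule.
order = np.argsort(fpr)
manual_auc = np.sum(np.diff(fpr[order]) * (tpr[order][1:] + tpr[order][:-1]) / 2)
print('AUC (trapezoidal rule):', manual_auc)  # should match roc_auc ≈ 0.98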

Well, it’s almost over. Let’s recapitulate how far you have gone!

You have worked with functions built from scratch, plus the sklearn package, to classify and predict the breast cancer state of the patients in x_test as malignant or benign. We have covered the k-fold cross-validation procedure, k-NN classification and regression, the Confusion Matrix, and tuning procedures such as checking the misclassification for different k and the ROC curve. We have found that k=5 is an excellent parameter for performing predictions on this dataset.

I hope you have enjoyed the k-NN series. These brief articles are free samples of the upcoming Machine Learning and Deep Learning courses. Thank you, and please do not hesitate to provide feedback!

Did you enjoy this article? Subscribe to our Newsletter.

You will receive notifications with fresh content, new courses and further activities.

