Last Update: 02 Jan 2022
This article explores applications of statistical learning techniques on the Biopsy Data on Breast Cancer Patients dataset. We will dive into linear methods such as Logistic Regression (LR), Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA), and use the sklearn library to predict whether patients have developed a "malignant" or "benign" breast tumor. The main idea is to demonstrate the potential of simple statistical learning approaches on real-world datasets.
We have also explored the same dataset in a sequential series of articles: k-NN algorithm from Scratch: Part I, Part II, Part III, and Part IV.
What is the Biopsy Data on Breast Cancer Patients?
Dr. William H. Wolberg obtained the breast cancer database at the University of Wisconsin Hospitals. It contains biopsies of breast tumors collected from 699 patients up to July 1992. Each biopsy includes nine features (attributes) scored on a scale of 1 to 10, and the outcome is either a malignant or a benign breast tumor. Therefore, we have data from 699 patients (rows) and 11 columns. You can download the dataset below:
Biopsy Data on Breast Cancer Patients (Github)
This dataset contains the following columns:
ID: sample code number (not unique).
V1: clump thickness.
V2: uniformity of cell size.
V3: uniformity of cell shape.
V4: marginal adhesion.
V5: single epithelial cell size.
V6: bare nuclei (16 values are missing).
V7: bland chromatin.
V8: normal nucleoli.
V9: mitoses.
class: "benign" or "malignant".
Our first step is to import the libraries needed for our experiment and load the dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
%matplotlib inline
np.random.seed(2)  # fix the random seed for reproducibility
# '?' marks missing values; drop incomplete rows and reset the index
biopsy = pd.read_csv('Data/biopsy.csv', na_values='?', dtype={'ID': str}).dropna().reset_index()
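As a quick sanity check (assuming the CSV columns are as described above), we can confirm the 16 missing values in V6 before they are dropped:
raw = pd.read_csv('Data/biopsy.csv', na_values='?', dtype={'ID': str})  # load without dropping rows
print(raw.isna().sum())           # V6 should report 16 missing entries
print(raw.shape, biopsy.shape)    # expect (699, 11) before vs. (683, 12) after dropna() + reset_index()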
This is a classification problem: the target is categorical, so we must map the qualitative labels to a binary representation, e.g., malignant = 0 and benign = 1, to represent a patient's health state.
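As a minimal illustration (sklearn can also handle the string labels directly, which is what we do later), this mapping could look like:
label_map = {'malignant': 0, 'benign': 1}   # illustrative encoding only
y_binary = biopsy['class'].map(label_map)
print(y_binary.value_counts())              # distribution of benign (1) vs. malignant (0) cases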
Examples of classification problems appear in many domains. In medical diagnosis, one can predict a patient's future condition from previous symptoms and health state. In banking, it is possible to flag fraud using location, historical transactions, and the frequency of certain transaction categories. In advertisement, a client's history, visited sites, and web search tags can be used to predict the likelihood of increased engagement and clicks, boosting advertisement revenue. In the entertainment industry, platforms such as Spotify and Netflix extract and process engineered features from songs and movies, together with customers' wishlists, to provide better recommendations.
First, we will discuss the linear and multi-linear regression models, which are straightforward and very useful methods in many cases. Often, two quantities x and y, both directly measured, are related by a theoretical expression for y_i:
y_i = \beta_0+\beta_1x_i + \epsilon_i
such that \epsilon_i \sim N(0,\sigma^2) is the noise. The model involves parameters that can be estimated from the observed dataset. The most common and simplest case is when a linear equation fits the quantities, where \beta_1 is the slope and \beta_0 is the intercept with the y-axis. The basic problem is to adjust the best set of parameters of the linear equation to the experimental/ground-truth data. If the pairs (x_i, y_i) were "true" values, we would be able to fit the line perfectly. Nevertheless, since x_i and y_i are subject to errors, each point's position is not precisely determined. So, instead of an ideal point, there is an ellipse with axes s_x and s_y (the standard deviations), whose centers are not expected to lie on a straight line but to be distributed on either side of it. The least-squares method states that the best line minimizes the sum of the squares of the distances from the ellipses' centers to the line. This distance is measured in some appropriate direction, which depends on the relative deviations of the x's and y's and on whether these quantities have the same physical dimension. The simplest case assumes that one of the quantities, say the variable x_i, is measured exactly, while all errors are concentrated in the y_i. This situation can be represented graphically by vertical segments centered on the points (x_i, y_i), and the deviation of y_i is then y_i-\hat{y}_i, where \hat{y}_i = \beta_0+\beta_1 x_i is the fitted value.
Anyway, our primary goal is to estimate the parameters \beta_0 and \beta_1 that minimize the residual sum of squares over the training dataset:
E =\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 =\sum_{i=1}^{n} (y_i - \beta_0 -\beta_1x_i)^2 = \sum_{i=1}^{n} y_i^2 +n\beta_0^2 + \beta_1^2 \sum_{i=1}^{n}x_i^2 - 2\beta_0\sum_{i=1}^{n}(y_i - \beta_1 x_i)- 2\beta_1\sum_{i=1}^{n} x_i y_i
The error represents the difference between the model's prediction and the ground-truth values obtained from an experiment or observation, and we need to choose the parameters that minimize E. To find them, we set the partial derivatives \frac{\partial E}{\partial \beta_0} and \frac{\partial E}{\partial \beta_1} to zero (\frac{\partial E}{\partial \beta_i}=0):
\frac{\partial E}{\partial\beta_1} = 2\beta_1 \sum_{i=1}^{n} x_i^2 + 2\beta_0 \sum_{i=1}^{n} x_i - 2\sum_{i=1}^{n} x_i y_i = 0 ~(1)\newline \frac{\partial E}{\partial\beta_0} = 2n\beta_0 - 2\sum_{i=1}^{n} y_i + 2\beta_1 \sum_{i=1}^{n} x_i = 0 \\ \rightarrow \beta_0 = n^{-1}\sum_{i=1}^{n}y_i -\beta_1 \left(n^{-1} \sum_{i=1}^{n}x_i\right) = \overline{y}-\beta_1 \overline{x}~(2)
where~\overline{x}=n^{-1}\sum_{i=1}^{n} x_i~,~\overline{y}=n^{-1}\sum_{i=1}^{n} y_i \newline
and substituting equation (2) into (1), we have:
\beta_1 \left(\sum_{i=1}^{n} x_i^2 - \overline{x}\sum_{i=1}^{n} x_i\right) = \sum_{i=1}^{n} x_i y_i - \overline{y}\sum_{i=1}^{n}x_i
\beta_1 = \frac{n\sum_{i=1}^{n}x_i y_i - \sum_{i=1}^{n}y_i \sum_{i=1}^{n}x_i}{n\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2 } = \frac{\sum_{i=1}^{n}(x_i -\overline{x})(y_i - \overline{y})}{\sum_{i=1}^{n}(x_i - \overline{x})^2}
and if we substitute \beta_1, \overline{x} and \overline{y} into (2), we obtain:
\beta_0=\frac{\sum_{i=1}^{n}x_i^2 \sum_{i=1}^{n}y_i - \sum_{i=1}^{n}x_i \sum_{i=1}^{n}x_i y_i}{n\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2 } = \frac{1}{n}\left(\sum_{i=1}^{n}y_i - \beta_1\sum_{i=1}^{n}x_i\right)
We can also construct approximate 95% confidence intervals for the estimated parameters, \beta_0[95\%] = [\beta_0 \pm 2\sigma_{\beta_0}] and \beta_1[95\%] = [\beta_1 \pm 2\sigma_{\beta_1}], where:
\sigma_{\beta_0}^2 = \sigma^2\left[ \frac{1}{n} + \frac{\overline{x}^2}{\sum_{i=1}^{n}(x_i-\overline{x})^2}\right] \\ \sigma_{\beta_1}^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\overline{x})^2}
where \sigma^2 is the variance of the noise term \epsilon_i.
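As a minimal sketch of these closed-form estimates (on synthetic data, since our biopsy problem is a classification task), the formulas above translate directly into NumPy (np is the alias imported earlier):
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=x.size)   # y = beta0 + beta1*x + noise
n = x.size
x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # slope estimate
beta0 = y_bar - beta1 * x_bar                                          # intercept estimate
residuals = y - (beta0 + beta1 * x)
sigma2 = np.sum(residuals ** 2) / (n - 2)                              # residual variance estimate
var_beta1 = sigma2 / np.sum((x - x_bar) ** 2)
var_beta0 = sigma2 * (1 / n + x_bar ** 2 / np.sum((x - x_bar) ** 2))
print(beta0, beta1)
print(beta0 - 2 * np.sqrt(var_beta0), beta0 + 2 * np.sqrt(var_beta0))  # ~95% interval for beta0
print(beta1 - 2 * np.sqrt(var_beta1), beta1 + 2 * np.sqrt(var_beta1))  # ~95% interval for beta1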
If you want to go deeper, please follow references [1] and [2]. In our course, we will also discuss hypothesis tests and the interpretation of linear models. Since we are moving towards logistic regression, it is worth mentioning that when we have more than one feature, as is the case with our dataset, it is more reasonable to work with multi-linear regression. The new model then reads:
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p} + \epsilon_{i}
where p is the number of features and i indexes the training data points. The model can also be written in a matrix formulation, with a feature vector X_i=[x_{i,1},x_{i,2},\dots,x_{i,p}] for each data point.
Furthermore, we can express the vector Y as:
Y = \beta_0 + \beta_1 X_{1} + \beta_2 X_{2} + \dots + \beta_p X_{p} + \epsilon
The next step is to apply the least-squares approach, where:
E(\beta) = \sum_{i=1}^{n}(y_i - \hat{y}_i(\beta))^2 = \sum_{i=1}^{n}(y_i -\beta_0 - \beta_1 x_{i,1}-\beta_2 x_{i,2} - \dots - \beta_p x_{i,p})^2 \\ = ||Y-X\beta||^2_2, ~where~ \beta = (\beta_0, \beta_1,\dots,\beta_p)
However, we do not know whether the variables X_i are actually useful for predicting the outcome Y, nor how many variables it is reasonable to use. Furthermore, how accurate are these predictions? These nuances will be discussed in our machine learning course.
Let us consider the matrix formulation of our multi-linear regression model, X\beta=Y. Multiplying both sides by X^T, we can rewrite it as the normal equations (X^TX)\beta = X^TY. Using matrix operations:
(X^TX)^{-1}(X^TX)\beta = (X^TX)^{-1}X^TY \rightarrow (X^TX)^{-1}(X^TX) = \mathbb{1} \\ \beta = (X^TX)^{-1}X^TY
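As a minimal sketch of this closed-form solution (reusing the synthetic x and y from the simple-regression example above, and assuming the design matrix X carries a leading column of ones for the intercept), the normal equations can be solved directly with NumPy:
X = np.column_stack([np.ones_like(x), x])            # design matrix with an intercept column
beta = np.linalg.solve(X.T @ X, X.T @ y)             # solves (X^T X) beta = X^T y without forming the inverse
print(beta)                                          # should closely match (beta0, beta1) computed earlier
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # equivalent, numerically more robust least-squares solver
print(beta_lstsq)
In practice, np.linalg.lstsq (or sklearn's LinearRegression) is preferred over explicitly inverting X^TX.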
In this way, we can find the parameters of the model that minimize the sum of squared errors. However, each patient has a medical statement (benign or malignant breast cancer), and by taking a probabilistic approach we can assign to each patient a probability of having a benign or malignant condition given a set of features (medical conditions). We can estimate P(Y|X) from the training dataset, and our goal is to predict whether a patient has cancer: given an input feature vector X_i=[x_{i,1},x_{i,2},\dots,x_{i,p}] for patient i, we need to predict the response Y_i:
\hat {y} = argmax_{y} ~\hat {P_i}(Y=y | X=X_i)
and since Y is a binary variable (benign or malignant), a first attempt is to model the probability with a linear function:
\hat {P_i}(Y=1 | X=X_i) = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \dots + \beta_p X_{i,p}
The main problem is that this model allows the probability to be lower than 0 or higher than 1, and, when each class is modeled with its own linear function, in general \hat {P}(Y=0 | X=X_i) + \hat {P}(Y=1 | X=X_i) \neq 1. Furthermore, it would be extremely hard to extend this model beyond two classes. Instead, we can model the conditional probability as:
\hat {P_i}(Y=1 | X=X_i) = \frac{e^{\beta_0 + \beta_1 X_{i,1}+\beta_2 X_{i,2}+\dots+\beta_p X_{i,p}}}{1+e^{\beta_0 + \beta_1 X_{i,1}+\beta_2 X_{i,2}+\dots+\beta_p X_{i,p}}} = \frac{e^{\sum_{j=0}^p \beta_j X_{i,j}}}{1+e^{\sum_{j=0}^p \beta_j X_{i,j}}},~where~X_{i,0}=1
and since P(Y=1|X) + P(Y=0|X) = 1, therefore:
\hat {P_i}(Y=0 | X=X_i) = 1-\frac{e^{\beta_0 + \beta_1 X_{i,1}+\beta_2 X_{i,2}+\dots+\beta_p X_{i,p}}}{1+e^{\beta_0 + \beta_1 X_{i,1}+\beta_2 X_{i,2}+\dots+\beta_p X_{i,p}}} =\frac{1}{1+e^{\beta_0 + \beta_1 X_{i,1}+\beta_2 X_{i,2}+\dots+\beta_p X_{i,p}}}= \frac{1}{1+e^{\sum_{j=0}^p \beta_j X_{i,j}}}
which is equivalent to taking the logarithm of the odds ratio:
\log \left[\frac{P_i(Y=1|X=X_i)}{P_i(Y=0|X=X_i)} \right] = \beta_0 + \beta_1 X_{i,1}+\beta_2 X_{i,2}+\dots+\beta_p X_{i,p}
Now consider a training dataset given as a list of pairs (X_1,y_1),(X_2,y_2),\dots,(X_n,y_n). The probability associated with the whole dataset is represented by the likelihood, and our goal is to find the set of parameters \beta_0,\beta_1,\dots,\beta_p that maximizes it. The likelihood L(\beta) can be expressed as:
L(\beta) = \prod_{i=1}^{n} P(Y=y_i|X=X_i) = \prod_{i=1}^{n} \left[ \frac{e^{\beta_0 + \beta_1 X_{i,1}+\beta_2 X_{i,2}+\dots+\beta_p X_{i,p}}}{1+e^{\beta_0 + \beta_1 X_{i,1}+\beta_2 X_{i,2}+\dots+\beta_p X_{i,p}}} \right]^{y_i} \left[ \frac{1}{1+e^{\beta_0 + \beta_1 X_{i,1}+\beta_2 X_{i,2}+\dots+\beta_p X_{i,p}}} \right]^{1-y_i}
The exponents act as a selector: when y_i = 1, the i-th factor reduces to P(Y=1|X=X_i)^1 \, P(Y=0|X=X_i)^0 = P(Y=1|X=X_i), and when y_i = 0 it reduces to P(Y=0|X=X_i). In our machine learning course, you will learn how to solve this maximization problem from scratch using Newton's algorithm. For now, we will use the sklearn library to find the parameters that best fit our training data. Then we will use these parameters to predict the classes of a test set of patients and compare the results with the ground truth to estimate the accuracy of our model. Let's start!
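Before we do, here is a minimal sketch (a hypothetical helper, not part of the pipeline below) of the log-likelihood that Newton's method or sklearn's solvers maximize, written with NumPy:
def log_likelihood(beta, X, y):
    # X: (n, p+1) design matrix whose first column is ones (intercept),
    # y: (n,) array of 0/1 labels, beta: (p+1,) parameter vector
    z = X @ beta                        # linear predictor beta_0 + beta_1*x_1 + ... + beta_p*x_p
    log_p1 = z - np.logaddexp(0.0, z)   # log P(Y=1|X) = z - log(1 + e^z), numerically stable
    log_p0 = -np.logaddexp(0.0, z)      # log P(Y=0|X) = -log(1 + e^z)
    return np.sum(y * log_p1 + (1 - y) * log_p0)
Maximizing this function over beta (equivalently, minimizing its negative) is exactly the fitting step that sklearn performs for us below.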
We have the index, the ID, the 9 features in biopsy.iloc[:, 2:11], and the classification as benign or malignant.
print(biopsy.columns.values)
['index' 'ID' 'V1' 'V2' 'V3' 'V4' 'V5' 'V6' 'V7' 'V8' 'V9' 'class']
print(biopsy) # This prints the whole dataset
index ID V1 V2 V3 V4 V5 V6 V7 V8 V9 class
0 0 1000025 5 1 1 1 2 1.0 3 1 1 benign
1 1 1002945 5 4 4 5 7 10.0 3 2 1 benign
2 2 1015425 3 1 1 1 2 2.0 3 1 1 benign
3 3 1016277 6 8 8 1 3 4.0 3 7 1 benign
4 4 1017023 4 1 1 3 2 1.0 3 1 1 benign
.. ... ... .. .. .. .. .. ... .. .. .. ...
678 694 776715 3 1 1 1 3 2.0 1 1 1 benign
679 695 841769 2 1 1 1 2 1.0 1 1 1 benign
680 696 888820 5 10 10 3 7 3.0 8 10 2 malignant
681 697 897471 4 8 6 4 3 4.0 10 6 1 malignant
682 698 897471 4 8 8 5 4 5.0 10 4 1 malignant
[683 rows x 12 columns]
biopsy.head() #check the first 5 patients
index ID V1 V2 V3 V4 V5 V6 V7 V8 V9 class
0 0 1000025 5 1 1 1 2 1.0 3 1 1 benign
1 1 1002945 5 4 4 5 7 10.0 3 2 1 benign
2 2 1015425 3 1 1 1 2 2.0 3 1 1 benign
3 3 1016277 6 8 8 1 3 4.0 3 7 1 benign
4 4 1017023 4 1 1 3 2 1.0 3 1 1 benign
We can also double-check the data types of our features using the method <object>.info(), and store the features from columns 2 to 10 in the biopsy_features DataFrame using <object>.iloc[:, 2:11]:
biopsy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 683 entries, 0 to 682
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 683 non-null int64
1 ID 683 non-null object
2 V1 683 non-null int64
3 V2 683 non-null int64
4 V3 683 non-null int64
5 V4 683 non-null int64
6 V5 683 non-null int64
7 V6 683 non-null float64
8 V7 683 non-null int64
9 V8 683 non-null int64
10 V9 683 non-null int64
11 class 683 non-null object
dtypes: float64(1), int64(9), object(2)
memory usage: 64.2+ KB
biopsy_features = biopsy.iloc[:, 2:11]
We can also visualize the pairwise scattering of the features. The plot below shows the correlation between the variables, with a histogram of each feature on the diagonal.
resol=300
pd.plotting.scatter_matrix(biopsy_features,figsize=(15,12),s=20, marker = 'D',color="b",alpha=.3)
plt.savefig('scatter_matrix.png',dpi=resol)
plt.show()

In the next step, we randomly select rows to split our data into training and testing datasets. This can be accomplished with np.random.choice(<number_of_rows>, size=<sample_size>, replace=False), which draws indices without replacement. In our case, we generate an array of 200 random indices from the biopsy dataset.
randomIndex = np.random.choice(biopsy.shape[0],size=200,replace=False)
print(randomIndex)
[107 204 631 554 297 128 77 614 602 194 22 655 1 236 342 373 441 15
656 108 439 30 451 199 642 89 203 84 141 632 595 345 335 67 37 673
589 402 548 18 547 398 276 383 272 681 321 621 372 375 459 400 246 426
448 484 409 116 653 467 68 512 445 268 594 172 142 344 55 559 338 648
334 80 192 318 379 432 232 244 504 531 275 606 227 309 109 411 212 235
492 330 438 377 393 280 129 343 540 583 171 32 265 437 354 553 499 619
651 416 16 646 464 479 10 157 3 389 62 423 542 13 536 161 609 363
40 180 65 72 248 205 113 314 447 519 143 514 333 308 130 120 480 206
158 182 508 12 311 267 419 183 101 123 213 152 397 489 661 292 294 231
176 209 590 134 239 582 139 198 486 60 502 312 165 629 221 517 597 225
251 331 131 58 369 347 465 160 647 495 258 146 53 136 395 197 381 524
365 663]
Now we can use <object>.index.isin(<array_of_indices>) to generate a boolean array by checking which index values of the biopsy DataFrame appear in randomIndex. The randomly selected indices will be True in this new array, and they will compose our training dataset. This is an interesting way to split a dataset by hand.
trainIndex = biopsy.index.isin(randomIndex)
print(trainIndex[:10]) # Print the first 10 elements of the array
[False True False True False False False False False False]
biopsy.iloc[0:4] # Display the first 4 rows of biopsy.
index ID V1 V2 V3 V4 V5 V6 V7 V8 V9 class
0 0 1000025 5 1 1 1 2 1.0 3 1 1 benign
1 1 1002945 5 4 4 5 7 10.0 3 2 1 benign
2 2 1015425 3 1 1 1 2 2.0 3 1 1 benign
3 3 1016277 6 8 8 1 3 4.0 3 7 1 benign
# Now we can generate a new table by filtering only the indexes for the training dataset.
train = biopsy.iloc[trainIndex]
train.index # show the index
Int64Index([ 1, 3, 10, 12, 13, 15, 16, 18, 22, 30,
...
647, 648, 651, 653, 655, 656, 661, 663, 673, 681],
dtype='int64', length=200)
train[:10] # shows the first ten elements of the training dataset
index ID V1 V2 V3 V4 V5 V6 V7 V8 V9 class
1 1 1002945 5 4 4 5 7 10.0 3 2 1 benign
3 3 1016277 6 8 8 1 3 4.0 3 7 1 benign
10 10 1035283 1 1 1 1 1 1.0 3 1 1 benign
12 12 1041801 5 3 3 3 2 3.0 4 4 1 malignant
13 13 1043999 1 1 1 1 2 3.0 3 1 1 benign
15 15 1047630 7 4 6 4 6 1.0 4 3 1 malignant
16 16 1048672 4 1 1 1 2 1.0 2 1 1 benign
18 18 1050670 10 7 7 6 4 10.0 4 1 2 malignant
22 22 1056784 3 1 1 1 2 1.0 2 1 1 benign
30 31 1071760 2 1 1 1 2 1.0 3 1 1 benign
The testing dataset is generated with ~trainIndex. The ~ is a NOT operator: it flips True to False and False to True, so the test set is composed of the rows whose indices are not present in the training dataset.
test = biopsy.iloc[~trainIndex]
test.index
Int64Index([ 0, 2, 4, 5, 6, 7, 8, 9, 11, 14,
...
671, 672, 674, 675, 676, 677, 678, 679, 680, 682],
dtype='int64', length=483)
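As a side note, this kind of split is more commonly done with sklearn's train_test_split; a rough equivalent of the manual procedure above (the specific rows selected will differ) would be:
from sklearn.model_selection import train_test_split
train_alt, test_alt = train_test_split(biopsy, train_size=200, random_state=2)  # 200 training rows, rest for testing
print(train_alt.shape, test_alt.shape)  # expect (200, 12) and (483, 12)
We keep the manual split here to make the indexing mechanics explicit.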
Now we can build the model using the sklearn package. We will use the linear_model.LogisticRegression class to estimate the parameters of our model; if you want to know more, please check the sklearn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
We instantiate the model and fit it to the training data:
logistic_model = linear_model.LogisticRegression(C=1.0, class_weight=None, dual=False,
                                                 fit_intercept=True, intercept_scaling=1,
                                                 max_iter=100, multi_class='ovr', n_jobs=1,
                                                 penalty='l2', random_state=None,
                                                 solver='liblinear', tol=0.0001,
                                                 verbose=0, warm_start=False)
# Define the first three features to train the model
X_train = train[['V1','V2','V3']]
Y_train = train[['class']]
X_test = test[['V1', 'V2', 'V3']]
Y_test = test['class']
Y_train.iloc[0:10]
class
1 benign
3 benign
10 benign
12 malignant
13 benign
15 malignant
16 benign
18 malignant
22 benign
30 benign
logistic_model.fit(X_train,Y_train.values.ravel())
LogisticRegression(multi_class='ovr', n_jobs=1, solver='liblinear')
The next step is to compute the probability matrix, which has 2 columns: the first column is P(Y=benign) and the second is P(Y=malignant) (the column order follows logistic_model.classes_). We then threshold the probabilities, assigning 'benign' when P(Y=benign) >= 0.5 and 'malignant' otherwise. Finally, we compute the confusion matrix and the accuracy of our model:
predict_prob = logistic_model.predict_proba(X_test)
predict_prob.shape # shape of predicted probability
(483, 2)
print('Predicted probabilities for the classes:')
predict_prob[0:10]
Predicted probabilities for the classes:
array([[0.87363213, 0.12636787],
[0.92862772, 0.07137228],
[0.90461864, 0.09538136],
[0.00366983, 0.99633017],
[0.96076402, 0.03923598],
[0.9268389 , 0.0731611 ],
[0.94694774, 0.05305226],
[0.86534144, 0.13465856],
[0.94694774, 0.05305226],
[0.06169136, 0.93830864]])
prediction = np.where(predict_prob[:, 0] >= 0.5, 'benign', 'malignant')  # threshold at 0.5
prediction[0:5] # display the first five predictions
array(['benign', 'benign', 'benign', 'malignant', 'benign'], dtype='<U9')
print(pd.crosstab(prediction,Y_test)) #Confusion Matrix
class benign malignant
row_0
benign 317 14
malignant 4 148
acc_lr=np.mean(prediction == Y_test) #Compute Accuracy
print(acc_lr)
0.9627329192546584
The Logistic Regression model achieves about 96% accuracy on the test set of this specific split!
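The same numbers can also be obtained with sklearn's metrics module, which is a handy cross-check of the manual computation above:
from sklearn.metrics import confusion_matrix, accuracy_score
print(confusion_matrix(Y_test, prediction))  # rows: true class, columns: predicted class
print(accuracy_score(Y_test, prediction))    # should match acc_lr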
What happens if we run a loop that computes the accuracy not only with the features V1, V2, and V3, but with the nested subsets {V1}, {V1, V2}, {V1, V2, V3}, and so on up to {V1, V2, ..., V9}?
# First create a dictionary
features = {"V1":0,"V2":1,"V3":2,"V4":3,"V5":4,"V6":5,"V7":6,"V8":7,"V9":8}
X_list = [] # [V1], [V1,V2], ... , [V1,...,V9]
new_Y_train = Y_train.values.ravel()
new_Y_test = Y_test.values.ravel()
accuracy_prediction = []
for i in features:
    X_list.append(i)                       # grow the feature subset by one column at a time
    new_X_train = train[X_list]
    new_X_test = test[X_list]
    logistic_model.fit(new_X_train, new_Y_train)
    predict_prob = logistic_model.predict_proba(new_X_test)
    new_prediction = np.where(predict_prob[:, 0] >= 0.5, 'benign', 'malignant')
    accuracy_prediction.append(np.mean(new_prediction == new_Y_test)) # Accuracy
We can also create a vector of labels to annotate the accuracy obtained at each step of the loop, as more features are included.
The following code plots Accuracy vs. Features, with each point labeled by the feature range used:
labels = []
for i in range(len(X_list)):
    labels.append("[V1-V"+str(i+1)+"]")
print(labels)
['[V1-V1]',
'[V1-V2]',
'[V1-V3]',
'[V1-V4]',
'[V1-V5]',
'[V1-V6]',
'[V1-V7]',
'[V1-V8]',
'[V1-V9]']
print(accuracy_prediction)
[0.8716356107660456,
0.9544513457556936,
0.9627329192546584,
0.9648033126293996,
0.9627329192546584,
0.9751552795031055,
0.9751552795031055,
0.9772256728778468,
0.979296066252588]
resol=600
lower_limit = -.5
upper_limit = +.5
x_axis = range(len(accuracy_prediction))
fig=plt.figure(figsize=(9,7))
plt.plot(x_axis,accuracy_prediction,color='yellow',linewidth=2)
plt.scatter(x_axis,accuracy_prediction,color='blue',marker="D",s=200,alpha=.5)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.xlabel('Features',fontsize=20)
plt.ylabel('Accuracy',fontsize=20)
plt.title('Accuracy x Features', fontsize=25)
plt.xlim(lower_limit,len(labels)+upper_limit)
for i,txt in enumerate(labels):
    plt.annotate(txt, (x_axis[i], accuracy_prediction[i]), fontsize=15)
plt.tight_layout()
fig.savefig('accuracy_lr_model.png',dpi=resol)
plt.show()

We see that it is possible to reach approximately 96% accuracy using only a few features, and around 98% accuracy with all nine features. However, many other aspects, such as the bias-variance trade-off, overfitting, and the increasing time complexity for larger datasets and feature sets, could degrade the quality of our predictions in production. The trade-off between bias and variance will be thoroughly discussed in our course, along with many tuning procedures that help secure robust forecasting and predictions. I have decided to split this article in two and cover Linear Discriminant Analysis and Quadratic Discriminant Analysis in Part II. I hope you enjoyed this article.
References:
[1] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer 2009
[2] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer 2006