 Last Update: 15 Jan 2022

#### Welcome to Part II of Predictions using Logistic Regression, LDA, and QDA with sklearn: Breast Cancer Dataset, Part II

We will explore the Quadratic Discriminant Analysis (QDA) in Part II using the sklearn library to “predict” if some of the patients have developed a “malign” or “benign” breast cancer

Dr. William H. Wolberg obtained the breast cancer database at the University of Wisconsin Hospitals. It provides biopsies of breast tumors from 699 patients until July 1992. This dataset includes nine features (attributes) scored on a scale of 1 to 10, where the outcome can be malignant or benignant breast cancer. Therefore, we have data from 699 patients (rows) and 11 columns. You can download the dataset below:

This dataset contains the following columns:

ID: sample code number (not unique). V1: clump thickness. V2: uniformity of cell size. V3: uniformity of cell shape. V4: marginal adhesion. V5: single epithelial cell size. V6: bare nuclei (16 values are missing). V7: bland chromatin. V8: normal nucleoli. V9: mitoses. class: “benign” or “malignant”.

#### Our first step is to import the libraries to run our experiment.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import *
%matplotlib inline

np.random.seed(2)
dtype={'ID': str}).dropna().reset_index()

### How to use Linear and Quadratic Discriminant Analysis to make predictions? Give me a picture of the Math behind the sklearn library!

It can be a bit complex, but try to follow these steps. In the Logistic Regression, we have determined P(Y|X). In this case, we can directly estimate the initial distribution P(X|Y), given the response Y=1 or Y=0, as well as estimate the likelihood of a category P(Y). If we use the Bayes rule, we are able to estimate P(Y = y | X = x_i):

P(Y = y | X = x_i) = \frac{P(X=x_i|Y=y)P(Y=y)}{P(X=x_i)}

If we consider the joint probability:

P(Y=y_j,X=x_i)=P(X=x_i|Y=y_j)P(Y=y_j) \newline \\ P(X=x_i) = \sum_{j=1}^{n} P(X=x_i,Y=y_j)=\sum_{j=1}^{n} P(Y=y_j,X=x_i) = \\ \sum_{j=1}^{n}P(X=x_i,Y=y_j)P(Y=y_j)

We are able to obtain a final expression:

P(Y = y | X = x_i) = \frac{P(X=x_i|Y=y_j)P(Y=y_j)}{P(X=x_i)} \\ = \frac{P(X=x_i|Y=y_j)P(Y=y_j)} {\sum_{j=1}^{n}P(X=x_i,Y=y_j)P(Y=y_j)} ~(3)

I fully recommend you to check reference , section 1.2 Probability Theory by Bishop.

The LDA method represents a subclass of the Bayes rule, where P(X=x_i|Y=y_j) = N(\mu_j,\sum), and \sum represents the covariance matrix from a multivariate normal distribution. What do we know? The probability of a patient being classified with malignant or benignant cancer using a fraction of the training dataset. This can be formulated as P(Y=y_j)=\pi_j. We need to define P(X=x_i|Y=y_j) as a multivariate normal distribution with density:

f_j(x_i) = \frac{1}{{2\pi}^{\frac{p}{2}}\vert\sum\vert^{\frac{1}{2}}}e^{-\frac{1}{2} (x_i-\mu_j)^T {\sum}^{-1}(x_i-\mu_j)}

where \mu_j represents the mean of the input values for a specific category Y=y_j. In equation (3) we can substitute the probability density f_j{x_i} and P(Y=y_j)=\pi_j for the category y_j given the input x_i:

P(Y = y_j | X = x_i) = \frac{f_j(x_i) \pi_j} {\sum_{j=1}^{n}P(X=x_i,Y=y_j)P(Y=y_j)} \\ = \frac{f_j(x_i) \pi_j} {P(X=x_i)} = C~f_j(x_i) \pi_j

Since the denominator has no dependency with the response y_j, we can substitute P(X=x_i) by a constant C. Now we can expand the previous equation and assign a new constant C’ embedding all the parameters that are independent of y_j, then:

P(Y = y_j | X = x_i) = C~f_j(x_i) \pi_j = \frac{C\pi_j}{{2\pi}^{\frac{p}{2}}\vert\sum\vert^{\frac{1}{2}}}e^{-\frac{1}{2} (x_i-\mu_j)^T \sum^{-1} (x_i-\mu_j)} = C’ \pi_j e^{-\frac{1}{2} (x_i-\mu_j)^T {\sum}^{-1} (x_i-\mu_j)}

Take a deep breath! If we apply the \log on both sides, we can rewrite the previous equation as:

\log{P(Y = y_j | X = x_i)} = \log \left[{ C’ \pi_j e^{-\frac{1}{2} (x_i-\mu_j)^T \sum^{-1} (x_i-\mu_j)}}\right] = \log{C’} + \log{\pi_j} + [-\frac{1}{2} (x_i-\mu_j)^T {\sum}^{-1} (x_i-\mu_j)] \\ = \log{C’} – \frac{1}{2}x_i^T {\sum}^{-1}x_i + \log{\pi_j} – \frac{1}{2}\mu_j^T {\sum}^{-1}\mu_j + x_i^T {\sum}^{-1}\mu_j

where \log{C’} is independent of the class y_j.  We also need to maximize the terms that are dependent of y_j. This can be solved by assigning a constant C” to the first two terms C”=\log{C’} – \frac{1}{2}x_i^T {\sum}^{-1}x_i , which allows us to rewrite the previous equation as:

\log{P(Y = y_j | X = x_i)} = C” + \log{\pi_j} – \frac{1}{2}\mu_j^T {\sum}^{-1}\mu_j + x_i^T {\sum}^{-1}\mu_j = C” +\delta_j(x_i)

where \delta_j(x_i) is also known as the discriminant term that needs to be maximized. Therefore, our goal is to predict the highest value for the discriminant term \delta_j(x_i) (response), given the conditions expressed by a feature vector that represent the patient x_i.

More on Multivariate Gaussians and Covariance Matrices on references [1-3].

#### How to compute \mu_j ?

\mu_j is the fraction of training data (K) that belongs to the class j, which can be benign or malignant. We can estimate \mu_j through the following expression:

\pi_j = \frac{K}{n}

and the parameters of f_j(x_i) can be estimated by considering that \mu_j represents the center of each class.:

\mu_j = \frac{1}{K}\sum_{i;y_i=y_j}^{K} x_i

The sklearn library has internal procedures to estimate the covariance matrix \sum. We will use the sklearn library to assess the accuracy of our predictions. We will start with nine features for our training and testing dataset, and create a model with sklearn.discriminant_analysis.LinearDiscriminantAnalysis(). The training dataset is represented by X_train with its respective classes assigned to Y_train.

X_test = test[['V1', 'V2', 'V3','V4','V5','V6','V7','V8','V9']]
X_train = train[['V1', 'V2', 'V3','V4','V5','V6','V7','V8','V9']]

lda_model = discriminant_analysis.LinearDiscriminantAnalysis(solver='svd', store_covariance=False, tol=0.0001)

print(X_train)

V1  V2  V3  V4  V5    V6  V7  V8  V9
1     5   4   4   5   7  10.0   3   2   1
3     6   8   8   1   3   4.0   3   7   1
10    1   1   1   1   1   1.0   3   1   1
12    5   3   3   3   2   3.0   4   4   1
13    1   1   1   1   2   3.0   3   1   1
..   ..  ..  ..  ..  ..   ...  ..  ..  ..
656   2   1   1   1   2   1.0   3   1   1
661   5   1   1   1   2   1.0   1   1   1
663   2   1   1   1   2   1.0   1   1   1
673   1   1   1   1   2   1.0   1   1   8
681   4   8   6   4   3   4.0  10   6   1

[200 rows x 9 columns]


The next step is to fit the model, predict the classes using the testing dataset X_test, and compute the accuracy of our model:

lda_model.fit(X_train,Y_train.values.ravel())
lda_prediction = lda_model.predict_proba(X_test)

# Print the first 10 predicted probabilities:
with np.printoptions(precision=3, suppress=True):
print(lda_prediction[:10])

[[1.    0.   ]
[1.    0.   ]
[1.    0.   ]
[0.    1.   ]
[0.991 0.009]
[1.    0.   ]
[1.    0.   ]
[1.    0.   ]
[1.    0.   ]
[0.    1.   ]]

# Check the classes
# Checking if the classes are appropriate.
lda_model.classes_
array(['benign', 'malignant'], dtype='<U9')

# Generate the predicted classes
lda_prob = np.empty(len(X_test), dtype=object)
lda_prob = np.where(lda_prediction[:, 0]>=0.5, 'benign', 'malignant')

#Confusion Matrix
print(pd.crosstab(lda_prob,Y_test))

class      benign  malignant
row_0
benign        318         10
malignant       3        152

# Print the accuracy
print('Accuracy of the predictions using the LDA')
np.mean(lda_prob == Y_test) #Accuracy

Accuracy of the predictions using the LDA

0.9730848861283644

The LDA method has shown accuracy of ~97%.

The main difference between the QDA and LDA relies on the fact that QDA has a multivariate normal distribution with different covariances for each y_j. In the quadratic discriminant analysis we have a separate \mu_j and covariance matrix \sum for each class j. Therefore, we can have \sum_j instead of . only \sum. If we apply the Bayes rule, we are able to rewrite the discriminant. We will start from the later expression:

P(Y = y_j | X = x_i) = C~f_k(x_i) \pi_j = \frac{C\pi_j}{{2\pi}^{\frac{p}{2}}\vert {\sum}_j \vert^{\frac{1}{2}}}e^{-\frac{1}{2} (x_i-\mu_k)^T \sum^{-1} (x_i-\mu_k)} \\= C’ \vert{\sum}_j\vert^{-\frac{1}{2}} \pi_j e^{-\frac{1}{2} (x_i-\mu_j)^T {\sum}^{-1} (x_i-\mu_j)}

and again, by applying the \log on both sides we have:

\log{P(Y = y_j | X = x_i)} = \log \left[{ C’ \vert{\sum}_j\vert^{-\frac{1}{2}} \pi_j e^{-\frac{1}{2} (x_i-\mu_j)^T \sum^{-1} (x_i-\mu_j)}}\right] = \log{C’} -{\frac{1}{2}} \log{\vert{\sum}_j\vert} + \log{\pi_j} + [-\frac{1}{2} (x_i-\mu_j)^T {\sum}_j^{-1} (x_i-\mu_j)] \\ = \log{C’}-{\frac{1}{2}} \log{\vert{\sum}_j\vert} – \frac{1}{2}x_i^T {\sum}_j^{-1}x_i + \log{\pi_j} – \frac{1}{2}\mu_j^T {\sum}_j^{-1}\mu_j + x_i^T {\sum}_j^{-1}\mu_j

where \log{C’} is independent of the class y_j. In this case, we need to maximize only the terms that are dependent on y_j. Therefore, the term \frac{1}{2}x_i^T {\sum}_j^{-1}x_i remains intact:

\log{P(Y = y_j | X = x_i)} = C”+ \log{\pi_j} -{\frac{1}{2}} \log{\vert{\sum}_j\vert} – \frac{1}{2}x_i^T {\sum}_j^{-1}x_i – \frac{1}{2}\mu_j^T {\sum}_j^{-1}\mu_j + x_i^T {\sum}_j^{-1}\mu_j= C” +\delta^{QDA}_j(x_i)

Finally, \delta_j(x_i) is the discriminant term that needs to be maximized. We need to predict the highest value for the discriminant term \delta^{QDA}_j(x_i) given the conditions assigned by the feature vector from patient x_i, where:

\delta^{QDA}_j(x_i)= \log{\pi_j} -{\frac{1}{2}} \log{\vert{\sum}_j\vert} – \frac{1}{2}x_i^T {\sum}_j^{-1}x_i – \frac{1}{2}\mu_j^T {\sum}_j^{-1}\mu_j + x_i^T {\sum}_j^{-1}\mu_j

We will use sklearn with the training and testing dataset to predict the accuracy of our model using the function sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis().

qda_model = discriminant_analysis.QuadraticDiscriminantAnalysis()
qda_model.fit(X_train,Y_train.values.ravel())

qda_prediction = qda_model.predict_proba(X_test)

# Print the first 10 predicted probabilities:
with np.printoptions(precision=3, suppress=True):
print(qda_prediction[:10])

[[1. 0.]
[1. 0.]
[1. 0.]
[0. 1.]
[0. 1.]
[1. 0.]
[1. 0.]
[1. 0.]
[1. 0.]
[0. 1.]]

qda_prob = np.empty(len(X_test), dtype=object)
qda_prob = np.where(qda_prediction[:, 0]>=0.5, 'benign', 'malignant')

print(pd.crosstab(qda_prob,Y_test)) #Confusion Matrix

class      benign  malignant
row_0
benign        307          3
malignant      14        159

print('Accuracy of the predictions using the LDA')
np.mean(qda_prob == Y_test) #Accuracy

Accuracy of the predictions using the LDA

0.9648033126293996

QDA also shows very high accuracy of 96%.

The assumptions from the LDA:

• LDA usually contains data distributed normally and a class-mean vector.
• LDA assumes a standard covariance matrix, therefore common to all classes in a data set.

The LDA can be approximated through the Bayes classifier, and \delta_j^{LDA} assumes a linear decision boundary. However, the LDA generally achieves good performances, even without these assumptions being respected. LDA is more flexible than QDA because we need to estimate fewer parameters since we do not have a covariance matrix for each class. Furthermore, small datasets are enough to decrease the variance for the LDA. However, the bias-variance trade-off needs to be considered in the case of QDA, since for multiple K classes with different covariance matrices, LDA suffers from higher bias, and QDA should be a better option. How do you decide which method is more appropriate? This is a question for a new Aicavity Connect Insights.