There are various machine learning algorithms like KNN, Naive Bayes, Logistic Regression, Linear Regression, SVM, ensemble models, etc. But how do you measure the performance of these models? Is there a single metric that defines how good a model is? In this blog I will discuss the important KPIs (Key Performance Indicators) used to measure the performance of a machine learning model.

  1. Accuracy
  2. Confusion matrix
  3. Precision, Recall & F1 score
  4. Receiver Operating Characteristic Curve (ROC)
  5. Log-loss
  6. R-Squared Coefficient
  7. Median absolute deviation (MAD)
  8. Error Distribution

Sample dataset: I will use a sample test dataset to explain the metrics. The sample data contains 15 positive Y’s & 15 negative Y’s, making it a simple binary classification problem.

P.S: We measure the performance of models only on TEST data, NOT on train data or cross-validation data.

Accuracy

Accuracy is the easiest metric to understand and to implement. It is simply the ratio of correctly classified data points to the total number of data points.

Example: For the above sample dataset, if a model predicts 13 positive points & 14 negative points correctly, then

\[Accuracy = (13+14)/30\] \[Accuracy = 0.9\]
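
A minimal sketch of this computation with scikit-learn; the labels and predictions below are made up to match the counts above:

```python
from sklearn.metrics import accuracy_score

# Hypothetical test labels for the sample dataset: 15 positives, 15 negatives
y_true = [1] * 15 + [0] * 15

# Hypothetical predictions: 13 positives and 14 negatives classified correctly
y_pred = [1] * 13 + [0] * 2 + [0] * 14 + [1] * 1

print(accuracy_score(y_true, y_pred))  # 27/30 = 0.9
```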

Cons

  1. Accuracy is a bad metric for a highly imbalanced dataset: Consider a highly imbalanced dataset like credit card fraud detection, where more than 99% of the data points are negative. Let’s say we have a very dumb model which always predicts negative, irrespective of the given data point (i.e., X). If we calculate the accuracy of this dumb model, it would be more than 0.99. So even a dumb model shows an accuracy of 0.99, which is obviously misleading.

  2. Cannot take probability scores: Accuracy does not take probability scores into consideration; it only considers the predicted Y values. For example,

Let’s say we have two models, M1 & M2.

| ID | Y | M1_prob | M2_prob | M1_pred | M2_pred |
|----|---|---------|---------|---------|---------|
| 1  | 1 | 0.90    | 0.55    | 1       | 1       |
| 2  | 0 | 0.10    | 0.45    | 0       | 0       |
| 3  | 1 | 0.88    | 0.51    | 1       | 1       |
| 4  | 1 | 0.93    | 0.60    | 1       | 1       |
| 5  | 0 | 0.01    | 0.48    | 0       | 0       |

Where,

Y : real Y values
M1_prob: probability of Y = 1 for model M1
M2_prob: probability of Y = 1 for model M2
M1_pred: predicted Y values from model M1
M2_pred: predicted Y values from model M2

As we can see, both M1 & M2 predict the same Y’s & therefore have the same accuracy. But if you look at the probability scores, model M1 is performing better than model M2, since the probability scores of M1 are high, whereas the probability scores of M2 are close to 0.5.
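
A small sketch of this point, using the (hypothetical) numbers from the table above and assuming a 0.5 threshold for both models:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true  = np.array([1, 0, 1, 1, 0])
m1_prob = np.array([0.90, 0.10, 0.88, 0.93, 0.01])
m2_prob = np.array([0.55, 0.45, 0.51, 0.60, 0.48])

# Threshold the probabilities at 0.5 to get hard predictions
m1_pred = (m1_prob >= 0.5).astype(int)
m2_pred = (m2_prob >= 0.5).astype(int)

# Both models get the same accuracy (1.0) even though M1 is far more confident
print(accuracy_score(y_true, m1_pred), accuracy_score(y_true, m2_pred))
```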

Confusion matrix:

A confusion matrix is a matrix used to describe the performance of a classification model. It addresses the problem the accuracy metric has with imbalanced datasets.

Let’s consider a binary classification problem, i.e., Y belongs to {0, 1}, 0 being the negative label & 1 being the positive label. Let Y be the actual values & Y^ be the predicted values.

[Figure: confusion matrix]

where,

TN (True Negative) : number of data points for which Y = 0 & Y^ = 0

FN (False Negative) : number of data points for which Y = 1 & Y^ = 0

FP (False Positive) : number of data points for which Y = 0 & Y^ = 1

TP (True Positive) : number of data points for which Y = 1 & Y^ = 1

Tip to remember: Here is a tip to remember, which I learned from the Applied AI course.

[Figure: tip for remembering the confusion matrix cells]

Example: Let’s consider the case Y = 0, Y^ = 0. Here the predicted value is 0, i.e., Negative (N), and it is correct (T), since the actual value is also negative. So it becomes a True Negative (TN).

If we have a multi-class classification problem, then the confusion matrix size will be CxC, where C is the number of class labels.

[Figure: confusion matrix for multi-class classification]

If the model you trained is very good, then all the principal diagonal values will be high & the off-diagonal values will be low or zero.

Let P & N be the total number of positive & negative test data points respectively. Then

\[TPR (True Positive Rate) = TP/P\] \[TNR (True Negative Rate) = TN/N\] \[FPR (False Positive Rate) = FP/N\] \[FNR (False Negative Rate) = FN/P\]

In layman language,

TPR = Percentage of actual positive points that are correctly classified

TNR = Percentage of actual negative points that are correctly classified

FPR = Percentage of actual negative points that are incorrectly classified as positive

FNR = Percentage of actual positive points that are incorrectly classified as negative.

Example: Consider a highly imbalanced dataset like credit card fraud detection, where more than 99% of the data points are negative. Let’s say we have a very dumb model which always predicts negative, irrespective of the given data point (i.e., X). If we compute the confusion matrix for this dumb model, the TNR will be 100% & the TPR will be zero, but the FNR will be 100%, which indicates that the model is very dumb.

The confusion matrix gives us 4 numbers, so how will one know which one is important? Well, it depends on the dataset.

Example: If the dataset is for cancer detection or credit card fraud detection, then we want a high TPR and a very low or zero FNR.
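
For reference, here is a rough sketch of how the confusion matrix cells and the four rates could be computed with scikit-learn; the labels and predictions below are made-up values:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical test labels and predictions for a binary problem
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

P, N = tp + fn, tn + fp          # total actual positives / negatives
print("TPR =", tp / P)           # true positive rate
print("TNR =", tn / N)           # true negative rate
print("FPR =", fp / N)           # false positive rate
print("FNR =", fn / P)           # false negative rate
```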

Cons

  1. Cannot take probability scores: The confusion matrix does not take probability scores into consideration; it only considers the predicted Y values.

Precision, Recall, F1 score

Precision & Recall are used in information retrieval, pattern recognition & binary classification. Please refer to the Wiki for more info. Precision & Recall consider only the positive class label.

\[Precision = \frac{TP}{FP+TP}\] \[Recall = \frac{TP}{FN+TP}\]

In layman language,

Precision: Of all the predicted POSITIVE labels, how many of them are actually POSITIVE

Recall: Of all the actual POSITIVE labels, how many of them are correctly predicted as POSITIVE.

We always want Precision & Recall to be high. Precision & Recall values range from 0 to 1.

F1 Score: It is a single measure that considers both Precision & Recall. It is the harmonic mean of Precision & Recall, and its value ranges from 0 to 1.

\[F1 score = 2 *\frac{Precision*Recall}{Precision+Recall}\]

This metric is extensively used in Kaggle competitions. But Precision & Recall are more interpretable; in other words, the F1 score is harder to interpret on its own.
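
A minimal sketch with scikit-learn, assuming made-up binary labels and predictions:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical binary test labels and predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3/4
recall    = recall_score(y_true, y_pred)      # TP / (TP + FN) = 3/4
f1        = f1_score(y_true, y_pred)          # harmonic mean of the two
print(precision, recall, f1)
```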

Modification of F1 Score For Multi label classification

In Multi-label Classification, multiple labels may be assigned to each instance and there is no constraint on how many of the classes the instance can be assigned to. Source: Wiki

There are two variations of F1 Scores

Micro F1 Score:

Calculate metrics globally by counting the total true positives, false negatives and false positives.

Let’s say we have C labels; then the Micro F1 score is defined as

\[F1_{micro} = 2 *\frac{Precision_{micro}*Recall_{micro}}{Precision_{micro}+Recall_{micro}}\]

where,

\[{Precision_{micro}}=\frac{\sum_{k∈C}TP_k}{\sum_{k∈C}(TP_k + FP_k)}\] \[{Recall_{micro}}=\frac{\sum_{k∈C}TP_k}{\sum_{k∈C}(TP_k + FN_k)}\]

This is a better metric when we have class imbalance, since it takes the label frequency into account.

Macro F1 score:

It is the simple average of the F1 scores of all labels. This does not take the label frequency into account.

\[F1_{macro} = \frac{1}{|C|}\sum_{k∈C}F1Score_k\]

For more info on the F1 score, click here.
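
As a rough illustration, scikit-learn’s `f1_score` exposes both averaging schemes via its `average` parameter; the multi-class labels below are made up:

```python
from sklearn.metrics import f1_score

# Hypothetical multi-class labels (the same idea extends to multi-label targets)
y_true = [0, 1, 2, 0, 1, 2, 0, 2]
y_pred = [0, 2, 2, 0, 1, 1, 0, 2]

# Micro: pools TP/FP/FN over all labels, so frequent labels dominate
print(f1_score(y_true, y_pred, average="micro"))

# Macro: unweighted mean of the per-label F1 scores
print(f1_score(y_true, y_pred, average="macro"))
```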

Receiver Operating Characteristic Curve (ROC)

The ROC curve is created by plotting TPR against FPR. The ROC curve metric is applicable only to binary classification. Here is the step-by-step procedure to draw the ROC curve.

Let’s assume the model we are using gives some kind of score, like a probability score (predict_proba), such that a higher score indicates a higher probability of the data point belonging to class 1. Let Y^ denote the predicted probability scores.

1 = POSITIVE class

0 = NEGATIVE class

STEP 1: Sort Y^ in descending order.

| X | Y (actual) | Y^ prob(Y==1) |
|---|------------|---------------|
| \({x_1}\) | 1 | 0.94 |
| \({x_2}\) | 1 | 0.87 |
| \({x_3}\) | 1 | 0.83 |
| \({x_4}\) | 0 | 0.80 |

STEP 2: Thresholding (\(\tau\)).

For each of the probability scores (let’s say \({\tau_i}\)) in the Y^ column, if Y^ >= \({\tau_i}\), then the predicted label is positive (1); otherwise it is negative (0).

| X | Y (actual) | Y^ prob(Y==1) | Y_predicted \(\tau\) = 0.94 | Y_predicted \(\tau\) = 0.87 | Y_predicted \(\tau\) = 0.83 | Y_predicted \(\tau\) = 0.80 |
|---|------------|---------------|-----------------------------|-----------------------------|-----------------------------|-----------------------------|
| \({x_1}\) | 1 | 0.94 | 1 | 1 | 1 | 1 |
| \({x_2}\) | 1 | 0.87 | 0 | 1 | 1 | 1 |
| \({x_3}\) | 1 | 0.83 | 0 | 0 | 1 | 1 |
| \({x_4}\) | 0 | 0.80 | 0 | 0 | 0 | 1 |

If we have N data points, then we will have N thresholds.

STEP 3: For each of the Y_predicted (\({\tau_i}\)), calculate TPR & FPR.

In our example

| \(\tau\) | TPR | FPR |
|----------|-----|-----|
| 0.94 | 0.33 | 0 |
| 0.87 | 0.66 | 0 |
| 0.83 | 1 | 0 |
| 0.80 | 1 | 1 |

If we have N data points, then we will have N (TPR, FPR) pairs.

STEP 4: Plot TPR vs FPR using the N (TPR, FPR) pairs. Typically the graph looks like this:

[Figure: ROC curve]

P.S: The example (TPR,FPR) pairs have not been plotted in the above graph

Random Model: a model which randomly generates 0’s & 1’s

Good Model: Model which has high TPR & low FPR

Area Under the ROC Curve (AUC): AUC is the area under the ROC curve. The larger the area under the curve, the better the model. A random model has an AUC of 0.5.
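
A small sketch of steps 1–4 using scikit-learn’s `roc_curve` and `roc_auc_score`, applied to the toy example above:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# The toy example above: 3 positives, 1 negative, with their predicted scores
y_true  = np.array([1, 1, 1, 0])
y_score = np.array([0.94, 0.87, 0.83, 0.80])

# roc_curve sweeps the thresholds and returns the (FPR, TPR) pairs
# (drop_intermediate=False keeps one point per threshold, plus an extra starting point)
fpr, tpr, thresholds = roc_curve(y_true, y_score, drop_intermediate=False)
print(list(zip(thresholds, tpr, fpr)))

# Area under the ROC curve; 1.0 here since every positive is scored above the negative
print(roc_auc_score(y_true, y_score))
```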

Cons

  1. Imbalanced dataset: For an imbalanced dataset, even a dumb model might give a high AUC, which can be misleading.

  2. Depends only on the ORDERing of the scores & not on the scores themselves.

  3. Works only for binary classification problems

Converting a dumb model into a reasonably good model

A dumb model is one which has an AUC < 0.5. In that case, just swap the class labels, i.e., change 1 to 0 & 0 to 1 (Y_predicted = 1 - Y_predicted) to make it a reasonably good model.
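
In terms of probability scores, the same trick amounts to using 1 - score. A toy sketch, assuming made-up scores whose ranking is reversed:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical scores from a "dumb" model that ranks negatives above positives
y_true  = np.array([1, 1, 1, 0, 0, 0])
y_score = np.array([0.1, 0.2, 0.3, 0.8, 0.9, 0.7])

print(roc_auc_score(y_true, y_score))        # well below 0.5
print(roc_auc_score(y_true, 1 - y_score))    # flipping the scores gives 1 - AUC
```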

Log loss

Log loss is an important metric & it uses the probability scores to calculate the loss. This metric can be used for both binary classification & multi-class classification, and it ranges from 0 to \(\infty\) (infinity). Log loss is a loss, so we always want to minimize it: the lower the loss, the better the model.

\[Log Loss = -\frac{1}{n}\sum_{i=1}^n \sum_{j=1}^c y_{ij}*log(p_{ij})\]

where,

log = natural log

n = number of datapoints

c = number of class labels

\(y_{ij}\) = 1 if data point \(x_{i}\) belongs to class j else 0

\(p_{ij}\) = probability of \(x_i\) belonging to class j

Example: Let’s consider a binary classification dataset

| X | Y (actual) | Y^ prob(Y==1) | log loss | log loss score |
|---|------------|---------------|----------|----------------|
| \({x_1}\) | 1 | 0.94 | -log(0.94) | 0.0619 |
| \({x_2}\) | 0 | 0.20 | -log(1-0.20) | 0.2231 |
| \({x_3}\) | 1 | 0.70 | -log(0.70) | 0.3567 |
| \({x_4}\) | 1 | 0.49 | -log(0.49) | 0.7133 |

As we can see from the above table, the data point \({x_1}\) has a high probability of belonging to class label 1, so its log loss is comparatively small, whereas for the data point \({x_4}\) the log loss is higher because its probability is almost 0.5, i.e., the probability of \({x_4}\) belonging to 1 or 0 is almost the same.
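
A small sketch reproducing the table above: per-point losses with the natural log, and the mean via scikit-learn’s `log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

# The toy example from the table above (natural log, as in the formula)
y_true   = [1, 0, 1, 1]
p_class1 = [0.94, 0.20, 0.70, 0.49]

# Per-point losses: -log(p) for class 1, -log(1 - p) for class 0
per_point = [-np.log(p) if y == 1 else -np.log(1 - p)
             for y, p in zip(y_true, p_class1)]
print(per_point)                 # [0.0619, 0.2231, 0.3567, 0.7133]

# sklearn's log_loss returns the mean of the per-point losses
print(log_loss(y_true, p_class1))
```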

Cons

Difficult to interpret the log loss score.

R-Squared Coefficient

This metric is used to measure the performance of a regression model (Y belongs to the real numbers). Let Y^ denote the predicted values & Y the actual values.

| X | Y (actual) | Y^ |
|---|------------|-----|
| \({x_1}\) | 1.90 | 1.94 |
| \({x_2}\) | 1.34 | 1.20 |
| \({x_3}\) | 0.34 | 0.56 |

Sum of Squares (total): The simplest regression model one can build is the average model, i.e., the predicted value is the average of all actual Y values.

\[{SS_{total}} = \sum_{i=1}^n (y_i- \overline{y})^2\]

where, \(\overline{y}\) is the average of all y values

Sum of Squares (residual): The residual is the error, i.e., the difference between the predicted value & the actual value. Let \(\hat{y}\) be the value predicted by a regression algorithm such as linear regression.

\[{SS_{res}} = \sum_{i=1}^n (y_i-\hat{y_i})^2\]

R Squared: It combines both \({SS_{total}}\) & \({SS_{res}}\)

\[R^2 = 1- \frac{SS_{res}}{SS_{total}}\]

Case 1: \({SS_{res}} = 0\)

This is the best possible case, i.e., the predicted values are equal to the actual values, so there are zero errors; therefore \(R^2 = 1\) in the best case.

Case 2: \({SS_{res}} < {SS_{total}}\)

This means our model is performing better than the average model. In this case the value of \(R^2\) lies between 0 and 1.

Case 3: \({SS_{res}} = {SS_{total}}\)

This means our model is performing as good as the average model. So \(R^2 = 0\)

Case 4: \({SS_{res}} > {SS_{total}}\)

This means our model is performing worse than the average model. So \(R^2 < 0\)
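
A minimal sketch of the \(R^2\) computation on the toy table above, both by hand and via scikit-learn’s `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# The toy regression example from the table above
y_true = np.array([1.90, 1.34, 0.34])
y_pred = np.array([1.94, 1.20, 0.56])

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares (average model)

print(1 - ss_res / ss_tot)          # manual R^2
print(r2_score(y_true, y_pred))     # same value from scikit-learn
```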

Cons

\(R^2\) is not very robust to outliers

Median absolute deviation (MAD)

It overcomes the problem we had with \(R^2\): the median is robust to outliers.

Median: It is the central tendency or midpoint of the observed values

e: It is the difference between the actual value & the predicted value, i.e.,

\[e = y - \hat{y}\]

MAD: It is the median of the absolute difference between the error (e) and the median of errors for all data points

\[MAD = median(|e - {median_e}|)\]

where,

\({median_e}\) is the median of the errors
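
A minimal sketch of MAD with NumPy, using made-up actual and predicted values (the last point acts as an outlier):

```python
import numpy as np

# Hypothetical actual and predicted values for a regression problem
y_true = np.array([1.90, 1.34, 0.34, 2.10, 5.00])   # last point behaves like an outlier
y_pred = np.array([1.94, 1.20, 0.56, 2.00, 2.50])

errors = y_true - y_pred                              # e = y - y^
mad = np.median(np.abs(errors - np.median(errors)))   # median absolute deviation of errors
print(mad)
```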

Error Distribution

By plotting the PDF & CDF of the errors, we can see the distribution of the errors & hence measure the performance of the models.

[Figure: PDF & CDF of the errors]

Observations:

PDF: As we can see, most of the errors have small values, whereas a few errors have larger values.

CDF: 95% of the errors have values less than 0.1 & only 5% of the errors have values > 0.1.

Compare two models using CDFs of errors

[Figure: comparison of the CDFs of errors of models M1 & M2]

Model M1: 95% of the errors have values less than 0.1

Model M2: 90% of the errors have values less than 0.1

Since we always want the errors to be zero or close to zero, Model M1 is better than Model M2, because a larger fraction of errors is close to zero for Model M1 than for Model M2.
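
A rough sketch of this comparison with NumPy, using synthetic error values; the fraction of errors below 0.1 is just the empirical CDF evaluated at 0.1:

```python
import numpy as np

# Synthetic absolute errors for two hypothetical models on the same test set
errors_m1 = np.abs(np.random.default_rng(0).normal(0.0, 0.05, 1000))
errors_m2 = np.abs(np.random.default_rng(1).normal(0.0, 0.08, 1000))

# Empirical CDF evaluated at 0.1: the fraction of errors smaller than 0.1
print("M1:", np.mean(errors_m1 < 0.1))
print("M2:", np.mean(errors_m2 < 0.1))

# To plot a full empirical CDF, sort the errors and plot them against i/n
sorted_e = np.sort(errors_m1)
cdf = np.arange(1, len(sorted_e) + 1) / len(sorted_e)
```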


Thanks for reading. Please contact me regarding any queries or suggestions.