How you choose metrics determines how machine learning algorithms are measured and compared. It also brings to the fore which characteristics of the results matter most, and ultimately which algorithm you choose. Evaluation is the first and most important step in choosing any machine learning algorithm. A model may pass muster under one metric, say accuracy, yet fare poorly when evaluated against another, such as logarithmic loss.
Although classification accuracy is the most popular metric, it is often not enough to truly judge a model. In this post, we will cover the different types of evaluation metrics available.
Classification Accuracy
Classification problems are the most common type of machine learning problem, and classification accuracy is their most common evaluation metric; it is also the most misused. Classification accuracy is the number of correct predictions made as a ratio of all predictions made.
However, it is really only suitable when there is an equal number of observations in each class.
For example, consider that there are 98% samples of class A and 2% samples of class B in our training set. Then our model can easily get 98% training accuracy by simply predicting every training sample belonging to class A.
When the same model is tested on a test set with 60% samples of class A and 40% samples of class B, the test accuracy drops to 60%. Classification accuracy is easy to compute but can give us a false sense of achieving high performance.
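The imbalanced-class pitfall above can be reproduced in a few lines. This is an illustrative sketch (the class counts mirror the article's example; the "model" is just a majority-class predictor):

```python
from sklearn.metrics import accuracy_score

# Training set: 98 samples of class "A", 2 of class "B"
y_train = ["A"] * 98 + ["B"] * 2
# A degenerate "model" that always predicts the majority class
train_preds = ["A"] * len(y_train)
print(accuracy_score(y_train, train_preds))  # 0.98

# Test set: 60% class "A", 40% class "B"
y_test = ["A"] * 60 + ["B"] * 40
test_preds = ["A"] * len(y_test)
print(accuracy_score(y_test, test_preds))  # 0.6
```

The predictor never learns anything about class B, yet its training accuracy looks excellent.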
Area Under Curve
Area Under Curve (AUC) is one of the most widely used evaluation metrics for binary classification problems. The AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. Before defining AUC, let us understand two basic terms:
- True Positive Rate (Sensitivity): True Positive Rate is defined as TP/ (FN+TP). True Positive Rate corresponds to the proportion of positive data points that are correctly considered as positive, with respect to all positive data points.
- False Positive Rate (1 − Specificity): False Positive Rate is defined as FP / (FP+TN). It corresponds to the proportion of negative data points that are mistakenly considered positive, with respect to all negative data points. (Specificity itself is TN / (TN+FP), so FPR = 1 − Specificity.)
False Positive Rate and True Positive Rate both have values in the range [0, 1]. FPR and TPR are both computed at a series of threshold values (e.g. 0.00, 0.02, 0.04, …, 1.00) and plotted. AUC is the area under the resulting curve of True Positive Rate plotted against False Positive Rate.
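The threshold sweep and the final area can both be obtained from scikit-learn. A minimal sketch with made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]             # actual labels
y_scores = [0.1, 0.4, 0.35, 0.8]  # model's predicted probability of class 1

# roc_curve sweeps decision thresholds and returns FPR and TPR at each one
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# AUC is the area under the TPR-vs-FPR curve
print(roc_auc_score(y_true, y_scores))  # 0.75
```

Note that AUC is computed from the model's scores, not its hard class predictions, so it is insensitive to the choice of any single threshold.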
Logarithmic Loss
Logarithmic Loss, or Log Loss, works by penalizing false classifications. It is a performance metric for classifiers that assign a probability to each class for every sample, and it works well for multi-class classification.
The scalar probability between 0 and 1 can be seen as a measure of confidence for a prediction by an algorithm.
For N samples and M classes, Log Loss is defined as:

Log Loss = −(1/N) × Σᵢ Σⱼ y_ij × log(p_ij)

where:
- y_ij indicates whether sample i belongs to class j or not
- p_ij is the probability of sample i belonging to class j
Log Loss has no upper bound; it lies in the range [0, ∞). A Log Loss closer to 0 indicates higher accuracy, while a Log Loss far from 0 indicates lower accuracy.
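The penalty for poorly calibrated probabilities is easy to see in code. In this sketch (illustrative data, not from the article), two sets of predicted probabilities are scored against the same labels: one confident and correct, one hesitant:

```python
from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]
# Each row gives the predicted probability of [class 0, class 1]
confident = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8], [0.8, 0.2]]
hesitant = [[0.6, 0.4], [0.4, 0.6], [0.5, 0.5], [0.5, 0.5]]

print(log_loss(y_true, confident))  # ~0.16 (closer to 0: better)
print(log_loss(y_true, hesitant))   # ~0.60
```

Confident wrong predictions are penalized even more heavily: assigning probability near 0 to the true class drives the log term, and hence the loss, toward infinity.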
Confusion Matrix
The confusion matrix, as the name suggests, gives us a matrix as output and describes the complete performance of the model.
Let’s assume we have a binary classification problem, with samples belonging to two classes: YES or NO. We also have a classifier that predicts a class for a given input sample. Testing our model on 165 samples gives the following result.
There are four important terms:
True Positives: The cases in which we predicted YES and the actual output was also YES.
True Negatives: The cases in which we predicted NO and the actual output was NO.
False Positives: The cases in which we predicted YES and the actual output was NO.
False Negatives: The cases in which we predicted NO and the actual output was YES.
Accuracy can be calculated from the matrix by summing the values lying on the “main diagonal” (TP + TN) and dividing by the total number of samples, i.e.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Confusion Matrix forms the basis for the other types of metrics.
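Here is a minimal sketch of the YES/NO setup with hypothetical counts (not the article's 165-sample figure), showing how the four terms and accuracy fall out of the matrix:

```python
from sklearn.metrics import confusion_matrix

y_true = ["NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO"]
y_pred = ["NO", "YES", "YES", "YES", "NO", "NO", "YES", "NO"]

# Rows are actual classes, columns are predicted classes;
# fixing labels=["YES", "NO"] puts the positive class first
cm = confusion_matrix(y_true, y_pred, labels=["YES", "NO"])
tp, fn = cm[0]
fp, tn = cm[1]

# Accuracy = main diagonal sum over total samples
accuracy = (tp + tn) / cm.sum()
print(accuracy)  # 0.75
```

Precision, recall, and F1 can all be read off the same four counts, which is why the confusion matrix underlies the other metrics.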
F1 Score
F1 Score is the harmonic mean of precision and recall. The range for F1 Score is [0, 1]. It tells you how precise your classifier is (how many of the instances it flags are classified correctly), as well as how robust it is (it does not miss a significant number of instances).
High precision but low recall gives you a classifier that is extremely accurate on the instances it flags, but misses a large number of instances that are difficult to classify. The greater the F1 Score, the better the performance of our model. Mathematically, it can be expressed as:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
F1 Score tries to find the balance between precision and recall.
- Precision: It is the number of correct positive results divided by the number of positive results predicted by the classifier.
- Recall: It is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).
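A short sketch on toy labels (illustrative data) ties the three quantities together:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
r = recall_score(y_true, y_pred)     # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)        # harmonic mean: 2pr / (p + r)

print(p, r, f1)  # 0.75 0.75 0.75
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values: a classifier with precision 1.0 but recall 0.1 scores only about 0.18, so neither quantity can be neglected.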
Cognitive View uses machine learning models to analyze unstructured customer communications data, giving comprehensive insight into your firm’s obligations and a real-time view of compliance adherence. Based on ML model performance evaluation best practices, mechanisms have been built into the platform to measure the performance and transparency of all its AI models.
You can see a case study on choosing a medicine at a low cost, which applies all the metrics discussed in this article.