Evaluation Basics¶
The following topics are meant to be a quick introduction to the basic metrics we use in the ylab to evaluate the performance of models that classify labeled data.
Testing and Training¶
It is important to evaluate models on data that were not used in the training process, referred to as test data. Careful evaluation is crucial in computational biology (as well as in ML in general!) to avoid models that overfit to patterns present in the training data. This is an easy pitfall, and we must always be on the lookout for models that seem too good to be true. Having a completely independent validation set is best whenever possible. For example, when training a classifier on TCGA data to predict whether a sample is a pancreatic tumor or healthy tissue, an independent cancer dataset from a different group would be a valuable test set in addition to cross-validation. While it is not always feasible to have completely independent data for testing, we should strive to use as much independent data as possible to ensure model generality.
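As a minimal sketch of a held-out test split with scikit-learn (a synthetic dataset and a logistic regression classifier stand in for real data and a real model here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labeled dataset (samples x features).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Hold out 20% of the samples; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```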
Cross-Validation¶
Cross-validation (or CV) should be done for any given model as a quick way to check performance. CV refers to the practice of partitioning the training set so that part of it is withheld for evaluation. The amount withheld for a testing 'fold' can vary: the data are often partitioned into k equal parts (k-fold CV), or a single sample is held out for testing in leave-one-out CV (LOOCV); both are described in more detail below. Popular libraries with cross-validation support include scikit-learn in Python and the caret package in R.
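As a quick sketch of the scikit-learn interface (again with synthetic data and logistic regression as placeholders), cross_val_score runs the whole partition/train/evaluate loop in one call:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 5-fold CV: each sample is used for testing exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```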
k-fold Cross-Validation¶
K-fold cross-validation is a technique in which the data are divided into 'folds' to test the prediction model's performance on held-out data. Folds are most typically created by randomized shuffling and can be stratified to maintain the balance of class labels (e.g., keeping the distribution of tissues the same across folds). Of the k folds, one fold is hidden from the prediction algorithm to be used as a test set, and the remaining folds are fed into the algorithm as the training set. This process is repeated until every fold has had a chance to be the held-out test set. The k results are then summarized with performance curves (precision-recall, ROC/AUC, etc.) or reported with summary statistics (median, MCC, etc.).
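One way to make the folds explicit in scikit-learn is StratifiedKFold, which shuffles the samples while keeping the class proportions roughly equal across folds; the dataset and model below are placeholders for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Imbalanced synthetic data (roughly 70% / 30% class split).
X, y = make_classification(n_samples=200, n_features=20,
                           weights=[0.7, 0.3], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    # Train on k-1 folds, test on the held-out fold.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print("per-fold accuracy:", np.round(fold_scores, 3))
print("median accuracy:", np.median(fold_scores))
```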
Leave-One-Out Cross-Validation¶
In leave-one-out cross-validation, models are trained on all of the data except for one sample, which is used for testing. Thus, for a dataset with N=10 samples, 10 models would be trained and tested. LOOCV gives lower bias (each training set is nearly the full dataset) and higher variance (each prediction is made on a single sample) in the test metrics. Practically, LOOCV is only preferable to k-fold CV when the total number of samples is small.
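A sketch using scikit-learn's LeaveOneOut splitter (synthetic data and logistic regression as placeholders); note that one model is fit per sample, so this gets expensive quickly:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset, where LOOCV is most defensible.
X, y = make_classification(n_samples=20, n_features=5, random_state=0)

# Each score is the accuracy on a single held-out sample (0 or 1).
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print("models trained:", len(scores))
print("LOOCV accuracy:", scores.mean())
```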
Precision and Recall¶
For binary classification, precision and recall are summary metrics derived from the model's confusion matrix (also called an error matrix). The confusion matrix summarizes the classifier's performance with four counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). At each classification threshold, precision and recall are calculated as follows:
- precision = TP/(TP+FP)
- recall = TP/(TP+FN)
Because TP, FP, and FN are all non-negative counts and TP appears in both the numerator and the denominator, precision and recall each lie between 0 and 1.
A metric that combines both precision and recall into one score is the F-measure. The most commonly used F-measure is the F1-score, which is the harmonic mean of precision and recall.
- F1_score = 2 x (precision x recall) / (precision + recall)
F1-scores also fall between 0 and 1, with values of 1 indicating perfect precision and recall.
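As a toy worked example (hand-picked labels, checked with scikit-learn's metric functions), the four counts and the three metrics can be computed directly:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy ground truth and predictions (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)               # 3, 5, 1, 1
print("precision:", precision_score(y_true, y_pred))   # 3 / (3 + 1) = 0.75
print("recall:   ", recall_score(y_true, y_pred))      # 3 / (3 + 1) = 0.75
print("F1:       ", f1_score(y_true, y_pred))          # harmonic mean = 0.75
```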
A precision-recall curve can be generated by plotting precision at different levels of recall. Below are some examples of baseline, good, and perfect precision-recall curves.

The baseline curve corresponds to the model classifying every input as positive, resulting in a constant precision, equal to the fraction of positive samples, at every level of recall.
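A sketch of how such a curve might be generated with scikit-learn's precision_recall_curve (synthetic data and logistic regression used as placeholders, matplotlib for plotting):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]          # positive-class scores

# Precision and recall at every threshold implied by the scores.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
plt.plot(recall, precision, label="model")
# Baseline: labeling everything positive gives precision equal to the
# positive-class prevalence at every level of recall.
plt.axhline(y_test.mean(), linestyle="--", label="baseline")
plt.xlabel("recall")
plt.ylabel("precision")
plt.legend()
plt.show()
```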
ROC/AUC¶
The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are also derived from the classifier's confusion matrix. The ROC curve plots the true positive rate (TPR, sensitivity) against the false positive rate (FPR, 1 - specificity) of the classifier at various thresholds, calculated using the following:
- TPR = TP/(TP+FN)
- FPR = FP/(FP+TN)
The AUC is calculated by taking the area under the ROC curve. The value ranges from 0 to 1 and indicates the diagnostic ability of the model. Since the AUC encompasses all thresholds, it is a single number that is both scale-invariant and threshold-invariant. Popular libraries for ROC analysis include pROC in R and scikit-learn in Python. Below are some ROC curves with perfect, moderate, and random-chance AUC.

Since the ROC curve and AUC summarize the diagnostic ability of the model, classifiers are expected to beat the random-chance AUC of 0.5 in practice. For classifiers aiming at clinical implementation, a near-perfect AUC (greater than 0.90) is generally considered competitive.
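A minimal sketch with scikit-learn's roc_curve and roc_auc_score (synthetic data and logistic regression as placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# TPR and FPR at each threshold implied by the scores, plus the area
# under the resulting curve.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
```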
Matthews Correlation Coefficient¶
The Matthews Correlation Coefficient (MCC), also known as the phi coefficient, is another summary metric for the confusion matrix. Unlike the metrics above, MCC uses all four entries of the matrix (TP, TN, FP, and FN) and can be calculated as follows:
- MCC = (TP x TN - FP x FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
Due to their heavy reliance on TP, the previous metrics can be biased toward the majority class (the class with a much larger sample size than its counterpart) even when the model fails to classify the minority class. Because MCC uses all four entries, it reflects the model's performance with less bias, especially on unbalanced datasets. MCC ranges from -1 to +1, with 0 indicating random-chance classification. An MCC of +1 indicates perfect classification, and an MCC of -1 indicates a reversed classification in which every prediction is the opposite of the ground truth; both indicate strong correlation, just in opposite directions. Popular libraries with MCC include mccr.
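scikit-learn also implements this metric as matthews_corrcoef; here is a toy sketch of the majority-class problem described above:

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Unbalanced toy example: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # always predicts the majority class

# Accuracy looks respectable, but MCC is 0: the classifier never
# identifies the minority class.
print("accuracy:", accuracy_score(y_true, y_pred))     # 0.8
print("MCC:     ", matthews_corrcoef(y_true, y_pred))  # 0.0
```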