Okay, you've spent time building your neural network, feeding it data, and guiding it through the training process. But how do you know if your model is actually any good? Just because the training loop completed without errors doesn't mean the model has learned anything useful. This is where evaluation metrics come into play. They provide quantitative measures of your model's performance, allowing you to objectively assess its effectiveness on a given task, whether it's classifying images or predicting house prices. Understanding these metrics is fundamental to comparing different models, tuning hyperparameters, and ultimately, deciding if your model is ready for its intended application.
When your model's task is to assign a category or class label to an input (e.g., "spam" or "not spam," "cat" or "dog"), you're dealing with a classification problem. Several metrics can help you understand how well your classifier is performing.
Accuracy is perhaps the most intuitive classification metric. It simply measures the proportion of predictions your model got right.
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

While straightforward, accuracy can be misleading, especially when dealing with imbalanced datasets. For instance, if 95% of your emails are not spam, a model that always predicts "not spam" will achieve 95% accuracy but will be useless for identifying actual spam.
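To make the imbalance pitfall concrete, here is a minimal sketch in plain Julia; the label vectors are made up for illustration. A classifier that always predicts the majority class still reaches 95% accuracy while never catching a single spam email.

```julia
# Toy dataset: 100 emails, 95 legitimate (0) and 5 spam (1).
actual = vcat(zeros(Int, 95), ones(Int, 5))

# A "model" that always predicts "not spam" (0).
predicted = zeros(Int, 100)

# Accuracy: fraction of predictions that match the actual labels.
accuracy = sum(predicted .== actual) / length(actual)

println("Accuracy: ", accuracy)   # 0.95, yet no spam is ever identified
```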
To get a more detailed picture of a classifier's performance, especially with imbalanced classes, we often turn to the confusion matrix. It's a table that summarizes the predictions made by a classifier by breaking them down into four categories based on the actual and predicted classes:

- True Positives (TP): positive instances correctly predicted as positive.
- True Negatives (TN): negative instances correctly predicted as negative.
- False Positives (FP): negative instances incorrectly predicted as positive.
- False Negatives (FN): positive instances incorrectly predicted as negative.
A typical layout of a confusion matrix, showing the relationship between actual and predicted class labels.
From these four values derived from the confusion matrix, we can calculate several other informative metrics.
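As a quick sketch, the four counts can be tallied directly from vectors of actual and predicted labels; the vectors below are illustrative only.

```julia
# Illustrative binary labels: 1 = positive class, 0 = negative class.
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# Tally the four confusion-matrix cells.
TP = sum((actual .== 1) .& (predicted .== 1))  # correctly flagged positives
TN = sum((actual .== 0) .& (predicted .== 0))  # correctly flagged negatives
FP = sum((actual .== 0) .& (predicted .== 1))  # negatives flagged as positive
FN = sum((actual .== 1) .& (predicted .== 0))  # positives missed by the model

println("TP=$TP TN=$TN FP=$FP FN=$FN")  # TP=3 TN=4 FP=1 FN=2
```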
Precision answers the question: "Of all the instances the model predicted as positive, how many were actually positive?" It's a measure of exactness for positive predictions.
$$\text{Precision} = \frac{TP}{TP + FP}$$

High precision is important when the cost of a false positive is high. For example, in spam detection, you want high precision to avoid marking legitimate emails (actual negatives) as spam (predicted positives).
Recall answers the question: "Of all the actual positive instances, how many did the model correctly identify?" It's a measure of completeness for positive instances.
$$\text{Recall} = \frac{TP}{TP + FN}$$

High recall is important when the cost of a false negative is high. For instance, in medical diagnosis for a serious disease (positive class), high recall is desired to minimize the chances of missing an actual case (an actual positive being predicted as negative).
Often, there's a trade-off between precision and recall. Improving one might degrade the other. The F1-score provides a way to balance both by calculating their harmonic mean. It's particularly useful when you have an uneven class distribution or when you need a single number to represent both precision and recall.
$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

The F1-score ranges from 0 to 1, with 1 being the best possible score (perfect precision and recall). It punishes extreme values more than a simple average would, meaning if either precision or recall is very low, the F1-score will also be low.
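Continuing from the confusion-matrix counts tallied in the earlier sketch, precision, recall, and the F1-score follow directly from their definitions:

```julia
# Uses the TP, FP, FN counts from the confusion-matrix sketch above.
prec = TP / (TP + FP)                    # 3 / 4 = 0.75
rec  = TP / (TP + FN)                    # 3 / 5 = 0.6
f1   = 2 * prec * rec / (prec + rec)     # ≈ 0.667

println("Precision=$prec  Recall=$rec  F1=$f1")
```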
If your model is predicting a continuous numerical value, such as the price of a stock or the temperature tomorrow, you're working on a regression problem. The evaluation metrics here focus on the magnitude of the errors between predicted and actual values.
The Mean Absolute Error measures the average absolute difference between the predicted values ($\hat{y}_i$) and the actual values ($y_i$) over $n$ samples.
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$

MAE is easy to interpret as it's in the same units as the target variable. It gives an average sense of how far off your predictions are, treating all errors with equal weight in terms of their magnitude.
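A direct translation of the formula into Julia might look like the following; the target and prediction vectors are placeholders for illustration.

```julia
using Statistics  # provides mean

# Illustrative actual values and model predictions.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

# Mean Absolute Error: average magnitude of the errors.
mae = mean(abs.(y_true .- y_pred))

println("MAE = ", mae)  # 0.5
```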
The Mean Squared Error calculates the average of the squared differences between predicted ($\hat{y}_i$) and actual values ($y_i$).
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

By squaring the errors, MSE penalizes larger errors more heavily than MAE. This can be desirable if large errors are particularly problematic for your application. However, its units are the square of the target variable's units (e.g., dollars squared if predicting price), making it less directly interpretable than MAE.
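Continuing with the same illustrative vectors from the MAE sketch:

```julia
# Mean Squared Error: average of the squared errors.
mse = mean((y_true .- y_pred) .^ 2)

println("MSE = ", mse)  # 0.375
```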
To bring the error metric back to the original units of the target variable, we often use the Root Mean Squared Error, which is simply the square root of the MSE.
$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

Like MSE, RMSE penalizes large errors significantly but has the advantage of being in the same units as the target variable, making it more interpretable than MSE. It's one of the most popular metrics for regression tasks.
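Building on the MSE computed in the previous sketch:

```julia
# Root Mean Squared Error: square root of the MSE, back in the target's units.
rmse = sqrt(mse)

println("RMSE = ", rmse)  # ≈ 0.612
```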
R-squared, or the coefficient of determination, measures the proportion of the variance in the dependent variable ($y$) that is predictable from the independent variables (features used by the model). In simpler terms, it tells you how well your model's predictions approximate the actual values compared to a simple baseline model that always predicts the mean of the target values ($\bar{y}$).
$$R^2 = 1 - \frac{\text{Sum of Squared Residuals (SSR)}}{\text{Total Sum of Squares (SST)}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

An $R^2$ value typically ranges from 0 to 1 for reasonable models:

- $R^2 = 1$: the model explains all of the variance in the target; predictions match the actual values exactly.
- $R^2 = 0$: the model performs no better than always predicting the mean $\bar{y}$.
- $R^2 < 0$: possible for poorly fitting models that perform worse than the mean baseline.
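A from-scratch sketch of $R^2$, reusing the placeholder y_true and y_pred vectors from the regression examples above:

```julia
# Coefficient of determination from the same y_true / y_pred vectors.
ss_res = sum((y_true .- y_pred) .^ 2)          # sum of squared residuals
ss_tot = sum((y_true .- mean(y_true)) .^ 2)    # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot

println("R² = ", r2)  # ≈ 0.949
```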
While a high $R^2$ is generally good, it doesn't necessarily mean the model is a good fit for new data or that the chosen model is appropriate. For instance, $R^2$ can be artificially inflated by adding more features. Always consider it alongside other metrics and diagnostic plots.
The selection of an appropriate evaluation metric is not always straightforward and heavily depends on the specific goals of your deep learning project and the characteristics of your dataset. For example:

- With imbalanced classes, accuracy alone can be deceptive; precision, recall, or the F1-score usually give a more honest picture.
- When false positives and false negatives carry different costs (spam filtering versus medical screening), weight precision or recall accordingly.
- In regression, MSE or RMSE is preferable when large errors are especially costly, while MAE treats all errors in proportion to their size.
It's common to monitor multiple metrics during model development. Understanding the trade-offs and implications of each will help you make informed decisions about your model's quality and readiness. Many of these metrics can be readily computed using functions available in Julia's data science and machine learning packages, such as those in MLJBase.jl or StatsBase.jl, or they can be implemented directly if needed for custom scenarios within your Flux.jl training loops.