Before we dive into specific metrics like accuracy or precision, let's first make sure we understand what kind of output a classification model gives us and how we arrive at a final prediction. Recall from Chapter 1 that classification tasks involve assigning inputs to predefined categories or classes. For instance, identifying if an email is "Spam" or "Not Spam," or classifying an image as containing a "Cat," "Dog," or "Bird."
Most classification algorithms don't output a hard class label directly. Instead, they produce a score or a probability for each possible class. This probability represents the model's confidence that the input belongs to that class.
Consider a simple binary classification problem like spam detection. For a given email, the model might output something like:

P(Spam) = 0.85
P(Not Spam) = 0.15
Notice that these probabilities sum to 1.0 (0.85 + 0.15 = 1.0). This is typical for many classification models. For a multi-class problem (e.g., classifying handwritten digits 0 through 9), the model would output ten probabilities, one for each digit, and these would also sum to 1.0.
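To make this concrete, here is a minimal sketch using scikit-learn's LogisticRegression, whose predict_proba method returns one probability per class. The tiny dataset and feature values are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented toy data: two features per email, label 1 = Spam, 0 = Not Spam.
X = np.array([[0.1, 0.9], [0.8, 0.2], [0.9, 0.1], [0.2, 0.8]])
y = np.array([1, 0, 0, 1])

model = LogisticRegression().fit(X, y)

# predict_proba returns one probability per class, ordered by
# model.classes_ (here [0, 1], i.e. [P(Not Spam), P(Spam)]).
probs = model.predict_proba(X[:1])
print(probs)        # two probabilities for the first email
print(probs.sum())  # each row sums to 1.0
```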
How do we go from these probabilities (like 0.85 for "Spam") to a definite prediction (the email is Spam)? We use a decision threshold.
The most common default threshold is 0.5. The rule is simple: if the probability of the positive class exceeds 0.5, predict that class; otherwise, predict the other class.

In our example, P(Spam) = 0.85 > 0.5, so the final prediction is "Spam".
If the model had instead output P(Spam) = 0.30 (and thus P(Not Spam) = 0.70), the prediction would be "Not Spam" because 0.30 ≤ 0.5.
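In code, this decision rule is a one-line comparison. The helper below (a hypothetical function; the probability values are illustrative, not produced by a real model) captures it:

```python
# Convert a spam probability into a discrete label using a threshold.
def predict_label(p_spam, threshold=0.5):
    return "Spam" if p_spam > threshold else "Not Spam"

print(predict_label(0.85))  # Spam     (0.85 > 0.5)
print(predict_label(0.30))  # Not Spam (0.30 <= 0.5)
```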
While 0.5 is a standard starting point, this threshold is not set in stone. Depending on the specific goals and the consequences of different types of errors (which we'll discuss soon with precision and recall), you might choose to adjust this threshold. For example, if incorrectly classifying a non-spam email as spam is very problematic, you might increase the threshold (e.g., to 0.9) to be more certain before classifying an email as "Spam".
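To see the effect of moving the threshold, here is a small sketch (the probabilities are invented) comparing the default threshold with a stricter one:

```python
import numpy as np

# Hypothetical spam probabilities for five emails.
p_spam = np.array([0.95, 0.85, 0.60, 0.30, 0.10])

# Default threshold: three emails are flagged as Spam.
print(p_spam > 0.5)  # [ True  True  True False False]

# Stricter threshold: only the most confident prediction is flagged.
print(p_spam > 0.9)  # [ True False False False False]
```

Raising the threshold trades spam caught for fewer legitimate emails flagged by mistake, which is exactly the kind of trade-off precision and recall will let us quantify.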
The core idea of evaluation is to compare the model's final predictions against the actual, known labels, often called the ground truth.
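As a tiny illustration with made-up labels, this comparison is just an element-wise check of predictions against the ground truth:

```python
# Invented predictions and ground-truth labels for five emails.
y_true = ["Spam", "Not Spam", "Spam", "Not Spam", "Spam"]
y_pred = ["Spam", "Not Spam", "Not Spam", "Not Spam", "Spam"]

# Element-wise comparison: the raw material for every metric that follows.
matches = [pred == true for pred, true in zip(y_pred, y_true)]
print(matches)  # [True, True, False, True, True]
```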
The metrics we will explore next, starting with accuracy, are all calculated by systematically comparing these predictions to the ground truth across many data points in our test set. Understanding that classification involves this step of converting probabilities to discrete labels via a threshold is fundamental to interpreting these metrics correctly. Now, let's look at the simplest way to measure performance: accuracy.