Before we dive into specific metrics like accuracy or precision, let's first make sure we understand what kind of output a classification model gives us and how we arrive at a final prediction. Recall from Chapter 1 that classification tasks involve assigning inputs to predefined categories or classes. For instance, identifying if an email is "Spam" or "Not Spam," or classifying an image as containing a "Cat," "Dog," or "Bird."
Most classification algorithms don't output a hard class label directly. Instead, they produce a score or a probability for each possible class. This probability represents the model's confidence that the input belongs to that class.
Consider a simple binary classification problem like spam detection. For a given email, the model might output something like:

P(Spam) = 0.85
P(Not Spam) = 0.15
Notice that these probabilities sum to 1.0 (0.85 + 0.15 = 1.0). This is typical for many classification models. For a multi-class problem (e.g., classifying handwritten digits 0 through 9), the model would output ten probabilities, one for each digit, and these would also sum to 1.0.
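To make this concrete, here is a minimal sketch using scikit-learn's LogisticRegression, whose predict_proba method returns one probability per class. The tiny dataset and feature values are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented toy data: two features per email, label 1 = Spam, 0 = Not Spam.
X = np.array([[0.1, 0.9], [0.8, 0.2], [0.9, 0.1], [0.2, 0.8]])
y = np.array([1, 0, 0, 1])

model = LogisticRegression().fit(X, y)

# predict_proba returns one probability per class, ordered by
# model.classes_ (here [0, 1], i.e. [P(Not Spam), P(Spam)]).
probs = model.predict_proba(X[:1])
print(probs)        # two probabilities for the first email
print(probs.sum())  # each row sums to 1.0
```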
How do we go from these probabilities (like 0.85 for "Spam") to a definite prediction (the email is Spam)? We use a decision threshold.
The most common default threshold is 0.5. The rule is simple: if the probability of the positive class exceeds 0.5, predict that class; otherwise, predict the other class.

In our example, P(Spam) = 0.85 > 0.5, so the final prediction is "Spam".
If the model had instead output P(Spam) = 0.30 (and thus P(Not Spam) = 0.70), the prediction would be "Not Spam" because 0.30 ≤ 0.5.
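In code, this decision rule is a one-line comparison. The helper below (a hypothetical function; the probability values are illustrative, not produced by a real model) captures it:

```python
# Convert a spam probability into a discrete label using a threshold.
def predict_label(p_spam, threshold=0.5):
    return "Spam" if p_spam > threshold else "Not Spam"

print(predict_label(0.85))  # Spam     (0.85 > 0.5)
print(predict_label(0.30))  # Not Spam (0.30 <= 0.5)
```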
While 0.5 is a standard starting point, this threshold is not set in stone. Depending on the specific goals and the consequences of different types of errors (which we'll discuss soon with precision and recall), you might choose to adjust this threshold. For example, if incorrectly classifying a non-spam email as spam is very problematic, you might increase the threshold (e.g., to 0.9) to be more certain before classifying an email as "Spam".
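To see the effect of moving the threshold, here is a small sketch (the probabilities are invented) comparing the default threshold with a stricter one:

```python
import numpy as np

# Hypothetical spam probabilities for five emails.
p_spam = np.array([0.95, 0.85, 0.60, 0.30, 0.10])

# Default threshold: three emails are flagged as Spam.
print(p_spam > 0.5)  # [ True  True  True False False]

# Stricter threshold: only the most confident prediction is flagged.
print(p_spam > 0.9)  # [ True False False False False]
```

Raising the threshold trades spam caught for fewer legitimate emails flagged by mistake, which is exactly the kind of trade-off precision and recall will let us quantify.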
The core idea of evaluation is to compare the model's final predictions against the actual, known labels, often called the ground truth.
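As a tiny illustration with made-up labels, this comparison is just an element-wise check of predictions against the ground truth:

```python
# Invented predictions and ground-truth labels for five emails.
y_true = ["Spam", "Not Spam", "Spam", "Not Spam", "Spam"]
y_pred = ["Spam", "Not Spam", "Not Spam", "Not Spam", "Spam"]

# Element-wise comparison: the raw material for every metric that follows.
matches = [pred == true for pred, true in zip(y_pred, y_true)]
print(matches)  # [True, True, False, True, True]
```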
The metrics we will explore next, starting with accuracy, are all calculated by systematically comparing these predictions to the ground truth across many data points in our test set. Understanding that classification involves this step of converting probabilities to discrete labels via a threshold is fundamental to interpreting these metrics correctly. Now, let's look at the simplest way to measure performance: accuracy.