Machine learning models, especially complex ones, often function as 'black boxes': their internal workings are not immediately obvious, yet understanding why they produce certain outputs (y) given specific inputs (X) is increasingly important for building trust, debugging errors, and ensuring fairness. This chapter establishes the fundamental concepts required to approach model interpretation.
We will begin by discussing the motivations for explaining model predictions and then clarify the distinction between interpretability and explainability. Following this, we will examine a classification of interpretation methods, considering whether they are built into the model (intrinsic) or applied afterward (post-hoc), and whether they depend on a specific model type (model-specific) or work with any model (model-agnostic). We will also differentiate between understanding a model's overall logic (global explanations) and explaining individual predictions (local explanations); a short sketch contrasting some of these categories appears below. Finally, we will touch on some common difficulties encountered when interpreting models.
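To make the taxonomy concrete before we define it formally, here is a minimal sketch contrasting an intrinsic, model-specific, global view (a linear model's coefficients) with a post-hoc, model-agnostic, global method (permutation importance). The use of scikit-learn and a synthetic dataset is an assumption for illustration; later chapters introduce the tools we will actually rely on.

```python
# Illustrative sketch only: compares two points in the interpretability taxonomy
# using scikit-learn on a synthetic regression dataset (assumed setup).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=200, n_features=4, random_state=0)

# Intrinsic, model-specific, global: a linear model's coefficients
# describe its overall logic directly.
linear = LinearRegression().fit(X, y)
print("Linear coefficients (intrinsic, global):", linear.coef_)

# Post-hoc, model-agnostic, global: permutation importance works on any
# fitted estimator by measuring the score drop when a feature is shuffled.
forest = RandomForestRegressor(random_state=0).fit(X, y)
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
print("Permutation importances (post-hoc, global):", result.importances_mean)
```

Both outputs summarize the model's behavior over the whole dataset, which is what we will call a global explanation; local methods, covered later, instead explain a single prediction.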
1.1 Why Explain Model Predictions?
1.2 Interpretability vs. Explainability
1.3 Taxonomy of Interpretability Methods
1.4 Scope of Explanations: Global vs. Local
1.5 Challenges in Model Interpretation