Real-world datasets are frequently incomplete. Machine learning algorithms typically cannot handle missing values directly, making data preparation a necessary first step. This chapter introduces techniques for managing missing data.
You will learn to identify missing entries in your data using pandas. We then examine the three standard mechanisms that produce missing data: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). From there the focus shifts to practical imputation, starting with simple approaches such as mean, median, and mode imputation. You will also learn to create indicator features that record which values were originally missing. For multivariate imputation, we cover two more advanced techniques from scikit-learn: the K-Nearest Neighbors (KNN) imputer and the iterative imputer. Finally, we compare these methods to help you choose an appropriate strategy for your situation. The chapter closes with a hands-on practical applying these techniques.
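As a quick preview of the workflow described above, here is a minimal sketch that counts missing values with pandas, applies median imputation with missing-value indicators, and then tries KNN-based imputation. The toy DataFrame and its column names (`age`, `income`) are made up for illustration, and the parameter choices (`strategy="median"`, `n_neighbors=2`) are assumptions rather than recommendations; later sections treat each step in detail.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50000.0, 62000.0, np.nan, 58000.0],
})

# Identify missing entries: number of NaN values in each column.
missing_counts = df.isna().sum()
print(missing_counts)

# Simple imputation: replace each NaN with its column median.
# add_indicator=True appends one binary column per feature that had
# missing values, marking which rows were originally missing.
simple = SimpleImputer(strategy="median", add_indicator=True)
simple_filled = simple.fit_transform(df)

# Multivariate imputation: fill each gap using the 2 most similar rows.
knn = KNNImputer(n_neighbors=2)
knn_filled = knn.fit_transform(df)

print(simple_filled)
print(knn_filled)
```

Both imputers return NumPy arrays rather than DataFrames by default. With `add_indicator=True`, the output of `SimpleImputer` has the imputed feature columns first, followed by the indicator columns.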
2.1 Identifying Missing Values
2.2 Mechanisms of Missing Data (MCAR, MAR, MNAR)
2.3 Simple Imputation Strategies: Mean, Median, Mode
2.4 Creating Missing Value Indicators
2.5 Multivariate Imputation: KNN Imputer
2.6 Multivariate Imputation: Iterative Imputer
2.7 Comparing Imputation Methods
2.8 Hands-on Practical: Imputing Missing Data
© 2025 ApX Machine Learning