Now that we've examined the theoretical underpinnings of handling missing data, let's apply these techniques using Python's data science stack. This practical section will guide you through implementing various imputation methods using Pandas and Scikit-learn on a sample dataset. Understanding how to apply these methods is fundamental to preparing data for machine learning models.
First, let's set up our environment by importing the necessary libraries and creating a sample DataFrame with missing values.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer, MissingIndicator
from sklearn.experimental import enable_iterative_imputer # Must be imported before IterativeImputer (experimental API)
from sklearn.impute import IterativeImputer
import matplotlib.pyplot as plt
import seaborn as sns
# Create a sample DataFrame with missing values
data = {
'Age': [25, 30, np.nan, 35, 40, 45, 50, np.nan, 55],
'Salary': [50000, 60000, 75000, np.nan, 80000, 90000, 110000, 65000, np.nan],
'Experience': [1, 5, 3, 10, 15, 20, np.nan, 8, 30],
'Department': ['HR', 'IT', 'Finance', 'IT', np.nan, 'HR', 'Finance', 'IT', 'Finance'],
'Rating': [3.5, 4.0, 4.5, 3.0, np.nan, 4.2, 3.8, 4.8, 3.9]
}
df = pd.DataFrame(data)
print("Original DataFrame with Missing Values:")
print(df)
print("\nMissing values per column:")
print(df.isnull().sum())
Our sample df contains missing values (np.nan) in both the numerical (Age, Salary, Experience, Rating) and categorical (Department) columns.
As discussed earlier, simple imputation involves replacing missing values using basic statistical measures. Scikit-learn's SimpleImputer is a convenient tool for this.
We typically use the mean for normally distributed data and the median for skewed data or data with outliers. Let's apply both to different columns for demonstration.
# Impute 'Age' with mean
mean_imputer = SimpleImputer(strategy='mean')
# Selecting with df[['Age']] keeps a DataFrame, giving the 2D input SimpleImputer expects
df['Age_mean_imputed'] = mean_imputer.fit_transform(df[['Age']])
# Impute 'Salary' with median (often better for salary data)
median_imputer = SimpleImputer(strategy='median')
df['Salary_median_imputed'] = median_imputer.fit_transform(df[['Salary']])
# Impute 'Experience' and 'Rating' together with median
num_cols_median = ['Experience', 'Rating']
median_imputer_multi = SimpleImputer(strategy='median')
# Fit on the original columns
median_imputer_multi.fit(df[num_cols_median])
# Transform and create new columns
df[['Experience_median_imputed', 'Rating_median_imputed']] = median_imputer_multi.transform(df[num_cols_median])
print("\nDataFrame after Mean/Median Imputation:")
print(df[['Age', 'Age_mean_imputed', 'Salary', 'Salary_median_imputed', 'Experience', 'Experience_median_imputed', 'Rating', 'Rating_median_imputed']].head())
Observe how NaN values in the original columns are replaced by the calculated mean or median in the corresponding new columns.
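As a quick check, the fill value learned by SimpleImputer can be compared against the statistic computed directly in Pandas, which skips NaN by default; the fitted imputer exposes its learned values through the statistics_ attribute. A small verification sketch:
# Sanity check: SimpleImputer's learned statistic should match the
# column mean computed by Pandas (which ignores NaN by default)
print(f"\nPandas mean of Age: {df['Age'].mean():.2f}")
print(f"SimpleImputer fill value: {mean_imputer.statistics_[0]:.2f}")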
For categorical features like Department, the most frequent value (the mode) is commonly used for imputation.
# Impute 'Department' with mode
mode_imputer = SimpleImputer(strategy='most_frequent')
df['Department_mode_imputed'] = mode_imputer.fit_transform(df[['Department']])
print("\nDataFrame after Mode Imputation:")
print(df[['Department', 'Department_mode_imputed']].head(6)) # Show row with original NaN
The missing department is filled with the most common department found in the column.
Sometimes, the fact that a value was missing is informative in itself. We can capture this using indicator features. SimpleImputer can generate these automatically, or we can use MissingIndicator directly.
# Using SimpleImputer with add_indicator=True
median_imputer_indicator = SimpleImputer(strategy='median', add_indicator=True)
imputed_with_indicator = median_imputer_indicator.fit_transform(df[['Salary']]) # Use original Salary
# The output is a NumPy array: column 0 is imputed data, column 1 is the indicator
df['Salary_median_imputed_si'] = imputed_with_indicator[:, 0]
df['Salary_missing_indicator_si'] = imputed_with_indicator[:, 1].astype(int) # Convert boolean to int
# Using MissingIndicator directly
original_cols = ['Age', 'Salary', 'Experience', 'Department', 'Rating']
indicator = MissingIndicator(features='all') # Generate an indicator for every feature
missing_indicators = indicator.fit_transform(df[original_cols])
# Convert to a DataFrame, naming each indicator after its source column
indicator_df = pd.DataFrame(missing_indicators, columns=[f'{col}_missing' for col in original_cols], index=df.index)
# Combine with original df (optional, for viewing)
df_with_indicators = pd.concat([df, indicator_df], axis=1)
print("\nDataFrame with Salary Imputation and Indicator (from SimpleImputer):")
print(df[['Salary', 'Salary_median_imputed_si', 'Salary_missing_indicator_si']].head())
print("\nDataFrame showing all generated Missing Indicators (from MissingIndicator):")
print(df_with_indicators[['Age', 'Age_missing', 'Salary', 'Salary_missing', 'Experience', 'Experience_missing', 'Department', 'Department_missing', 'Rating', 'Rating_missing']].head(6))
These binary indicator columns explicitly signal where data was originally missing, which might be useful for certain models.
Multivariate methods use information from other features to estimate missing values, potentially leading to more accurate imputations than simple strategies.
KNNImputer fills missing values using the average value from the k nearest neighbors found in the training set. Neighbors are identified based on the features that are not missing. This requires all features used for imputation to be numerical, so we would first need to encode the Department column (e.g., with one-hot encoding, covered in the next chapter) or exclude it. For simplicity here, let's impute only the numerical features together.
from sklearn.preprocessing import MinMaxScaler
# KNNImputer is sensitive to feature scaling, so scale first
numerical_cols = ['Age', 'Salary', 'Experience', 'Rating']
df_numerical = df[numerical_cols].copy()
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_numerical), columns=numerical_cols)
# Apply KNNImputer
knn_imputer = KNNImputer(n_neighbors=3) # Use 3 neighbors
df_knn_imputed_scaled = pd.DataFrame(knn_imputer.fit_transform(df_scaled), columns=numerical_cols)
# Inverse transform to get data back to original scale
df_knn_imputed = pd.DataFrame(scaler.inverse_transform(df_knn_imputed_scaled), columns=numerical_cols)
# Add imputed columns back to original df for comparison (optional)
for col in numerical_cols:
    df[f'{col}_knn_imputed'] = df_knn_imputed[col]
print("\nDataFrame after KNN Imputation (showing original and imputed side-by-side):")
# Display rows where original data was missing to see the imputed values
missing_rows_idx = df[df[numerical_cols].isnull().any(axis=1)].index
print(df.loc[missing_rows_idx, ['Age', 'Age_knn_imputed', 'Salary', 'Salary_knn_imputed', 'Experience', 'Experience_knn_imputed', 'Rating', 'Rating_knn_imputed']])
Note that KNN imputation requires careful consideration of the number of neighbors (k) and the distance metric used, and scaling features beforehand, as we did here, is generally recommended.
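To get a feel for that sensitivity, here is a small sketch (reusing df_scaled, scaler, and missing_rows_idx from above) that varies k and switches to distance-weighted averaging; the exact values printed depend on the sample data and the parameters chosen.
# A small sketch: vary n_neighbors and use distance weighting to see
# how the imputed values shift with the hyperparameters
for k in (2, 3, 5):
    imputer_k = KNNImputer(n_neighbors=k, weights='distance')
    filled_k = pd.DataFrame(
        scaler.inverse_transform(imputer_k.fit_transform(df_scaled)),
        columns=numerical_cols
    )
    print(f"\nk={k}, distance-weighted imputed values:")
    print(filled_k.loc[missing_rows_idx].round(1))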
IterativeImputer models each feature with missing values as a function of the other features and uses an iterative approach to estimate them: it cycles through the features, predicting each one's missing values from all the others, until the estimates stabilize.
# IterativeImputer also works better with scaled data usually
# We can reuse the scaled data from the KNN example
iterative_imputer = IterativeImputer(max_iter=10, random_state=0) # max_iter controls iterations
df_iterative_imputed_scaled = pd.DataFrame(iterative_imputer.fit_transform(df_scaled), columns=numerical_cols)
# Inverse transform
df_iterative_imputed = pd.DataFrame(scaler.inverse_transform(df_iterative_imputed_scaled), columns=numerical_cols)
# Add imputed columns back to original df
for col in numerical_cols:
    df[f'{col}_iterative_imputed'] = df_iterative_imputed[col]
print("\nDataFrame after Iterative Imputation (showing original and imputed side-by-side):")
print(df.loc[missing_rows_idx, ['Age', 'Age_iterative_imputed', 'Salary', 'Salary_iterative_imputed', 'Experience', 'Experience_iterative_imputed', 'Rating', 'Rating_iterative_imputed']])
IterativeImputer is more sophisticated and can yield better estimates, but it is typically more computationally intensive than KNNImputer.
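Part of that flexibility comes from the estimator parameter: by default IterativeImputer uses a BayesianRidge regressor, but any scikit-learn regressor can be substituted. A minimal sketch reusing df_scaled from above (the choice of RandomForestRegressor here is illustrative, not a recommendation):
from sklearn.ensemble import RandomForestRegressor
# Swap in a tree ensemble as the per-feature regressor; this can capture
# nonlinear relationships but makes each imputation round more expensive
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0
)
df_rf_imputed_scaled = pd.DataFrame(rf_imputer.fit_transform(df_scaled), columns=numerical_cols)
print("\nScaled values imputed with a RandomForest-based IterativeImputer:")
print(df_rf_imputed_scaled.loc[missing_rows_idx].round(2))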
The choice of imputation method depends on the data characteristics, the mechanism of missingness (if known), and the specific requirements of the machine learning model.
Let's visualize the distribution of the 'Salary' feature before and after different imputations to see the impact.
# Prepare data for plotting
salary_data = pd.DataFrame({
'Original': df['Salary'],
'Median Imputed': df['Salary_median_imputed'],
'KNN Imputed': df['Salary_knn_imputed'],
'Iterative Imputed': df['Salary_iterative_imputed']
})
# Melt the DataFrame for Seaborn plotting
salary_melted = salary_data.melt(var_name='Imputation Method', value_name='Salary')
# Create the plot
plt.figure(figsize=(12, 6))
sns.kdeplot(data=salary_melted, x='Salary', hue='Imputation Method', fill=True, common_norm=False, palette="viridis")
plt.title('Distribution of Salary After Different Imputation Methods')
plt.xlabel('Salary')
plt.ylabel('Density')
plt.show()
Comparison of Salary distributions (kernel density estimates) after different imputation methods. The original curve excludes the missing values. Median imputation concentrates additional mass at the median value, while KNN and iterative imputation provide potentially more nuanced estimates based on the other features.
The best approach often involves experimentation and evaluating the impact on downstream model performance. Consider the trade-offs between imputation accuracy, computational cost, and the potential distortions introduced into your dataset. Remember to fit imputers only on the training data and use the fitted imputer to transform both training and testing datasets to prevent data leakage. This is often best managed using Scikit-learn Pipelines.
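Here is a minimal sketch of that leakage-safe workflow, using only the original columns of our sample df; in practice this preprocessing step would sit inside a full modeling Pipeline alongside an estimator.
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
# Split first, then learn imputation statistics from the training portion only
feature_cols = ['Age', 'Salary', 'Experience', 'Rating', 'Department']
X_train, X_test = train_test_split(df[feature_cols], test_size=0.3, random_state=0)
preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), ['Age', 'Salary', 'Experience', 'Rating']),
    ('cat', SimpleImputer(strategy='most_frequent'), ['Department'])
])
# Medians and modes are computed from X_train and merely applied to X_test,
# so no information from the test set leaks into the imputation
X_train_imputed = preprocessor.fit_transform(X_train)
X_test_imputed = preprocessor.transform(X_test)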