Alright, let's put the concepts from this chapter into practice. We've explored advanced visualizations and discussed the ideas behind feature engineering, scaling, and encoding. Now, we'll apply these techniques to a sample dataset, demonstrating how insights from prior analysis steps (like those covered in Chapters 3 and 4) can directly inform the creation of new features and how to prepare data for potential modeling. We'll conclude by outlining how to effectively summarize your EDA findings.
First, ensure you have the necessary libraries imported. We'll primarily use Pandas for data manipulation and Scikit-learn for transformations.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.model_selection import train_test_split # Often used in conjunction, though not strictly EDA
# Let's create a sample DataFrame to work with
# Assume this DataFrame is the result of loading and initial cleaning (Chapter 2)
data = {
'Age': [25, 45, 30, 55, 22, 38, 60, 29, 41, 50],
'Salary': [50000, 80000, 60000, 110000, 45000, 75000, 120000, 58000, 78000, 95000],
'Department': ['HR', 'IT', 'Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT'],
'Experience': [2, 20, 5, 30, 1, 15, 35, 4, 18, 25],
'JoinDate': pd.to_datetime(['2021-03-15', '2003-07-20', '2018-11-01', '1993-05-10', '2022-01-30', '2008-09-12', '1988-02-28', '2019-06-05', '2005-10-22', '1998-04-18'])
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df.head())
print("\nDataFrame Info:")
df.info()
Our previous analysis (univariate and bivariate) might have suggested certain relationships or characteristics worth capturing explicitly as new features.
1. Interaction Features
If scatter plots or correlation analysis (Chapter 4) hinted that the combined effect of two variables is significant, we can create an interaction term. For instance, perhaps 'Salary' grows faster for employees who are both older and more experienced. A simple interaction term is the product of 'Age' and 'Experience'.
df['Age_Experience_Interaction'] = df['Age'] * df['Experience']
print("\nDataFrame with Age-Experience Interaction:")
print(df[['Age', 'Experience', 'Age_Experience_Interaction']].head())
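Products are not the only way to combine variables. A ratio can capture a rate rather than a combined magnitude; as a quick illustrative sketch (the 'Salary_per_Experience' column name is our own choice for this example), consider salary earned per year of experience:
# A hypothetical ratio feature: compensation per year of experience
df['Salary_per_Experience'] = df['Salary'] / df['Experience']
print("\nSalary per year of experience:")
print(df[['Salary', 'Experience', 'Salary_per_Experience']].head())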
2. Polynomial Features
If visualizations like scatter plots showed a curved relationship between a feature and a target (or another feature), polynomial features might help capture this non-linearity. Let's create squared terms for 'Age' and 'Experience'. While Scikit-learn's PolynomialFeatures is powerful, we can create simple terms directly with Pandas.
df['Age_Squared'] = df['Age']**2
df['Experience_Squared'] = df['Experience']**2
print("\nDataFrame with Squared Features:")
print(df[['Age', 'Age_Squared', 'Experience', 'Experience_Squared']].head())
Alternatively, Scikit-learn's PolynomialFeatures is useful for generating combinations and higher degrees systematically.
# Example using PolynomialFeatures (optional, often used in modeling pipelines)
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
# Select numerical columns to transform
numerical_cols = ['Age', 'Experience']
poly_features = poly.fit_transform(df[numerical_cols])
# Get feature names for new polynomial features
poly_feature_names = poly.get_feature_names_out(numerical_cols)
# Create a DataFrame with these new features
df_poly = pd.DataFrame(poly_features, columns=poly_feature_names, index=df.index)
# You could merge this back; be careful about duplicate columns (the original Age and Experience)
# df = pd.concat([df, df_poly.drop(columns=numerical_cols)], axis=1) # Example merge strategy, shown below
print("\nPolynomial Features generated by Scikit-learn (degree 2):")
print(df_poly.head())
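To make the merge comment above concrete, here is one way to attach only the genuinely new columns. This is a sketch that keeps the result in a separate df_merged so our working DataFrame is unchanged:
# Drop the original columns from df_poly before concatenating to avoid duplicates
df_merged = pd.concat([df, df_poly.drop(columns=numerical_cols)], axis=1)
print("\nColumns after merging polynomial features:")
print(df_merged.columns.tolist())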
3. Binning Numerical Data
Histograms (Chapter 3) might show distinct groups within a numerical feature. Binning 'Age' into categories like 'Young', 'Mid-career', 'Senior' can sometimes be more informative or work better with certain models.
# Define age bins and labels
age_bins = [0, 30, 50, df['Age'].max()] # Bins: (0, 30], (30, 50], (50, max]
age_labels = ['Young', 'Mid-career', 'Senior']
df['Age_Group'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=True)
print("\nDataFrame with Age Groups:")
print(df[['Age', 'Age_Group']].head())
# Check the counts in each new category
print("\nCounts per Age Group:")
print(df['Age_Group'].value_counts())
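Note that pd.cut uses the fixed boundaries we supplied. If you would rather have bins containing roughly equal numbers of observations, pandas also offers quantile-based binning via pd.qcut. A minimal sketch, kept in a standalone variable so it doesn't alter our working DataFrame:
# Quantile-based binning: three bins with roughly equal row counts
age_group_q = pd.qcut(df['Age'], q=3, labels=['Young', 'Mid-career', 'Senior'])
print("\nCounts per quantile-based age group:")
print(age_group_q.value_counts())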
4. Extracting Information from Datetime Features
Datetime columns often contain valuable information that isn't immediately usable in its raw format. We can extract the year, month, day of the week, etc.
df['Join_Year'] = df['JoinDate'].dt.year
df['Join_Month'] = df['JoinDate'].dt.month
df['Join_DayOfWeek'] = df['JoinDate'].dt.dayofweek # Monday=0, Sunday=6
print("\nDataFrame with Extracted Date Features:")
print(df[['JoinDate', 'Join_Year', 'Join_Month', 'Join_DayOfWeek']].head())
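Elapsed time is another feature commonly derived from dates. As an illustrative sketch (the reference date and the 'Tenure_Years' name are assumptions made for this example), we can compute approximate tenure in years:
# Approximate tenure in years, relative to an assumed reference date
reference_date = pd.Timestamp('2025-01-01')
df['Tenure_Years'] = (reference_date - df['JoinDate']).dt.days / 365.25
print("\nDataFrame with Tenure Feature:")
print(df[['JoinDate', 'Tenure_Years']].head())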
After creating features, or sometimes as part of preparing existing ones, we often need to transform them.
1. Scaling Numerical Features
Many machine learning algorithms perform better when numerical features are on a similar scale. StandardScaler standardizes features to have zero mean and unit variance, $z = (x - \mu)/\sigma$, while MinMaxScaler scales features to a fixed range, typically [0, 1]: $x_{\text{scaled}} = \dfrac{x - \min(x)}{\max(x) - \min(x)}$.
Let's apply StandardScaler to 'Salary' and 'Age_Experience_Interaction'.
scaler_std = StandardScaler()
# Select columns to scale
cols_to_scale = ['Salary', 'Age_Experience_Interaction']
# Fit and transform the data
# Note: In practice, fit on training data, transform train and test data
# Build the new column names, then assign the scaled values
scaled_cols = [col + '_StdScaled' for col in cols_to_scale]
df[scaled_cols] = scaler_std.fit_transform(df[cols_to_scale])
print("\nDataFrame with Standard Scaled Features:")
print(df[['Salary', 'Salary_StdScaled', 'Age_Experience_Interaction', 'Age_Experience_Interaction_StdScaled']].head())
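To make the fit/transform note in the code comment concrete, here is a minimal sketch using the train_test_split imported earlier. The split parameters are arbitrary; the point is that the scaler learns its mean and standard deviation from the training rows only:
# Fit the scaler on the training split only, then apply it to both splits
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(train_df[cols_to_scale])
train_scaled = scaler.transform(train_df[cols_to_scale])
test_scaled = scaler.transform(test_df[cols_to_scale]) # uses the training statistics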
Now, let's apply MinMaxScaler to 'Experience'.
scaler_minmax = MinMaxScaler()
df['Experience_MinMaxScaled'] = scaler_minmax.fit_transform(df[['Experience']])
print("\nDataFrame with MinMax Scaled Feature:")
print(df[['Experience', 'Experience_MinMaxScaled']].head())
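As a quick sanity check that the scaler implements the formula given above, we can recompute the min-max transform by hand and compare:
# Manual min-max computation should match the scaler's output
manual = (df['Experience'] - df['Experience'].min()) / (df['Experience'].max() - df['Experience'].min())
print(np.allclose(df['Experience_MinMaxScaled'], manual)) # expect True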
2. Encoding Categorical Features
Most machine learning models require numerical input, so we need to convert categorical features like 'Department' and our newly created 'Age_Group' into a numerical format. One-Hot Encoding is a common technique that creates a new binary (0 or 1) column for each category.
# Using Pandas get_dummies (simpler for direct DataFrame manipulation)
df = pd.get_dummies(df, columns=['Department', 'Age_Group'], prefix=['Dept', 'AgeGrp'], drop_first=False)
# drop_first=True can be used to avoid multicollinearity if needed by the model
print("\nDataFrame after One-Hot Encoding:")
# Display relevant columns - original are dropped by get_dummies
print(df.filter(regex='Dept_|AgeGrp_').head())
print("\nFinal DataFrame columns:")
print(df.columns)
Note: While pd.get_dummies is convenient during EDA, Scikit-learn's OneHotEncoder is often preferred in machine learning pipelines, especially when dealing with training and testing splits, as it can handle categories seen only in the test set (if configured) and integrates smoothly with other Scikit-learn transformers.
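As a brief sketch of that Scikit-learn approach (the department values here are illustrative, including a 'Finance' category unseen during fitting; sparse_output assumes scikit-learn 1.2+, use sparse=False on older versions):
# Fit on the known categories; handle_unknown='ignore' encodes unseen ones as all zeros
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(pd.DataFrame({'Department': ['HR', 'IT', 'Sales']}))
print(encoder.transform(pd.DataFrame({'Department': ['IT', 'Finance']})))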
The final step of EDA isn't simply stopping once the analysis is done; it's synthesizing and communicating your discoveries. A good EDA summary provides a clear overview of the data's characteristics, quality, relationships found, and any features created.
Structure of an EDA Summary:
- Data overview: source, size, and what each column represents.
- Data quality: missing values, outliers, and any cleaning steps applied.
- Key relationships: the most important univariate and bivariate findings, with supporting visuals.
- Engineered features: what was created, why, and how it was transformed (scaled, encoded).
Key Principles for Summarizing:
- Lead with the findings most relevant to your audience.
- Support each claim with a specific statistic or visualization.
- Be explicit about limitations and open questions.
- Keep it concise, and point to the full analysis for details.
This practical exercise demonstrated how the exploratory cycle continues. Insights lead to feature creation, which might prompt further analysis or transformations, culminating in a structured summary that captures the essence of the dataset and prepares the ground for subsequent modeling or decision-making.