Alright, let's put the concepts from this chapter into practice. We've explored advanced visualizations and discussed the ideas behind feature engineering, scaling, and encoding. Now, we'll apply these techniques to a sample dataset, demonstrating how insights from prior analysis steps (like those covered in Chapters 3 and 4) can directly inform the creation of new features and how to prepare data for potential modeling. We'll conclude by outlining how to effectively summarize your EDA findings.
First, ensure you have the necessary libraries imported. We'll primarily use Pandas for data manipulation and Scikit-learn for transformations.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.model_selection import train_test_split # Often used in conjunction, though not strictly EDA
# Let's create a sample DataFrame to work with
# Assume this DataFrame is the result of loading and initial cleaning (Chapter 2)
data = {
'Age': [25, 45, 30, 55, 22, 38, 60, 29, 41, 50],
'Salary': [50000, 80000, 60000, 110000, 45000, 75000, 120000, 58000, 78000, 95000],
'Department': ['HR', 'IT', 'Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT'],
'Experience': [2, 20, 5, 30, 1, 15, 35, 4, 18, 25],
'JoinDate': pd.to_datetime(['2021-03-15', '2003-07-20', '2018-11-01', '1993-05-10', '2022-01-30', '2008-09-12', '1988-02-28', '2019-06-05', '2005-10-22', '1998-04-18'])
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df.head())
print("\nDataFrame Info:")
df.info()
Our previous analysis (univariate and bivariate) might have suggested certain relationships or characteristics worth capturing explicitly as new features.
1. Interaction Features
If scatter plots or correlation analysis (Chapter 4) hinted that the combined effect of two variables is significant, we can create an interaction term. For instance, perhaps 'Salary' grows faster for employees who are both older and more experienced. A simple interaction term is the product of 'Age' and 'Experience'.
df['Age_Experience_Interaction'] = df['Age'] * df['Experience']
print("\nDataFrame with Age-Experience Interaction:")
print(df[['Age', 'Experience', 'Age_Experience_Interaction']].head())
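Products are not the only way to combine variables. A ratio can capture a rate rather than a combined magnitude; as a quick illustrative sketch (the 'Salary_per_Experience' column name is our own choice for this example), consider salary earned per year of experience:
# A hypothetical ratio feature: compensation per year of experience
df['Salary_per_Experience'] = df['Salary'] / df['Experience']
print("\nSalary per year of experience:")
print(df[['Salary', 'Experience', 'Salary_per_Experience']].head())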
2. Polynomial Features
If visualizations like scatter plots showed a curved relationship between a feature and a target (or another feature), polynomial features might help capture this non-linearity. Let's create squared terms for 'Age' and 'Experience'. While Scikit-learn's PolynomialFeatures is powerful, we can create simple terms directly with Pandas.
df['Age_Squared'] = df['Age']**2
df['Experience_Squared'] = df['Experience']**2
print("\nDataFrame with Squared Features:")
print(df[['Age', 'Age_Squared', 'Experience', 'Experience_Squared']].head())
Alternatively, Scikit-learn's PolynomialFeatures is useful for generating combinations and higher degrees systematically.
# Example using PolynomialFeatures (optional, often used in modeling pipelines)
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
# Select numerical columns to transform
numerical_cols = ['Age', 'Experience']
poly_features = poly.fit_transform(df[numerical_cols])
# Get feature names for new polynomial features
poly_feature_names = poly.get_feature_names_out(numerical_cols)
# Create a DataFrame with these new features
df_poly = pd.DataFrame(poly_features, columns=poly_feature_names, index=df.index)
# You could merge this back; be careful about duplicate columns (the original Age and Experience)
# df = pd.concat([df, df_poly.drop(columns=numerical_cols)], axis=1) # Example merge strategy, shown below
print("\nPolynomial Features generated by Scikit-learn (degree 2):")
print(df_poly.head())
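To make the merge comment above concrete, here is one way to attach only the genuinely new columns. This is a sketch that keeps the result in a separate df_merged so our working DataFrame is unchanged:
# Drop the original columns from df_poly before concatenating to avoid duplicates
df_merged = pd.concat([df, df_poly.drop(columns=numerical_cols)], axis=1)
print("\nColumns after merging polynomial features:")
print(df_merged.columns.tolist())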
3. Binning Numerical Data
Histograms (Chapter 3) might show distinct groups within a numerical feature. Binning 'Age' into categories like 'Young', 'Mid-career', 'Senior' can sometimes be more informative or work better with certain models.
# Define age bins and labels
age_bins = [0, 30, 50, df['Age'].max()] # Bins: (0, 30], (30, 50], (50, max]
age_labels = ['Young', 'Mid-career', 'Senior']
df['Age_Group'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=True)
print("\nDataFrame with Age Groups:")
print(df[['Age', 'Age_Group']].head())
# Check the counts in each new category
print("\nCounts per Age Group:")
print(df['Age_Group'].value_counts())
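Note that pd.cut uses the fixed boundaries we supplied. If you would rather have bins containing roughly equal numbers of observations, pandas also offers quantile-based binning via pd.qcut. A minimal sketch, kept in a standalone variable so it doesn't alter our working DataFrame:
# Quantile-based binning: three bins with roughly equal row counts
age_group_q = pd.qcut(df['Age'], q=3, labels=['Young', 'Mid-career', 'Senior'])
print("\nCounts per quantile-based age group:")
print(age_group_q.value_counts())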
4. Extracting Information from Datetime Features
Datetime columns often contain valuable information that isn't immediately usable in its raw format. We can extract the year, month, day of the week, etc.
df['Join_Year'] = df['JoinDate'].dt.year
df['Join_Month'] = df['JoinDate'].dt.month
df['Join_DayOfWeek'] = df['JoinDate'].dt.dayofweek # Monday=0, Sunday=6
print("\nDataFrame with Extracted Date Features:")
print(df[['JoinDate', 'Join_Year', 'Join_Month', 'Join_DayOfWeek']].head())
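Elapsed time is another feature commonly derived from dates. As an illustrative sketch (the reference date and the 'Tenure_Years' name are assumptions made for this example), we can compute approximate tenure in years:
# Approximate tenure in years, relative to an assumed reference date
reference_date = pd.Timestamp('2025-01-01')
df['Tenure_Years'] = (reference_date - df['JoinDate']).dt.days / 365.25
print("\nDataFrame with Tenure Feature:")
print(df[['JoinDate', 'Tenure_Years']].head())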
After creating features, or sometimes as part of preparing existing ones, we often need to transform them.
1. Scaling Numerical Features
Many machine learning algorithms perform better when numerical features are on a similar scale. StandardScaler standardizes features to have zero mean and unit variance, $z = (x - \mu)/\sigma$, while MinMaxScaler scales features to a fixed range, typically [0, 1]: $x_{\text{scaled}} = \dfrac{x - \min(x)}{\max(x) - \min(x)}$.
Let's apply StandardScaler to 'Salary' and 'Age_Experience_Interaction'.
scaler_std = StandardScaler()
# Select columns to scale
cols_to_scale = ['Salary', 'Age_Experience_Interaction']
# Fit and transform the data
# Note: In practice, fit on training data, transform train and test data
# Build the new column names, then assign the scaled values
scaled_cols = [col + '_StdScaled' for col in cols_to_scale]
df[scaled_cols] = scaler_std.fit_transform(df[cols_to_scale])
print("\nDataFrame with Standard Scaled Features:")
print(df[['Salary', 'Salary_StdScaled', 'Age_Experience_Interaction', 'Age_Experience_Interaction_StdScaled']].head())
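To make the fit/transform note in the code comment concrete, here is a minimal sketch using the train_test_split imported earlier. The split parameters are arbitrary; the point is that the scaler learns its mean and standard deviation from the training rows only:
# Fit the scaler on the training split only, then apply it to both splits
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(train_df[cols_to_scale])
train_scaled = scaler.transform(train_df[cols_to_scale])
test_scaled = scaler.transform(test_df[cols_to_scale]) # uses the training statistics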
Now, let's apply MinMaxScaler to 'Experience'.
scaler_minmax = MinMaxScaler()
df['Experience_MinMaxScaled'] = scaler_minmax.fit_transform(df[['Experience']])
print("\nDataFrame with MinMax Scaled Feature:")
print(df[['Experience', 'Experience_MinMaxScaled']].head())
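As a quick sanity check that the scaler implements the formula given above, we can recompute the min-max transform by hand and compare:
# Manual min-max computation should match the scaler's output
manual = (df['Experience'] - df['Experience'].min()) / (df['Experience'].max() - df['Experience'].min())
print(np.allclose(df['Experience_MinMaxScaled'], manual)) # expect True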
2. Encoding Categorical Features
Most machine learning models require numerical input, so we need to convert categorical features like 'Department' and our newly created 'Age_Group' into a numerical format. One-Hot Encoding is a common technique that creates a new binary (0 or 1) column for each category.
# Using Pandas get_dummies (simpler for direct DataFrame manipulation)
df = pd.get_dummies(df, columns=['Department', 'Age_Group'], prefix=['Dept', 'AgeGrp'], drop_first=False)
# drop_first=True can be used to avoid multicollinearity if needed by the model
print("\nDataFrame after One-Hot Encoding:")
# Display relevant columns - original are dropped by get_dummies
print(df.filter(regex='Dept_|AgeGrp_').head())
print("\nFinal DataFrame columns:")
print(df.columns)
Note: While pd.get_dummies is convenient during EDA, Scikit-learn's OneHotEncoder is often preferred in machine learning pipelines, especially when dealing with training and testing splits, as it can handle categories seen only in the test set (if configured) and integrates smoothly with other Scikit-learn transformers.
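As a brief sketch of that Scikit-learn approach (the department values here are illustrative, including a 'Finance' category unseen during fitting; sparse_output assumes scikit-learn 1.2+, use sparse=False on older versions):
# Fit on the known categories; handle_unknown='ignore' encodes unseen ones as all zeros
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(pd.DataFrame({'Department': ['HR', 'IT', 'Sales']}))
print(encoder.transform(pd.DataFrame({'Department': ['IT', 'Finance']})))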
The final step of EDA isn't simply stopping once the analysis is done; it's synthesizing and communicating your discoveries. A good EDA summary provides a clear overview of the data's characteristics, quality, relationships found, and any features created.
Structure of an EDA Summary:
- Data overview: source, size, and what each column represents.
- Data quality: missing values, outliers, and any cleaning steps applied.
- Key relationships: the most important univariate and bivariate findings, with supporting visuals.
- Engineered features: what was created, why, and how it was transformed (scaled, encoded).
Key Principles for Summarizing:
- Lead with the findings most relevant to your audience.
- Support each claim with a specific statistic or visualization.
- Be explicit about limitations and open questions.
- Keep it concise, and point to the full analysis for details.
This practical exercise demonstrated how the exploratory cycle continues. Insights lead to feature creation, which might prompt further analysis or transformations, culminating in a structured summary that captures the essence of the dataset and prepares the ground for subsequent modeling or decision-making.