Machine learning algorithms operate on mathematical principles and typically require numerical input. As highlighted earlier, raw datasets frequently contain categorical features, representing qualitative attributes like 'color', 'city', 'product category', or 'customer segment'. These features, often represented as text strings or distinct identifiers, cannot be directly fed into most Scikit-learn estimators. Therefore, we need techniques to convert these categorical descriptions into a numerical format that algorithms can understand. This process is known as categorical feature encoding.
The goal is to transform non-numerical categories into numerical values without losing significant information or, importantly, introducing misleading relationships between categories. Scikit-learn provides several transformers within its `preprocessing` module to handle this task. We will focus on two widely used strategies: One-Hot Encoding and Ordinal Encoding.
One-Hot Encoding is a common technique, particularly suitable for nominal categorical features, where categories have no inherent order or ranking (e.g., 'Red', 'Green', 'Blue'). It works by creating a new binary (0 or 1) feature for each unique category in the original feature. For a given observation, the column corresponding to its category will have a value of 1, while all other new columns for that original feature will have a value of 0.
Consider a feature named 'Color' with three possible categories: 'Red', 'Green', and 'Blue'. One-Hot Encoding would transform this single feature into three new features: 'Color_Red', 'Color_Green', and 'Color_Blue'.
Transformation of a 'Color' feature using One-Hot Encoding. Each category becomes a new binary column.
Advantages:

- Imposes no artificial order: each category is treated independently, which is exactly what nominal data requires.
- Works with virtually any algorithm, including linear models and distance-based methods that would misinterpret integer codes as magnitudes.

Disadvantages:

- The number of columns grows with the number of unique categories, so high-cardinality features can dramatically inflate dimensionality.
- The new columns are mostly zeros and are linearly dependent (any one column is determined by the others), which can introduce multicollinearity in linear models unless one category is dropped.
In Scikit-learn, `sklearn.preprocessing.OneHotEncoder` is the primary tool for this transformation. It is designed to work within Scikit-learn pipelines and offers parameters to handle categories not seen during training (`handle_unknown='ignore'`) and to control the output format (`sparse_output=True` by default, which is memory-efficient for high-dimensional sparse data).
Ordinal Encoding assigns a unique integer to each category. For example, if a feature 'Size' has categories ['Small', 'Medium', 'Large'], Ordinal Encoding might map them to [0, 1, 2].
Advantages:

- Produces a single numerical column, keeping dimensionality low no matter how many categories the feature has.
- Preserves a meaningful ranking when the feature is genuinely ordinal (e.g., 'Small' < 'Medium' < 'Large').

Disadvantages:

- Imposes an order and equal spacing between categories; applied to nominal data, this fabricates relationships that do not exist.
- Algorithms may treat the integer codes as magnitudes, which can mislead distance-based and linear models.
Scikit-learn provides `sklearn.preprocessing.OrdinalEncoder`. You can explicitly define the order of categories using the `categories` parameter; otherwise, it sorts the unique values found during fitting (alphabetically for strings), which may not match the semantic order you intend.
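The default sorting can silently produce the wrong ranking. A minimal sketch of the pitfall and its fix, using the same 'Size' categories as below:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'Size': ['Small', 'Medium', 'Large']})

# Default: unique values are sorted alphabetically,
# so Large=0, Medium=1, Small=2 -- the reverse of the intended order
default_enc = OrdinalEncoder()
print(default_enc.fit_transform(sizes).ravel())  # [2. 1. 0.]

# Passing an explicit order restores the intended ranking:
# Small=0, Medium=1, Large=2
ordered_enc = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
print(ordered_enc.fit_transform(sizes).ravel())  # [0. 1. 2.]
```

Always spell out `categories` for ordinal features; relying on alphabetical order only works by coincidence.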
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Sample data
data = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red'],
    'Size': ['Medium', 'Large', 'Small', 'Medium', 'Large']
})

# --- One-Hot Encoding Example ---
# Select categorical columns for One-Hot Encoding
ohe_cols = ['Color']

# Initialize encoder
# sparse_output=False gives a dense numpy array, easier to view here
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit and transform
ohe_transformed = ohe.fit_transform(data[ohe_cols])

# Get feature names for the new columns
ohe_feature_names = ohe.get_feature_names_out(ohe_cols)

# Create a DataFrame with the new features
ohe_df = pd.DataFrame(ohe_transformed, columns=ohe_feature_names, index=data.index)

print("Original Data:\n", data)
print("\nOne-Hot Encoded 'Color':\n", ohe_df)

# --- Ordinal Encoding Example ---
# Select categorical column for Ordinal Encoding
ord_cols = ['Size']

# Define the desired order for 'Size'
size_categories = ['Small', 'Medium', 'Large']

# Initialize encoder with specified categories
ordinal_encoder = OrdinalEncoder(categories=[size_categories])  # Note: categories expects a list of lists

# Fit and transform
ord_transformed = ordinal_encoder.fit_transform(data[ord_cols])

# Create a DataFrame (optional, often used directly as numpy array)
ord_df = pd.DataFrame(ord_transformed, columns=['Size_Encoded'], index=data.index)

print("\nOrdinal Encoded 'Size':\n", ord_df)

# Combine results (excluding original categorical columns)
final_df = pd.concat([data[['ID']], ohe_df, ord_df], axis=1)
print("\nCombined DataFrame:\n", final_df)
```
While Pandas offers a convenient `pd.get_dummies` function for creating dummy/indicator variables (similar to One-Hot Encoding), using Scikit-learn's `OneHotEncoder` within a `Pipeline` (covered in Chapter 6) is generally preferred in a machine learning workflow. This ensures that the same categories identified during training are used consistently during testing or prediction, preventing errors caused by unseen categories or differing column sets.
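A minimal sketch of this pattern, combining both encoders in a `ColumnTransformer` inside a `Pipeline` (the `LogisticRegression` model and the tiny target vector here are placeholders for illustration, not part of the running example above):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red'],
    'Size': ['Medium', 'Large', 'Small', 'Medium', 'Large'],
})
y = [0, 1, 0, 1, 0]  # placeholder target

# Both encoders are fitted inside the pipeline, so the categories
# learned from training data are reused consistently at prediction time
preprocess = ColumnTransformer([
    ('ohe', OneHotEncoder(handle_unknown='ignore'), ['Color']),
    ('ord', OrdinalEncoder(categories=[['Small', 'Medium', 'Large']]), ['Size']),
])
model = Pipeline([('prep', preprocess), ('clf', LogisticRegression())])
model.fit(X, y)

# An unseen category ('Purple') is safely encoded as all zeros at predict time
new = pd.DataFrame({'Color': ['Purple'], 'Size': ['Small']})
print(model.predict(new))
```

Because the encoders live inside the pipeline, calling `model.predict` on new data applies exactly the transformation learned during `fit`, with no risk of mismatched columns.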
Encoding categorical data is a fundamental preprocessing step. The choice of method depends on the nature of the data and the requirements of the machine learning algorithm you plan to use. Careful consideration here contributes significantly to building effective models.
© 2025 ApX Machine Learning