As discussed in the chapter introduction, nominal categorical features represent categories without any inherent order or rank. Think of colors ('Red', 'Blue', 'Green'), city names ('London', 'Tokyo', 'New York'), or product types ('Electronics', 'Clothing', 'Groceries'). Machine learning algorithms generally require numerical input, so we need a way to translate these labels into numbers without imposing an artificial order. One of the most common and straightforward techniques for this is One-Hot Encoding (OHE).
The fundamental idea behind One-Hot Encoding is to create new binary features for each unique category present in the original nominal feature. For every observation (row) in your dataset, only one of these new binary features will be '1' (hot), indicating the presence of that specific category, while all other newly created binary features will be '0'.
Imagine a feature called 'Color' with three unique values: 'Red', 'Green', and 'Blue'. One-Hot Encoding transforms this single column into three new columns: 'Color_Red', 'Color_Green', and 'Color_Blue'.
This transformation effectively communicates the categorical information to the algorithm using numerical (0/1) values without implying any ordinal relationship between the categories.
Figure: transformation of a 'Color' feature using One-Hot Encoding; each unique category gets its own binary column.
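Concretely, the mapping looks like this (these are the same values the code below produces):

Color    Color_Red  Color_Green  Color_Blue
Red          1           0           0
Green        0           1           0
Blue         0           0           1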
Pandas provides a convenient function, get_dummies(), for performing One-Hot Encoding directly on DataFrames or Series.
import pandas as pd

# Sample DataFrame
data = {'ID': [1, 2, 3, 4],
        'Color': ['Red', 'Green', 'Blue', 'Green'],
        'Value': [10, 15, 5, 12]}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Apply One-Hot Encoding to the 'Color' column.
# dtype=int yields 0/1 columns; recent pandas versions otherwise default to booleans (True/False)
df_encoded = pd.get_dummies(df, columns=['Color'], prefix='Color', prefix_sep='_', dtype=int)

print("\nDataFrame after One-Hot Encoding:")
print(df_encoded)
Output:

Original DataFrame:
   ID  Color  Value
0   1    Red     10
1   2  Green     15
2   3   Blue      5
3   4  Green     12

DataFrame after One-Hot Encoding:
   ID  Value  Color_Blue  Color_Green  Color_Red
0   1     10           0            0          1
1   2     15           0            1          0
2   3      5           1            0          0
3   4     12           0            1          0
Key parameters for pd.get_dummies():

- data: The DataFrame or Series to encode.
- columns: A list of column names to encode. If None, attempts to encode all columns with object or category dtype.
- prefix: A string or list of strings appended to the beginning of the new column names; helps identify the original feature.
- prefix_sep: Separator string between the prefix and the category name (default is '_').
- drop_first: A boolean (default False). If set to True, it drops the first category level for each feature. This is useful to avoid multicollinearity in some linear models, as the information for the dropped category is implicitly captured when all other category columns are 0. For example, if 'Color_Blue' were dropped, a row with Color_Green=0 and Color_Red=0 would imply the color is Blue (see the sketch below).
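As a quick illustration of drop_first, here is a minimal sketch reusing the df defined above; the dropped 'Blue' column (first alphabetically) is implied whenever both remaining columns are 0:

# Drop the first category level ('Blue', alphabetically)
df_dropped = pd.get_dummies(df, columns=['Color'], prefix='Color',
                            drop_first=True, dtype=int)
print(df_dropped)
#    ID  Value  Color_Green  Color_Red
# 0   1     10            0          1
# 1   2     15            1          0
# 2   3      5            0          0   <- both 0: the color is Blue
# 3   4     12            1          0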
Scikit-learn's OneHotEncoder (from sklearn.preprocessing) is often preferred when building machine learning pipelines, especially for its ability to handle data consistently between training and testing phases.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample DataFrame (same as before)
data = {'ID': [1, 2, 3, 4],
        'Color': ['Red', 'Green', 'Blue', 'Green'],
        'Value': [10, 15, 5, 12]}
df = pd.DataFrame(data)

# Select the categorical column(s) - note: OneHotEncoder expects 2D array-like input
categorical_features = df[['Color']]

# Initialize the encoder
# sparse_output=False returns a dense NumPy array (easier to view)
# handle_unknown='ignore' prevents errors if unseen categories appear in test data
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit the encoder to the data (learns the categories) and transform it
encoded_data = encoder.fit_transform(categorical_features)

# Get the new feature names generated by the encoder
encoded_feature_names = encoder.get_feature_names_out(['Color'])

# Create a new DataFrame with the encoded features
df_encoded_sklearn = pd.DataFrame(encoded_data, columns=encoded_feature_names, index=df.index)

# Combine with the original non-categorical features
df_final = pd.concat([df.drop('Color', axis=1), df_encoded_sklearn], axis=1)

print("\nDataFrame after Scikit-learn OneHotEncoder:")
print(df_final)

# Example: transforming new data (with the unseen category 'Yellow')
new_data = pd.DataFrame({'Color': ['Green', 'Red', 'Yellow']})
new_encoded = encoder.transform(new_data[['Color']])

print("\nEncoded new data:")
print(pd.DataFrame(new_encoded, columns=encoded_feature_names))
Output:

DataFrame after Scikit-learn OneHotEncoder:
   ID  Value  Color_Blue  Color_Green  Color_Red
0   1     10         0.0          0.0        1.0
1   2     15         0.0          1.0        0.0
2   3      5         1.0          0.0        0.0
3   4     12         0.0          1.0        0.0

Encoded new data:
   Color_Blue  Color_Green  Color_Red
0         0.0          1.0        0.0
1         0.0          0.0        1.0
2         0.0          0.0        0.0
Notice how 'Yellow', an unseen category, resulted in all zeros because handle_unknown='ignore' was used. Other options for handle_unknown include 'error' (the default, which raises an error on unseen categories) and 'infrequent_if_exist' (which maps unseen categories to a grouped "infrequent" column if one exists, and otherwise behaves like 'ignore').
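Here is a minimal sketch of 'infrequent_if_exist' in action, assuming a recent scikit-learn version (1.1 or later, where min_frequency and the infrequent-category handling are available). Categories seen fewer than min_frequency times are grouped into an infrequent column, and unseen categories map there too:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Color': ['Red', 'Red', 'Green', 'Green', 'Green', 'Blue']})

# Categories seen fewer than min_frequency times are treated as infrequent
enc = OneHotEncoder(handle_unknown='infrequent_if_exist',
                    min_frequency=2, sparse_output=False)
enc.fit(train)

# 'Blue' (seen only once) lands in the grouped infrequent column
print(enc.get_feature_names_out(['Color']))

# The unseen category 'Yellow' maps to that same infrequent column
print(enc.transform(pd.DataFrame({'Color': ['Yellow']})))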
Key parameters for sklearn.preprocessing.OneHotEncoder:

- categories: Specify categories manually (default 'auto' learns them from the data).
- drop: {'first', 'if_binary'} or None (default). Similar to drop_first in Pandas, used to drop one category per feature. 'if_binary' drops the first category only for features with exactly two categories.
- sparse_output: Boolean (default True). Whether to return a sparse matrix (memory-efficient for high dimensions) or a dense NumPy array. Set to False for easier inspection or smaller datasets (see the sketch after this list).
- handle_unknown: {'error', 'ignore', 'infrequent_if_exist'}. How to handle categories encountered during transform that were not seen during fit. 'ignore' outputs all zeros for the feature.
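As a quick illustration of sparse_output, the sketch below uses only the calls shown above plus SciPy's standard toarray() conversion:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})

# Default behavior: a SciPy sparse matrix, which stores only the non-zero entries
sparse_enc = OneHotEncoder()  # sparse_output=True by default
sparse_result = sparse_enc.fit_transform(colors)
print(type(sparse_result))       # a scipy.sparse matrix type
print(sparse_result.toarray())   # densify for inspection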
Two points deserve attention when applying OHE:

- The dummy variable trap: when all categories are kept (drop_first=False in Pandas, or drop=None in Scikit-learn), the resulting features are perfectly multicollinear. Exactly one column is hot per row, so the encoded columns for a feature always sum to 1 and the value of any one column can be perfectly predicted from the others. While many algorithms (especially tree-based ones like Random Forests or Gradient Boosting) handle this reasonably well, linear models (like Linear Regression or Logistic Regression) can have issues with coefficient stability and interpretation. Using drop_first=True or drop='first' mitigates this; a short sketch follows this list.
- Unseen categories: if categories appear at prediction time that were not seen during training (fit), both pd.get_dummies (by default) and Scikit-learn's OneHotEncoder (with handle_unknown='error') will cause issues, either missing columns or errors. Using Scikit-learn's handle_unknown='ignore' or 'infrequent_if_exist' provides a robust way to manage this.
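To make the dummy variable trap concrete, here is a minimal sketch; the column sums and the dropped column follow directly from the encoding:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})

# Keeping all categories: every row's encoded columns sum to exactly 1
full = OneHotEncoder(sparse_output=False).fit_transform(colors)
print(np.allclose(full.sum(axis=1), 1.0))  # True: perfect linear dependence

# drop='first' removes one column per feature, breaking the dependence
dropped = OneHotEncoder(drop='first', sparse_output=False).fit_transform(colors)
print(dropped.shape)  # (4, 2): the 'Blue' column is dropped, implied by all-zero rows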
OHE is a solid default choice for nominal features with a small to moderate number of unique categories. For features with very high cardinality, alternative encoding methods like Target Encoding, Hashing, or Binning (discussed later) might be more appropriate to manage the dimensionality.