While methods like Standardization and Normalization adjust the scale of features, and transformations like Log or Box-Cox attempt to make distributions more Gaussian-like, Quantile Transformation takes a different approach. It's a non-linear transformation that maps the probability distribution of a feature to another specific distribution (either uniform or normal), regardless of the original distribution's shape. This is achieved by leveraging the ranks, or quantiles, of the data points.
The core idea behind quantile transformation is to estimate the empirical cumulative distribution function (CDF) of a feature and then use this CDF to map the original values to the desired output distribution.
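To make that two-step mapping concrete, here is a minimal, rank-based sketch: estimate each point's empirical CDF value from its rank, then (for a normal target) pass it through the inverse normal CDF. This is only an illustration of the idea; scikit-learn's actual implementation interpolates between a fixed number of estimated quantiles rather than using raw ranks, and the variable names here are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)  # a skewed sample

# Step 1: empirical CDF value for each point via its rank,
# offset by 0.5 so results stay strictly inside (0, 1).
ranks = np.argsort(np.argsort(x))   # rank of each value, 0 .. n-1
u = (ranks + 0.5) / len(x)          # approximately Uniform(0, 1)

# Step 2: the inverse normal CDF (percent-point function) maps the
# uniform values to an approximately standard normal distribution.
z = norm.ppf(u)

print(round(z.mean(), 3), round(z.std(), 3))  # close to 0 and 1
```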
Because this method relies on the rank order of the data points rather than their absolute values, it is inherently robust to outliers. Outliers are simply mapped to the extreme ends of the target distribution (e.g., close to 0 or 1 for uniform output, or large negative/positive values for normal output), but they do not disproportionately affect the transformation of the remaining points, unlike `StandardScaler` or `MinMaxScaler`.
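A quick sketch of this robustness, using an illustrative dataset with one extreme outlier: `MinMaxScaler` squashes the inliers into a narrow sliver of the output range, while `QuantileTransformer` keeps them spread across nearly all of it.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer

rng = np.random.default_rng(42)
# 99 well-behaved values plus one extreme outlier
x = np.append(rng.normal(loc=10, scale=2, size=99), 1000.0).reshape(-1, 1)

x_mm = MinMaxScaler().fit_transform(x)
x_qt = QuantileTransformer(output_distribution='uniform',
                           n_quantiles=100).fit_transform(x)

# Spread of the 99 inliers after each transformation:
print(np.ptp(x_mm[:-1]))  # tiny (~0.01): the outlier dominates the scale
print(np.ptp(x_qt[:-1]))  # ~0.99: inliers still span almost all of [0, 1]
```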
Scikit-learn provides the `sklearn.preprocessing.QuantileTransformer` class for this purpose. Let's see how to use it.
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Generate some skewed data
np.random.seed(42)
data_original = np.random.exponential(scale=2, size=1000).reshape(-1, 1) + 1  # add 1 to avoid zeros if using log later

# Initialize transformers
qt_uniform = QuantileTransformer(output_distribution='uniform', n_quantiles=1000, random_state=42)
qt_normal = QuantileTransformer(output_distribution='normal', n_quantiles=1000, random_state=42)

# Apply transformations
data_uniform = qt_uniform.fit_transform(data_original)
data_normal = qt_normal.fit_transform(data_original)

# Create DataFrame for easier plotting
df = pd.DataFrame({
    'Original': data_original.flatten(),
    'Uniform Quantile': data_uniform.flatten(),
    'Normal Quantile': data_normal.flatten()
})

# --- Visualization ---
fig = make_subplots(rows=1, cols=3,
                    subplot_titles=('Original Exponential Data',
                                    'Uniform Quantile Transformed',
                                    'Normal Quantile Transformed'))
fig.add_trace(go.Histogram(x=df['Original'], name='Original', marker_color='#4dabf7'), row=1, col=1)
fig.add_trace(go.Histogram(x=df['Uniform Quantile'], name='Uniform', marker_color='#38d9a9'), row=1, col=2)
fig.add_trace(go.Histogram(x=df['Normal Quantile'], name='Normal', marker_color='#be4bdb'), row=1, col=3)
fig.update_layout(
    title_text='Effect of Quantile Transformation on Skewed Data',
    bargap=0.1,
    showlegend=False,
    height=350,
    margin=dict(l=20, r=20, t=60, b=20)
)

fig.show()  # or print(fig.to_json()) to export the chart JSON
```
Histograms of the three distributions (left to right): the original exponential data; the uniform quantile transformed values, spread evenly across the output range; and the normal quantile transformed values, which now resemble a Gaussian shape.
Quantile transformation offers several advantages and important considerations:

- Fit on training data only: as with other scalers, apply `fit` only on the training data, then use `transform` on both the training and test sets. Calling `fit_transform` on the entire dataset before splitting causes data leakage, where information from the test set implicitly influences the training process and yields overly optimistic performance estimates (see the sketch after this list).

Consider using quantile transformation when:

- A feature's distribution is heavily skewed and you want a uniform or Gaussian-shaped input.
- The data contains outliers that would distort scalers based on the mean and variance or the minimum and maximum.
- A downstream model is sensitive to the shape of the input feature distributions.
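A minimal sketch of the leakage-safe workflow referenced above (the variable names and split parameters here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(42)
X = rng.exponential(scale=2.0, size=(1000, 1))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

qt = QuantileTransformer(output_distribution='normal', n_quantiles=800)
X_train_t = qt.fit_transform(X_train)  # quantiles estimated from training data only
X_test_t = qt.transform(X_test)        # test data mapped with those same quantiles
```

Wrapping the transformer in a `sklearn.pipeline.Pipeline` enforces the same discipline automatically during cross-validation.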
While a powerful tool, quantile transformation can make models harder to interpret, since the transformed values no longer have a direct, linear relationship to the original scale. For many predictive modeling tasks, however, the improved model performance outweighs this drawback.
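One mitigation worth knowing: `QuantileTransformer` implements `inverse_transform`, so transformed values can be mapped back to approximately the original scale when you need to report them in their natural units. A brief sketch:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=(1000, 1))

qt = QuantileTransformer(output_distribution='normal', n_quantiles=1000)
z = qt.fit_transform(x)

# Map the Gaussian-shaped values back toward the original scale.
x_back = qt.inverse_transform(z)
print(np.abs(x_back - x).max())  # small round-trip error from interpolation
```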