Now that we've discussed various theoretical approaches for assessing statistical fidelity beyond simple univariate measures, let's put these ideas into practice. This section provides hands-on examples using Python to implement several multivariate tests. We'll compare simulated "real" and "synthetic" datasets to see how these tests work and how to interpret their results.
Remember from Chapter 1 that you should have a Python environment ready with libraries like NumPy, SciPy, Pandas, and optionally Scikit-learn and Matplotlib/Seaborn.
First, let's import the necessary libraries and generate some simple multivariate data. We'll create a "real" dataset drawn from a multivariate normal distribution and a "synthetic" dataset that might be slightly different (e.g., different mean, covariance, or distribution type) to illustrate the tests.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.metrics.pairwise import rbf_kernel
import plotly.graph_objects as go
import json # For embedding plotly json
# Set random seed for reproducibility
np.random.seed(42)
# Generate 'Real' Data (e.g., 2D Normal Distribution)
mean_real = [0, 0]
cov_real = [[1, 0.5], [0.5, 1]] # Variables are correlated
n_samples_real = 200
real_data = np.random.multivariate_normal(mean_real, cov_real, n_samples_real)
real_df = pd.DataFrame(real_data, columns=['Feature1', 'Feature2'])
# Generate 'Synthetic' Data (Slightly different mean and covariance)
mean_synth = [0.2, -0.1] # Slightly shifted mean
cov_synth = [[1.1, 0.45], [0.45, 0.9]] # Slightly different covariance
n_samples_synth = 200
synthetic_data = np.random.multivariate_normal(mean_synth, cov_synth, n_samples_synth)
synthetic_df = pd.DataFrame(synthetic_data, columns=['Feature1', 'Feature2'])
# Generate another 'Synthetic' Data (Different distribution type - e.g., uniform noise)
# This should be easily distinguishable by better tests
synthetic_data_uniform = np.random.rand(n_samples_synth, 2) * 4 - 2 # Uniform between -2 and 2
synthetic_df_uniform = pd.DataFrame(synthetic_data_uniform, columns=['Feature1', 'Feature2'])
print("Real Data Sample:")
print(real_df.head())
print("\nSynthetic Data (Normal) Sample:")
print(synthetic_df.head())
print("\nSynthetic Data (Uniform) Sample:")
print(synthetic_df_uniform.head())
Now we have three datasets: real_df, synthetic_df (similar distribution), and synthetic_df_uniform (very different distribution). Let's apply some tests.
As discussed earlier, simply comparing means feature by feature ignores correlations. Hotelling's T² test compares the mean vectors of two multivariate samples. It's a multivariate generalization of the t-test. The null hypothesis (H0) is that the two samples have the same mean vector.
This test assumes multivariate normality and equal covariance matrices between the groups, though it has some robustness, especially with larger samples. Libraries like scipy don't provide a direct implementation, but the statistic can be computed from the sample means and the pooled covariance matrix, or obtained from specialized statistics packages.
For demonstration purposes, let's imagine a hypothetical function hotelling_t2(X, Y) that returns the T² statistic and the p-value.
# Hypothetical function call (implementation often requires dedicated libraries
# or manual calculation based on inverse covariance matrices)
# T2_statistic, p_value = hotelling_t2(real_data, synthetic_data)
# print(f"Hotelling T2 Test (Real vs. Synth Normal): Statistic={T2_statistic:.4f}, p-value={p_value:.4f}")
# T2_statistic_uniform, p_value_uniform = hotelling_t2(real_data, synthetic_data_uniform)
# print(f"Hotelling T2 Test (Real vs. Synth Uniform): Statistic={T2_statistic_uniform:.4f}, p-value={p_value_uniform:.4f}")
# Since scipy doesn't directly provide it, we will skip the execution
# but describe the interpretation.
# Interpretation:
# A small p-value (e.g., < 0.05) suggests rejecting H0, meaning there's a
# statistically significant difference between the mean vectors.
# For our 'real_data' vs 'synthetic_data', the mean shift (0.2, -0.1) is modest
# relative to the variance, so the result may be borderline at this sample size.
# For 'real_data' vs 'synthetic_data_uniform', note that the uniform data is also
# centered near the origin, so the test may fail to reject H0 even though the
# distributions are clearly different; a means-only test cannot detect that.
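If you want to run this comparison, below is a minimal sketch of such a hotelling_t2 function. It assumes equal covariance matrices (a pooled estimate) and uses the standard F-distribution approximation for the p-value; treat it as an illustration rather than a substitute for a well-tested statistics package.
def hotelling_t2(X, Y):
    """Two-sample Hotelling's T-squared test (sketch, assumes pooled covariance)."""
    n, p = X.shape
    m = Y.shape[0]
    mean_diff = X.mean(axis=0) - Y.mean(axis=0)
    # Pooled covariance estimate across both samples
    S_pooled = ((n - 1) * np.cov(X, rowvar=False) +
                (m - 1) * np.cov(Y, rowvar=False)) / (n + m - 2)
    # T-squared statistic: scaled Mahalanobis distance between the mean vectors
    t2 = (n * m) / (n + m) * mean_diff @ np.linalg.solve(S_pooled, mean_diff)
    # Convert to an F statistic: F ~ F(p, n + m - p - 1) under H0
    f_stat = (n + m - p - 1) / (p * (n + m - 2)) * t2
    p_value = stats.f.sf(f_stat, p, n + m - p - 1)
    return t2, p_value
With this sketch in place, the commented-out calls above can be run as written.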
While powerful for comparing means, Hotelling's T² doesn't tell us about the overall distribution shape or higher-order moments.
MMD is a non-parametric metric that computes a distance between the embeddings of two distributions in a Reproducing Kernel Hilbert Space (RKHS). A smaller MMD value indicates greater similarity between the distributions. It's particularly useful because it doesn't make strong assumptions about the underlying data distributions.
We can calculate an empirical estimate of MMD using a kernel function, commonly the Radial Basis Function (RBF) kernel: k(x, y) = exp(−γ‖x − y‖²).
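The function below uses the standard unbiased empirical estimate of squared MMD: the mean of the within-X kernel values (excluding the diagonal), plus the mean of the within-Y kernel values (excluding the diagonal), minus twice the mean of the cross-sample kernel values. Writing n and m for the two sample sizes:
MMD²(X, Y) = (1/(n(n−1))) Σ_{i≠j} k(xᵢ, xⱼ) + (1/(m(m−1))) Σ_{i≠j} k(yᵢ, yⱼ) − (2/(nm)) Σ_{i,j} k(xᵢ, yⱼ)
These three terms correspond to term1, term2, and term3 in the code.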
def mmd_rbf(X, Y, gamma=1.0):
    """
    Calculates the Maximum Mean Discrepancy (MMD) between two samples X and Y
    using the Radial Basis Function (RBF) kernel.

    Args:
        X (np.ndarray): First sample (n_samples_x, n_features)
        Y (np.ndarray): Second sample (n_samples_y, n_features)
        gamma (float): RBF kernel parameter.

    Returns:
        float: The computed MMD statistic (squared).
    """
    if X.shape[1] != Y.shape[1]:
        raise ValueError("Samples X and Y must have the same number of features.")

    K_XX = rbf_kernel(X, X, gamma=gamma)
    K_YY = rbf_kernel(Y, Y, gamma=gamma)
    K_XY = rbf_kernel(X, Y, gamma=gamma)

    n = K_XX.shape[0]
    m = K_YY.shape[0]

    # Use the unbiased (U-statistic) form of the estimator
    # Term 1: mean of K(xi, xj) over i != j
    term1 = (K_XX.sum() - np.diag(K_XX).sum()) / (n * (n - 1)) if n > 1 else 0
    # Term 2: mean of K(yi, yj) over i != j
    term2 = (K_YY.sum() - np.diag(K_YY).sum()) / (m * (m - 1)) if m > 1 else 0
    # Term 3: mean of K(xi, yj) over all cross pairs
    term3 = K_XY.sum() / (n * m) if n > 0 and m > 0 else 0

    mmd_squared = term1 + term2 - 2 * term3
    # The statistic is MMD squared; clip at zero to guard against small
    # negative values caused by floating-point error.
    return max(0, mmd_squared)
# Choose a gamma. This can be set with a heuristic such as 1 / (median pairwise
# distance), or simply a default value. We use a simple default for illustration.
gamma_val = 1.0 / real_data.shape[1]  # Rule of thumb: 1 / n_features
mmd_val_normal = mmd_rbf(real_data, synthetic_data, gamma=gamma_val)
mmd_val_uniform = mmd_rbf(real_data, synthetic_data_uniform, gamma=gamma_val)
print(f"\nMMD (Real vs. Synth Normal): {mmd_val_normal:.6f}")
print(f"MMD (Real vs. Synth Uniform): {mmd_val_uniform:.6f}")
# Interpretation:
# MMD values are distances; smaller is better. 0 indicates identical distributions (empirically).
# We see a much smaller MMD for the synthetic data generated from a similar normal
# distribution compared to the synthetic data from a uniform distribution.
# This aligns with our expectations. To assess significance, permutation tests are often used
# (shuffling data between X and Y to build a null distribution for MMD), but the
# relative magnitudes here are already informative.
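The permutation test mentioned above can be sketched in a few lines: pool the two samples, repeatedly reshuffle the pooled rows into two groups of the original sizes, and count how often the reshuffled MMD reaches the observed value. The helper below also shows one common way to derive gamma from the median pairwise distance; both helper names and the specific heuristic are illustrative assumptions rather than fixed conventions.
from sklearn.metrics.pairwise import euclidean_distances

def median_heuristic_gamma(X, Y):
    """Illustrative gamma choice: 1 / (2 * median_pairwise_distance^2)."""
    Z = np.vstack([X, Y])
    dists = euclidean_distances(Z, Z)
    med = np.median(dists[np.triu_indices_from(dists, k=1)])
    return 1.0 / (2 * med ** 2)

def mmd_permutation_pvalue(X, Y, gamma, n_permutations=500, seed=0):
    """Approximate p-value for MMD via a permutation test (sketch)."""
    rng = np.random.default_rng(seed)
    observed = mmd_rbf(X, Y, gamma=gamma)
    Z = np.vstack([X, Y])
    n = X.shape[0]
    count = 0
    for _ in range(n_permutations):
        idx = rng.permutation(Z.shape[0])
        X_perm, Y_perm = Z[idx[:n]], Z[idx[n:]]
        if mmd_rbf(X_perm, Y_perm, gamma=gamma) >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly positive
    return (count + 1) / (n_permutations + 1)

gamma_med = median_heuristic_gamma(real_data, synthetic_data)
p_perm = mmd_permutation_pvalue(real_data, synthetic_data, gamma=gamma_med)
print(f"Permutation p-value for MMD (Real vs. Synth Normal): {p_perm:.4f}")
A small permutation p-value would indicate that the observed MMD is unlikely under the hypothesis that both samples come from the same distribution.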
MMD gives a good sense of overall distributional similarity, capturing differences beyond just the mean.
Preserving the relationships between variables is often as important as matching individual variable distributions. We can compare the covariance matrices (Σ) of the real and synthetic datasets.
# Calculate covariance matrices
cov_matrix_real = np.cov(real_data, rowvar=False)
cov_matrix_synth = np.cov(synthetic_data, rowvar=False)
cov_matrix_synth_uniform = np.cov(synthetic_data_uniform, rowvar=False)
print("\nCovariance Matrix (Real):")
print(cov_matrix_real)
print("\nCovariance Matrix (Synth Normal):")
print(cov_matrix_synth)
print("\nCovariance Matrix (Synth Uniform):")
print(cov_matrix_synth_uniform)
# Calculate the difference using Frobenius norm
diff_norm_normal = np.linalg.norm(cov_matrix_real - cov_matrix_synth, 'fro')
diff_norm_uniform = np.linalg.norm(cov_matrix_real - cov_matrix_synth_uniform, 'fro')
print(f"\nFrobenius Norm of Covariance Difference (Real vs. Synth Normal): {diff_norm_normal:.4f}")
print(f"Frobenius Norm of Covariance Difference (Real vs. Synth Uniform): {diff_norm_uniform:.4f}")
# Interpretation:
# The Frobenius norm gives a single number representing the magnitude of the difference
# between the two matrices. A smaller norm indicates more similar covariance structures.
# As expected, the difference is smaller for the synthetic data generated with similar
# parameters. The uniform data has a fundamentally different covariance structure:
# near-zero off-diagonal terms (its features are generated independently) and different variances.
# This metric confirms that the relationships (covariance) are better preserved in
# `synthetic_data` than in `synthetic_data_uniform`.
More formal tests like Box's M test exist for statistically comparing covariance matrices but are sensitive to non-normality. The Frobenius norm provides a practical, direct measure of difference.
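For reference, here is a minimal sketch of Box's M test for two groups using its common chi-square approximation. The function name and structure are illustrative choices for this exercise; given the test's sensitivity to non-normality, verify results against a dedicated statistics implementation before relying on them.
def box_m_test(X, Y):
    """Box's M test for equality of two covariance matrices (chi-square approximation, sketch)."""
    p = X.shape[1]
    dfs = np.array([X.shape[0] - 1, Y.shape[0] - 1])
    covs = [np.cov(X, rowvar=False), np.cov(Y, rowvar=False)]
    S_pooled = sum(df * S for df, S in zip(dfs, covs)) / dfs.sum()
    # M statistic from log-determinants of the pooled and group covariances
    logdet_pooled = np.linalg.slogdet(S_pooled)[1]
    M = dfs.sum() * logdet_pooled - sum(df * np.linalg.slogdet(S)[1] for df, S in zip(dfs, covs))
    # Box's scaling factor for the chi-square approximation (k = 2 groups)
    k = 2
    c1 = (np.sum(1.0 / dfs) - 1.0 / dfs.sum()) * (2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (k - 1))
    chi2_stat = M * (1 - c1)
    df_chi2 = p * (p + 1) * (k - 1) / 2
    p_value = stats.chi2.sf(chi2_stat, df_chi2)
    return chi2_stat, p_value

chi2_n, p_n = box_m_test(real_data, synthetic_data)
chi2_u, p_u = box_m_test(real_data, synthetic_data_uniform)
print(f"Box's M (Real vs. Synth Normal): chi2={chi2_n:.4f}, p-value={p_n:.4f}")
print(f"Box's M (Real vs. Synth Uniform): chi2={chi2_u:.4f}, p-value={p_u:.4f}")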
Sometimes, a plot is worth a thousand statistics. Visualizing the data can provide immediate insights into how well the synthetic data captures the multivariate structure. A scatter plot is effective for 2D data.
# Prepare data for Plotly
trace_real = go.Scatter(
    x=real_df['Feature1'],
    y=real_df['Feature2'],
    mode='markers',
    name='Real Data',
    marker=dict(color='#1c7ed6', size=5, opacity=0.6)  # Blue
)
trace_synth_normal = go.Scatter(
    x=synthetic_df['Feature1'],
    y=synthetic_df['Feature2'],
    mode='markers',
    name='Synthetic (Normal)',
    marker=dict(color='#f76707', size=5, opacity=0.6)  # Orange
)
trace_synth_uniform = go.Scatter(
    x=synthetic_df_uniform['Feature1'],
    y=synthetic_df_uniform['Feature2'],
    mode='markers',
    name='Synthetic (Uniform)',
    marker=dict(color='#37b24d', size=5, opacity=0.6)  # Green
)
# Plot Real vs Synthetic (Normal)
layout1 = go.Layout(
    title='Real vs. Synthetic (Normal) Data Distribution',
    xaxis=dict(title='Feature 1'),
    yaxis=dict(title='Feature 2'),
    width=600, height=500,
    legend=dict(x=0.01, y=0.99)
)
fig1 = go.Figure(data=[trace_real, trace_synth_normal], layout=layout1)
# Plot Real vs Synthetic (Uniform)
layout2 = go.Layout(
    title='Real vs. Synthetic (Uniform) Data Distribution',
    xaxis=dict(title='Feature 1'),
    yaxis=dict(title='Feature 2'),
    width=600, height=500,
    legend=dict(x=0.01, y=0.99)
)
fig2 = go.Figure(data=[trace_real, trace_synth_uniform], layout=layout2)
# Show JSON for embedding (only showing one for brevity)
# In a live environment, you'd render these figures.
plotly_json_str = fig1.to_json()
# print(plotly_json_str) # This would output the long JSON string
Comparison of the simulated "real" data (blue) and synthetic data drawn from a similar multivariate normal distribution (orange). The overlap and general shape appear quite similar.
If you were to plot real_df against synthetic_df_uniform, the difference would be stark: the real data forms an elliptical cloud, while the uniform synthetic data forms a square. For higher dimensions, consider using pair plots (seaborn.pairplot) or dimensionality reduction techniques (like PCA or t-SNE) before plotting.
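As a quick sketch of the pair plot approach (assuming seaborn and matplotlib are installed), you could label each sample and let pairplot overlay them:
import seaborn as sns
import matplotlib.pyplot as plt

# Combine real and synthetic samples with a label column for coloring
real_labeled = real_df.assign(source='Real')
synth_labeled = synthetic_df.assign(source='Synthetic (Normal)')
combined = pd.concat([real_labeled, synth_labeled], ignore_index=True)

# Pairwise scatter plots and per-feature distributions, colored by source
sns.pairplot(combined, hue='source', plot_kws={'alpha': 0.5, 's': 15})
plt.show()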
This practical exercise demonstrated how to implement and interpret several multivariate statistical tests for fidelity assessment: Hotelling's T² for comparing mean vectors, MMD for overall distributional similarity, the Frobenius norm of the covariance difference for relationship structure, and visual inspection of the joint distributions.
Running these tests provides a much more comprehensive picture of statistical fidelity than univariate comparisons alone. You saw how the synthetic data designed to be similar (synthetic_df) scored better (lower MMD, lower covariance difference) than the deliberately different uniform data (synthetic_df_uniform). In practice, you would apply these tests to your actual synthetic data and compare the results against the original dataset to quantify how well the statistical properties have been preserved. Remember to consider the context: the required level of statistical fidelity depends heavily on the intended use case for the synthetic data.