As discussed, evaluating synthetic data isn't a single check but a multi-faceted process touching upon statistical similarity, practical usefulness, and privacy assurance. Given the variety of potential checks, from simple statistical comparisons to complex attack simulations, organizing them is necessary for a systematic evaluation. A practical way to structure these metrics is by the type of assessment they perform, which generally aligns with the core quality dimensions: fidelity, utility, and privacy.
We can broadly group evaluation metrics into the following categories:
Statistical Fidelity Metrics
These metrics quantify how closely the statistical properties of the synthetic dataset (Dsyn) mirror those of the real dataset (Dreal). They directly assess the "likeness" of the generated data distribution to the original one.
- Marginal Distribution Comparisons: These are often the first step, comparing the distribution of individual columns (features) between Dreal and Dsyn. Common techniques include:
- Visualizations: Overlaying histograms or density plots for each feature.
- Summary Statistics: Comparing mean, median, variance, quantiles, etc.
- Statistical Tests: Applying tests such as the two-sample Kolmogorov-Smirnov (KS) test to compare each feature's real and synthetic distributions, or the Shapiro-Wilk test to check whether a feature remains approximately normal in both datasets. While simple, these univariate checks do not capture relationships between features.
- Multivariate Distribution Comparisons: Assessing the joint distribution and inter-feature dependencies is more informative but also more complex. Methods include:
- Correlation Analysis: Comparing correlation matrices (e.g., using Frobenius norm difference) between Dreal and Dsyn.
- Dimensionality Reduction: Applying techniques like Principal Component Analysis (PCA) to both datasets and comparing the resulting principal components or the distribution of data in the reduced space.
- Advanced Statistical Tests: Utilizing multivariate hypothesis tests (e.g., Hotelling's T-squared test for means, tests based on maximum mean discrepancy or energy statistics) to compare the overall distributions. We will examine these in detail in Chapter 2.
- Propensity Scores: Training a classifier to distinguish between real and synthetic samples. A classifier that performs near random chance (e.g., AUC close to 0.5) indicates high similarity. A short code sketch covering several of these fidelity checks follows this list.
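To make these checks concrete, the following is a minimal sketch in Python (using pandas, SciPy, and scikit-learn) that computes per-feature two-sample KS statistics, the Frobenius-norm difference between correlation matrices, and a propensity-score AUC. It assumes both datasets share numeric columns; function names such as marginal_ks_report and propensity_auc are illustrative, not part of any particular library.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def _shared_numeric_columns(d_real: pd.DataFrame, d_syn: pd.DataFrame):
    # Restrict the comparison to numeric columns present in both datasets.
    return d_real.select_dtypes(include=np.number).columns.intersection(d_syn.columns)


def marginal_ks_report(d_real: pd.DataFrame, d_syn: pd.DataFrame) -> pd.DataFrame:
    """Two-sample KS statistic and p-value for each shared numeric feature."""
    rows = []
    for col in _shared_numeric_columns(d_real, d_syn):
        stat, p_value = stats.ks_2samp(d_real[col].dropna(), d_syn[col].dropna())
        rows.append({"feature": col, "ks_stat": stat, "p_value": p_value})
    return pd.DataFrame(rows)


def correlation_difference(d_real: pd.DataFrame, d_syn: pd.DataFrame) -> float:
    """Frobenius norm of the difference between the two correlation matrices."""
    cols = _shared_numeric_columns(d_real, d_syn)
    diff = d_real[cols].corr().values - d_syn[cols].corr().values
    return float(np.linalg.norm(diff, ord="fro"))


def propensity_auc(d_real: pd.DataFrame, d_syn: pd.DataFrame) -> float:
    """Cross-validated AUC of a classifier separating real from synthetic rows.

    Values near 0.5 suggest the two datasets are hard to distinguish.
    """
    cols = _shared_numeric_columns(d_real, d_syn)
    X = pd.concat([d_real[cols], d_syn[cols]], ignore_index=True).fillna(0.0)
    y = np.r_[np.ones(len(d_real)), np.zeros(len(d_syn))]
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return float(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```

In practice one would report all three outputs together, since a low per-feature KS statistic says nothing about dependencies, and a near-0.5 propensity AUC is the more demanding, joint check.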
Machine Learning Utility Metrics
These metrics evaluate the practical value of the synthetic data for training downstream machine learning models. The central question is: Can a model trained on Dsyn perform comparably to a model trained on Dreal when evaluated on unseen real data?
- Train-Synthetic-Test-Real (TSTR): This is the standard protocol. Train an ML model (e.g., logistic regression, random forest, neural network) on Dsyn and evaluate its performance (accuracy, F1-score, AUC, etc.) on a held-out portion of Dreal. Compare this performance to a baseline model trained and tested on separate portions of Dreal (see the sketch after this list).
- Train-Real-Test-Synthetic (TRTS): A complementary approach where a model is trained on Dreal and tested on Dsyn. This can help reveal whether the synthetic data contains unrealistic patterns or fails to cover aspects of the real data distribution that the model learned.
- Downstream Model Analysis: Beyond just performance scores, utility can be assessed by comparing:
- Feature importance rankings derived from models trained on Dreal vs. Dsyn.
- Model coefficients or decision boundaries.
- Behavior during hyperparameter optimization.
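A minimal TSTR sketch, assuming numeric features and a binary target column, is shown below using scikit-learn. Both models are scored on the same held-out slice of the real data, so the AUC gap isolates the effect of the training data; the helper name tstr_gap and the choice of gradient boosting are illustrative.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def tstr_gap(d_real, d_syn, target: str, random_state: int = 0) -> dict:
    """Compare Train-Synthetic-Test-Real AUC against a real-data baseline."""
    # Hold out part of the real data as the common evaluation set.
    real_train, real_test = train_test_split(
        d_real, test_size=0.3, random_state=random_state
    )

    def fit_and_score(train_df):
        model = GradientBoostingClassifier(random_state=random_state)
        model.fit(train_df.drop(columns=[target]), train_df[target])
        scores = model.predict_proba(real_test.drop(columns=[target]))[:, 1]
        return roc_auc_score(real_test[target], scores)

    baseline_auc = fit_and_score(real_train)  # Train-Real-Test-Real baseline
    tstr_auc = fit_and_score(d_syn)           # Train-Synthetic-Test-Real
    return {"baseline_auc": baseline_auc, "tstr_auc": tstr_auc,
            "gap": baseline_auc - tstr_auc}
```

A small gap suggests the synthetic data preserves the signal the model needs; a large gap points to missing or distorted relationships between the features and the target.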
Chapter 3 focuses on implementing and interpreting these utility assessments.
Privacy Metrics
These metrics aim to quantify the risk that sensitive information about individuals in Dreal leaks through Dsyn. This concern is distinct from fidelity and utility: the more faithfully synthetic data reproduces the real data, the greater the potential privacy risk.
- Membership Inference Attacks (MIAs): Attempting to determine if a specific record from Dreal was used to train the generative model by analyzing Dsyn. Success rates above random guessing indicate potential privacy leakage.
- Attribute Inference Attacks: Trying to predict sensitive attributes of a real individual given some of their other attributes (present in Dsyn) and access to the synthetic dataset.
- Distance-Based Metrics: Calculating distances between synthetic records and their nearest neighbors in the real dataset. Very small distances (especially exact matches) can signal privacy issues such as data copying; a sketch follows this list.
- Differential Privacy (DP) Guarantees: If the generative model incorporated DP mechanisms, the privacy loss parameters (like ϵ and δ) provide a formal, quantifiable measure of privacy protection.
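As an illustration of a distance-based check, the sketch below computes, for each synthetic record, the distance to its nearest real record after standardizing the features. It assumes purely numeric data; the function name nearest_real_distances and the threshold used in the usage comment are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def nearest_real_distances(d_real, d_syn) -> np.ndarray:
    """Distance from each synthetic record to its closest real record.

    Distances at or near zero (after scaling) flag possible copying of
    training records into the synthetic output.
    """
    # Scale on the real data so that no single feature dominates the distance.
    scaler = StandardScaler().fit(d_real)
    real_scaled = scaler.transform(d_real)
    syn_scaled = scaler.transform(d_syn)

    nn = NearestNeighbors(n_neighbors=1).fit(real_scaled)
    distances, _ = nn.kneighbors(syn_scaled)
    return distances.ravel()


# Example summary: the share of synthetic rows lying suspiciously close to a real row.
# copy_rate = float(np.mean(nearest_real_distances(d_real_num, d_syn_num) < 1e-6))
```

Such distances are usually interpreted relative to a reference, for example the distribution of nearest-neighbor distances within the real data itself, rather than against a fixed cutoff.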
We will explore methods for implementing and interpreting these privacy assessments in Chapter 4.
Domain-Specific Metrics
The metrics above are largely applicable to tabular data. However, different data modalities often require specialized evaluation techniques.
- Image Data: Metrics like Fréchet Inception Distance (FID), Inception Score (IS), and Precision/Recall assess the quality and diversity of generated images based on features extracted from pre-trained deep learning models.
- Text Data: Evaluating generated text involves metrics like Perplexity (measuring model uncertainty), BLEU scores (comparing generated text to reference texts), and semantic similarity scores.
- Time-Series Data: Assessment focuses on how well temporal dependencies are captured, using metrics such as autocorrelation function (ACF) similarity, spectral density comparisons, and performance on downstream forecasting tasks (see the sketch after this list).
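As one example of a time-series check, the sketch below compares the autocorrelation functions of a real and a synthetic series using statsmodels. Aggregating the comparison into a single mean-absolute-difference score is an illustrative choice, not a standard metric.

```python
import numpy as np
from statsmodels.tsa.stattools import acf


def acf_similarity(real_series, syn_series, n_lags: int = 20) -> float:
    """Mean absolute difference between the autocorrelation functions of two series.

    Smaller values indicate that the synthetic series reproduces the
    temporal dependence structure of the real series more faithfully.
    """
    real_acf = acf(real_series, nlags=n_lags, fft=True)
    syn_acf = acf(syn_series, nlags=n_lags, fft=True)
    return float(np.mean(np.abs(real_acf - syn_acf)))
```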
Chapter 5 covers these specialized metrics in greater detail.
Figure: A categorization of synthetic data evaluation metrics based on the assessment goal.
Understanding this taxonomy provides a roadmap for comprehensive evaluation. Rarely is a single metric sufficient. Instead, a suite of metrics across these categories is typically needed to gain a holistic view of the synthetic data's quality, reflecting the inherent trade-offs between fidelity, utility, and privacy that we introduced earlier. The specific choice of metrics within this framework will depend heavily on the data type, the generative model used, and the intended application of the synthetic data. In the following chapters, we will implement and interpret metrics from each of these categories.