Synthetic Data Generation for Disclosure Control

Also known as: Fully Synthetic Data, Partial Synthetic Data, Statistical Data Synthesis, Sentetik Veri Üretimi

Synthetic data generation is a statistical disclosure limitation technique proposed by Donald Rubin in 1993, in which values in a confidential dataset are replaced by draws from a fitted posterior predictive distribution rather than released directly. The resulting artificial records are intended to preserve the joint statistical structure of the original data while making it much harder to identify real individuals, so analysts can work with a publicly releasable dataset that behaves like the original for most inferential purposes, provided the synthesis model is adequate.
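
To make "draws from a fitted posterior predictive distribution" concrete, here is a minimal sketch of partial synthesis for a single sensitive variable. It assumes a normal linear regression with a noninformative prior, which is only one of many possible synthesizers; the function name synthesize_replicates, the toy variables age and income, and all numeric values are hypothetical and not taken from any particular release.

```python
import numpy as np

def synthesize_replicates(X, y, m=5, seed=0):
    """Return m synthetic copies of y, each drawn from the posterior
    predictive of a normal linear regression of y on X.
    Assumption: noninformative prior, X has full column rank."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)
    chol = np.linalg.cholesky(XtX_inv)  # chol @ chol.T == (X'X)^-1
    replicates = []
    for _ in range(m):
        # sigma^2 | data: scaled inverse chi-square with n - p degrees of freedom
        sigma2 = (n - p) * s2 / rng.chisquare(n - p)
        # beta | sigma^2, data: N(beta_hat, sigma^2 (X'X)^-1)
        beta = beta_hat + np.sqrt(sigma2) * (chol @ rng.standard_normal(p))
        # synthetic values | beta, sigma^2: N(X beta, sigma^2)
        replicates.append(X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n))
    return np.stack(replicates)

# Toy "confidential" data (entirely made up for illustration).
rng = np.random.default_rng(42)
n = 500
age = rng.uniform(20, 65, size=n)
X = np.column_stack([np.ones(n), age])
income = 10_000 + 800 * age + rng.normal(0, 5_000, size=n)

syn = synthesize_replicates(X, income, m=5)
print(syn.shape)  # one row per replicate, one column per record
```

Parameters are redrawn for every replicate, so the replicates differ by more than residual noise. That between-replicate spread is what lets analysts account for synthesis uncertainty later.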

When to use it

Use synthetic data generation when a dataset contains sensitive personal or commercially confidential information that legally or ethically cannot be shared in its original form, yet releasing some version of the data would benefit research or policy analysis. It is well-suited to large, complex microdata files where simple masking techniques would destroy analytical utility. Key assumptions are that the synthesis model captures the relevant joint distribution and that the intended analysis is consistent with what the synthesizer was designed to support. It is a poor fit when the research question depends on individual-level records being authentic, when the synthesis model is badly misspecified, or when adversarial re-identification attacks exploiting subtle distributional artifacts are a serious concern. Alternatives include differential privacy, data enclaves, and query-based access systems.

Strengths & limitations

Strengths

Releases a fully self-contained dataset that analysts can use with standard software and workflows without special infrastructure.
Grounded in a rigorous Bayesian multiple-imputation framework, providing valid variance estimation when combining rules are applied correctly.
Can reproduce complex multivariate relationships, rare subgroup distributions, and longitudinal structures that simple masking methods distort.
Scales naturally to high-dimensional microdata and can incorporate auxiliary public data to improve the fidelity of the synthesizer.

Limitations

Inferential validity depends critically on correct specification of the synthesis model; a misspecified model produces synthetic data with biased distributions that mislead analysts.
Combining rules require knowledge of whether data are fully or partially synthetic and how many replicates were produced, details that must accompany the release.
Generating high-fidelity synthetic data for variables with extreme skewness, sparse categories, or complex hierarchical structures remains technically demanding.
Membership inference and attribute disclosure attacks have shown that synthetic datasets are not unconditionally private; formal privacy guarantees require additional mechanisms such as differential privacy.
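
The last limitation can be probed, though not ruled out, with a simple empirical check. The sketch below computes a distance-to-closest-record score: for each synthetic row, the standardized Euclidean distance to the nearest original row. This is a heuristic screen for copied or near-copied records, not a formal privacy guarantee, and the function name and toy arrays are hypothetical.

```python
import numpy as np

def distance_to_closest_record(synthetic, original):
    """For each synthetic row, return the Euclidean distance to its
    nearest original row, after standardizing both arrays with the
    original data's column means and standard deviations."""
    mu = original.mean(axis=0)
    sd = original.std(axis=0)
    sd = np.where(sd > 0, sd, 1.0)  # avoid dividing by zero on constant columns
    s = (synthetic - mu) / sd
    o = (original - mu) / sd
    # Pairwise differences, shape (n_synthetic, n_original, n_columns).
    diff = s[:, None, :] - o[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))
    return dists.min(axis=1)

# Toy comparison: a "fresh" synthetic file versus one that leaks 10 real rows.
rng = np.random.default_rng(0)
original = rng.normal(size=(200, 3))
fresh = rng.normal(size=(100, 3))
leaky = np.vstack([fresh[:90], original[:10]])

dcr_fresh = distance_to_closest_record(fresh, original)
dcr_leaky = distance_to_closest_record(leaky, original)

# Share of synthetic rows that coincide with an original row.
print((dcr_fresh < 1e-8).mean(), (dcr_leaky < 1e-8).mean())
```

A nonzero share of exact or near-exact matches is a warning sign. A zero share does not show that the release is safe, because attribute disclosure and membership inference can succeed without any copied record.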

Frequently asked

How many synthetic replicates (M) do I need?

The combining rules, which adapt Rubin's multiple-imputation rules to synthetic data, need at least M = 2 replicates to estimate between-replicate variance. In practice, M = 5 to 20 is common: a smaller M leaves more synthesis noise in the combined estimate, which widens intervals and reduces statistical power, while a very large M offers diminishing returns. The best choice depends on how much of the total variance comes from synthesis uncertainty relative to within-replicate sampling variability.
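
As a sketch of how the replicates are combined, the function below takes the M point estimates and their within-replicate variances and returns the pooled estimate and its variance. The formulas are the ones usually attributed to Raghunathan, Reiter and Rubin (2003) for fully synthetic data and to Reiter (2003) for partially synthetic data; they are not quoted on this page, so treat them as an assumption to check against the release documentation. The function name and the example numbers are hypothetical.

```python
import numpy as np

def combine_synthetic(q, u, fully_synthetic):
    """Pool M replicate estimates.
    q: point estimates, one per replicate
    u: their estimated variances, one per replicate
    Returns (pooled estimate, estimated total variance)."""
    q = np.asarray(q, dtype=float)
    u = np.asarray(u, dtype=float)
    m = q.size
    if m < 2:
        raise ValueError("at least M = 2 replicates are needed")
    q_bar = q.mean()     # pooled point estimate
    b = q.var(ddof=1)    # between-replicate variance
    u_bar = u.mean()     # average within-replicate variance
    if fully_synthetic:
        # Fully synthetic rule; this estimate can come out negative for small M.
        total_var = (1 + 1 / m) * b - u_bar
    else:
        # Partially synthetic rule.
        total_var = u_bar + b / m
    return q_bar, total_var

# Made-up estimates of a mean from M = 5 replicates.
q = [10.2, 9.8, 10.5, 10.1, 9.9]
u = [0.04] * 5
print(combine_synthetic(q, u, fully_synthetic=False))
print(combine_synthetic(q, u, fully_synthetic=True))
```

The two branches differ only in the variance formula, which is why a release must state whether it is fully or partially synthetic. Using the wrong rule can badly misstate uncertainty.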

Is synthetic data the same as anonymized data under GDPR?

Not automatically, and regulators generally treat synthetic data cautiously. If the synthesis model memorizes or closely reproduces individual records, the output may still count as personal data. Anonymization under GDPR requires that individuals no longer be identifiable by any means reasonably likely to be used; synthetic data meets this bar only when a low risk of re-identification, including membership inference, can be demonstrated, often through formal differential privacy guarantees layered on top of the synthesis procedure.

Can I use deep generative models instead of Bayesian regression synthesizers?

Yes. GANs, VAEs, and diffusion models have been applied as synthesizers and can capture highly nonlinear dependencies that parametric models miss. However, they lack the principled posterior-sampling interpretation underpinning Rubin's combining rules, so additional work is needed to quantify synthesis uncertainty and ensure that releasing M replicates from a GAN yields valid inferences rather than M near-identical copies of the same generated distribution.

Sources

Rubin, D. B. (1993). Statistical disclosure limitation. Journal of Official Statistics, 9(2), 461–468.

How to cite this page

ScholarGate. (2026, June 2). Synthetic Data Generation for Disclosure Control. ScholarGate. https://scholargate.app/en/privacy/synthetic-data-generation

Which method?

Closest related methods for side-by-side comparison: Differential Privacy (Privacy), Generative Adversarial Network (Deep learning), Multiple Imputation (Statistics)

Referenced by

Differential Privacy, Disclosure Risk Assessment, k-Anonymity

Related reference concepts

Reproducible Research, Statistical Methods in Evidence Synthesis, Missing Data and Attrition, EM Algorithm, Deep Generative Models, Empirical Bayes Methods
