Abstract
Differentially private synthetic data enables data sharing without compromising individual privacy, but DP-SGD adds noise that can destroy utility when training data is scarce. We map a sharp transition in DP-SGD performance for an MLP variational autoencoder across six tabular datasets from healthcare, census and ecology. The ratio of training samples to encoded dimensions (N/d) predicts viability. Below N/d ≈ 50 the synthetic data carries no usable signal. Above N/d ≈ 300 utility consistently approaches non-private baselines. The cost of stricter privacy is sublinear. On Adult, ε = 1 needs about 2.5× the data of ε = 10. On NHANES, ε = 1 never reaches viability with the data available. Within the MLP VAE family, parameter count does not move the boundary on Adult. Other DP mechanisms such as the marginal-based AIM can be viable at N/d two orders of magnitude lower, so these boundaries are specific to DP-SGD.
Key findings
- The N/d ratio predicts viability for DP-SGD. Across six tabular datasets from healthcare, census and ecology, the ratio of training samples to encoded feature dimensions explains when DP-SGD synthesis works. Below N/d ≈ 50 no dataset yields usable synthetic data. Above N/d ≈ 300 every dataset is viable.
- The transition is sharp. Below threshold, downstream TSTR-AUC is near-random. Above it, performance recovers most of the non-private baseline signal.
- Privacy is sublinearly expensive where it works. On Adult, tightening ε from 10 to 1 needs about 2.5× more data. That is well under the DP-ERM bound of 4.6× and the per-step SNR estimate of about 20×. On NHANES, ε = 1 never reaches viability with the data available, so the 2.5× figure is dataset-dependent.
- Parameter count does not move the boundary inside the MLP VAE family. Three MLP VAEs spanning a 14× parameter range transition in the same N bracket on Adult. The paper does not vary architecture family.
- Reduce d before chasing N. When N/d sits below threshold, dimensionality reduction is at least as valuable as gathering more data. Binning ICD-9 codes on Diabetes-130 from 2,475 to 213 features is what made DP-SGD viable on that dataset.
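As a rough worked example of why reducing d helps: for a fixed number of training samples N, the ratio N/d scales inversely with d, so the Diabetes-130 binning from 2,475 to 213 features multiplies N/d by the factor below. This is only arithmetic on the feature counts quoted above. It assumes N is unchanged by the binning and that the feature counts equal the encoded dimension d; the labels d_raw and d_binned are introduced here for illustration.

```latex
\[
\frac{N/d_{\text{binned}}}{N/d_{\text{raw}}}
  = \frac{d_{\text{raw}}}{d_{\text{binned}}}
  = \frac{2475}{213}
  \approx 11.6
\]
```

For comparison, the gap between the two reported thresholds is 300/50 = 6×, so a change of this size in d alone is wider than the whole transition band.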
Why it matters
DP-synthetic data is often pitched as a route around data-sharing constraints in healthcare. The field has lacked a clean way to predict when DP-SGD will produce useful output, so practitioners burn compute on configurations that have no chance of working. For DP-SGD with MLP VAEs, the viability boundary reduces that prediction to a quick check: compute N/d and compare it to the thresholds. Below the lower threshold, expect collapse, and switch mechanisms, reduce d, or both.
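The quick check described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the "transition" label for the band between the two thresholds are assumptions introduced here, and the thresholds themselves are the approximate values reported for DP-SGD with an MLP VAE, so they should not be applied to other mechanisms.

```python
# Approximate thresholds reported in this summary for DP-SGD with an MLP VAE.
LOWER_ND = 50.0   # below this, no usable signal was observed
UPPER_ND = 300.0  # above this, utility approached non-private baselines


def nd_ratio(n_samples: int, encoded_dim: int) -> float:
    """Return N/d: training samples per encoded feature dimension."""
    if n_samples <= 0 or encoded_dim <= 0:
        raise ValueError("n_samples and encoded_dim must be positive")
    return n_samples / encoded_dim


def viability_zone(n_samples: int, encoded_dim: int) -> str:
    """Classify a configuration against the two reported thresholds.

    The zone names are hypothetical labels chosen for this sketch.
    """
    ratio = nd_ratio(n_samples, encoded_dim)
    if ratio < LOWER_ND:
        return "collapse"      # expect near-random downstream performance
    if ratio > UPPER_ND:
        return "viable"        # expect most of the non-private signal
    return "transition"        # between thresholds: outcome is uncertain


if __name__ == "__main__":
    # Hypothetical dataset sizes, for illustration only.
    for n, d in [(10_000, 400), (10_000, 100), (10_000, 20)]:
        print(n, d, round(nd_ratio(n, d), 1), viability_zone(n, d))
```

A result of "collapse" is the signal to reduce d or consider a different DP mechanism before spending compute on training.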
The sublinear privacy cost matters for the budgets allowed by different regulatory regimes. Theory predicts ε = 1 should be much more expensive than ε = 10. On Adult the empirical cost is about 2.5×, comfortably below the DP-ERM bound of 4.6×. On NHANES even ε = 2 sits at the edge. The boundary itself is also mechanism-specific. Marginal-based methods like AIM reach viability at N/d below 1 on Adult, around two orders of magnitude lower than DP-SGD. If DP-SGD is collapsing, the answer is sometimes a different DP mechanism, not more data.
Scope and limitations
The cross-dataset sweep maps the boundary for an MLP variational autoencoder trained with DP-SGD via Opacus. The ε sweep covers Adult, ACS Income and NHANES. The model-size and dimension-reduction experiments each cover one dataset. The viability threshold of 50% of the non-private signal is a chosen cutoff, not a fundamental quantity. TSTR-AUC is the sole utility metric. PATE-based mechanisms, pre-trained generative models and other DP-ERM families are not tested.
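To make the chosen cutoff concrete, the sketch below shows one way a "50% of the non-private signal" rule could be applied to TSTR-AUC scores. It assumes that signal is measured as AUC above the 0.5 chance level; the summary does not spell out that definition, so treat it, along with the function names, as an assumption of this example.

```python
CHANCE_AUC = 0.5        # assumption: a random classifier marks zero signal
VIABILITY_CUTOFF = 0.5  # the chosen cutoff: 50% of the non-private signal


def signal_retained(dp_auc: float, baseline_auc: float) -> float:
    """Fraction of the non-private signal kept by the DP synthetic data."""
    baseline_signal = baseline_auc - CHANCE_AUC
    if baseline_signal <= 0:
        raise ValueError("baseline AUC must exceed chance level")
    return (dp_auc - CHANCE_AUC) / baseline_signal


def is_viable(dp_auc: float, baseline_auc: float) -> bool:
    """Apply the cutoff to a pair of TSTR-AUC scores."""
    return signal_retained(dp_auc, baseline_auc) >= VIABILITY_CUTOFF


if __name__ == "__main__":
    # Hypothetical AUC values, for illustration only.
    print(is_viable(0.75, 0.90))  # retains 0.25 / 0.40 of the signal
    print(is_viable(0.55, 0.90))  # retains 0.05 / 0.40 of the signal
```

Because the cutoff is a convention, changing VIABILITY_CUTOFF shifts where a given configuration is declared viable without changing the underlying scores.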
Cite
@inproceedings{kolbeinsson2026viability,
title = {The Viability Boundary of Differential Privacy},
author = {Kolbeinsson, Arinbj{\"o}rn and Kolbeinsson, Benedikt},
booktitle = {ICLR 2026 Workshop on Data Problems for Foundation Models (DATA-FM)},
year = {2026},
url = {https://openreview.net/forum?id=Pv3PSfaphM}
}