Multimodal Alignment for Synthetic Clinical Time Series

Kolbeinsson, Arinbjörn; Kolbeinsson, Benedikt

Abstract

Synthetic ICU time series need to preserve cross-modal physiological relationships, not just marginal statistics. We compare three autoregressive conditional-mean models on the PhysioNet 2019 sepsis dataset (about 40,000 ICU patients, 40 features, hourly). The three models (an AR-only baseline, ridge regression and a gradient-boosted tree) share an AR(1) residual with Ledoit–Wolf-regularised innovations. Conditional MMD improves monotonically with complexity. Clinical-rule violation rates do not. Fever co-occurring with tachycardia is produced at about twice the real rate, and the gap widens with model complexity. An ablation that reorders features so that temperature precedes heart rate improves per-feature trend consistency but leaves the cross-feature rule essentially unchanged. The pattern suggests that cross-modal failure is architectural rather than a missing-signal or ordering problem.

Key findings

Conditional MMD improves with complexity. Conditioned on gender, distance to real data drops from 0.351 (AR-only) to 0.328 (ridge) to 0.304 (GBM). Statistical similarity follows model complexity in the expected direction.
Clinical rules move the wrong way. Real ICU data shows fever co-occurring with tachycardia about 33% of the time. All three models over-produce this co-occurrence: AR-only 57%, ridge 69%, GBM 69%. For the stricter high-fever rule the gap widens further. The training data carries a strong fever-to-HR signal (Cohen’s d = 0.84), so the failure does not stem from a missing signal.
Reordering helps per-feature metrics, not cross-feature ones. Moving temperature ahead of heart rate substantially improves per-feature trend consistency for temperature (ridge: +0.45 to +0.75), but leaves the fever-to-tachycardia rule essentially unchanged (ridge: 0.689 to 0.697). Conditional MMD is also nearly invariant to ordering.
The pattern points to architecture. The chosen autoregressive conditional-mean class, with its AR(1) residual, captures marginals well but consistently misses joint conditional structure across modalities. This is suggestive, not proof: only three model variants and three orderings were tested, all on a single dataset.
Future work, not yet tested. Causal-discovery-driven orderings, and joint per-timestep generation, are the directions we point to. Neither is evaluated here.
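
The two quantities behind the clinical-rule finding, a co-occurrence rate and Cohen's d, can be sketched as follows. This is a minimal illustration, not the paper's evaluation code. It assumes the rule is read as the share of febrile hours that are also tachycardic, and the thresholds (38.0 °C, 100 bpm), function names and toy values are hypothetical choices made for the example.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical thresholds, assumed for illustration only; the rule
# definitions used in the paper are not given in this summary.
FEVER_C = 38.0      # temperature in degrees Celsius
TACHY_BPM = 100.0   # heart rate in beats per minute


def cooccurrence_rate(temps, hrs, fever=FEVER_C, tachy=TACHY_BPM):
    """Fraction of febrile hours that are also tachycardic.

    Hours where either value is missing (None) are skipped.
    Returns None when there are no febrile hours.
    """
    febrile = [(t, h) for t, h in zip(temps, hrs)
               if t is not None and h is not None and t >= fever]
    if not febrile:
        return None
    return sum(h > tachy for _, h in febrile) / len(febrile)


def cohens_d(group_a, group_b):
    """Cohen's d using a pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) +
                  (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Toy values invented for the example (not data from the paper).
temps = [36.8, 38.4, 39.1, 37.0, 38.2, None]
hrs = [82, 112, 96, 75, 104, 90]

rate = cooccurrence_rate(temps, hrs)
febrile_hr = [h for t, h in zip(temps, hrs) if t is not None and t >= FEVER_C]
afebrile_hr = [h for t, h in zip(temps, hrs) if t is not None and t < FEVER_C]
d = cohens_d(febrile_hr, afebrile_hr)
```

Computing the same rate on real and synthetic cohorts and comparing the two numbers gives the kind of gap reported above. The effect size is computed on the real data only, as a check that the fever-to-HR signal is present to be learned.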

Why it matters

Synthetic clinical time series are increasingly used to substitute for real records that cannot be shared. To be useful for any downstream clinical task they need to look real in the right ways. Statistical resemblance at the feature level is not enough. The relationships between features are where clinical meaning lives.

Three autoregressive conditional-mean models, each at a different complexity, all fail to match the real rate of fever co-occurring with tachycardia. Adding complexity widens the gap rather than narrowing it. The signal is there in the training data, but the class of models tested cannot use it to reproduce the joint behaviour of the two features. We do not claim this is true of every autoregressive model. We do suggest that without changing the inductive bias, more capacity will not help.
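
One way to make the inductive-bias point concrete is to write out a factorisation consistent with the description above. This is a sketch under our own assumptions, not the paper's stated formulation. It assumes that features within an hour are generated one at a time in a fixed order pi, that each conditional mean f_j sees the features earlier in that order plus the previous hour, and that the shared residual r_t follows a per-feature AR(1) with coefficient vector phi and an innovation covariance given by the Ledoit–Wolf estimate. All symbols are introduced here for illustration.

```latex
\begin{align}
x_{t,\pi(j)} &= f_j\bigl(x_{t,\pi(1)}, \dots, x_{t,\pi(j-1)},\; x_{t-1}\bigr) + r_{t,\pi(j)}, \\
r_t &= \phi \odot r_{t-1} + \varepsilon_t, \qquad \operatorname{Cov}(\varepsilon_t) = \hat{\Sigma}_{\mathrm{LW}}.
\end{align}
```

Under this reading, moving temperature ahead of heart rate changes which inputs the heart-rate mean can see, but the two features are still coupled only through a conditional mean plus a residual. A joint per-timestep generator would instead model all features of x_t at once.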

Scope and limitations

A single dataset (PhysioNet/CinC Challenge 2019). One conditioning variable for MMD (gender). Ten manually defined clinical rules. Three feature orderings. No comparison to deep generative baselines such as GANs, VAEs or diffusion models. No downstream prediction task is evaluated. Findings are framed as patterns and suggestions, not proofs.
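
For readers unfamiliar with the metric, a conditional MMD with one discrete conditioning variable can be sketched by computing MMD within each stratum and averaging. The version below is an assumption-laden illustration, not the paper's implementation: the RBF kernel, the bandwidth `gamma`, the biased estimator, the size-weighted average and all function names are our choices.

```python
from math import exp


def rbf(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy


def conditional_mmd2(real, synth, gamma=1.0):
    """Average squared MMD over strata, weighted by real stratum size.

    `real` and `synth` map a condition label (for example gender) to a
    list of feature vectors. Strata empty or missing on either side are
    skipped; an error is raised if no stratum is shared.
    """
    shared = [c for c in real if c in synth and real[c] and synth[c]]
    if not shared:
        raise ValueError("no shared non-empty strata")
    total = sum(len(real[c]) for c in shared)
    return sum(len(real[c]) * mmd2(real[c], synth[c], gamma)
               for c in shared) / total


# Toy feature vectors invented for the example (not data from the paper).
real = {"F": [[0.0, 0.0], [1.0, 0.5]],
        "M": [[0.2, 0.1], [0.9, 0.7], [0.4, 0.3]]}
synth = {"F": [[2.0, 2.0], [3.0, 2.5]],
         "M": [[0.2, 0.1], [0.9, 0.7], [0.4, 0.3]]}

same = conditional_mmd2(real, real)      # identical samples
shifted = conditional_mmd2(real, synth)  # "F" stratum is shifted
```

Extending the conditioning set beyond a single variable would mean more strata with fewer patients in each, which is one practical reason a study might start with a single coarse variable.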

Cite

@inproceedings{kolbeinsson2025multimodal,
  title     = {Multimodal Alignment for Synthetic Clinical Time Series},
  author    = {Kolbeinsson, Arinbj{\"o}rn and Kolbeinsson, Benedikt},
  booktitle = {EurIPS 2025 Workshop on Multimodal Representation Learning for Healthcare (MMRL4H)},
  year      = {2025},
  url       = {https://openreview.net/pdf?id=OrbfmFx1G8}
}
