From left, Bradley Malin, PhD; Bryan Shepherd, PhD (via video call); Zhuohui Liang; and Chao Yan, PhD, created artificial health records that statistically resemble those of real people living with HIV without including copies of real records. (photo by Erin O. Smith)
A team led by researchers at Vanderbilt Health used artificial intelligence (AI) to create nearly 50,000 simulated people living with HIV, as reported in Nature Communications. The team’s AI model, named Medical Longitudinal Latent Diffusion, or MeLD, uses as its generative backbone a diffusion transformer, an algorithm developed for synthetic image generation.
“In short, it’s a two-step process: compress records of real people living with HIV into simpler representations, capturing both plain and hidden dimensions, then use a transformer-based model to learn how clinical histories evolve over time and generate new ones,” said corresponding author Chao Yan, PhD, Research Instructor in the Department of Biomedical Informatics. “As we’ve applied it, the diffusion transformer succeeds by essentially learning the grammar of the disease course in people with HIV.”
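As a rough illustration of the two steps Yan describes, the sketch below compresses toy records into a small latent space and then applies the standard forward (noising) step of a diffusion process to those latents. Everything here is an assumption made for illustration: the data are random numbers, PCA stands in for whatever learned encoder MeLD uses, and the noise schedule is a common textbook default rather than the team's setting. The transformer that would learn to reverse the noise is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 visits x 6 clinical features (random, not patient data).
visits = rng.normal(size=(200, 6))

# Step 1 (stand-in): compress each visit to 2 latent dimensions with PCA.
# A real latent diffusion model would use a learned encoder instead.
mean = visits.mean(axis=0)
_, _, vt = np.linalg.svd(visits - mean, full_matrices=False)
components = vt[:2]                        # shape (2, 6)
latents = (visits - mean) @ components.T   # shape (200, 2)

# Step 2 (forward half only): a linear noise schedule, assumed for illustration.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

def add_noise(z0, t, rng):
    """Sample z_t = sqrt(alpha_bar[t]) * z0 + sqrt(1 - alpha_bar[t]) * eps."""
    eps = rng.normal(size=z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return z_t, eps

# At the last step the latents are almost pure noise; a generative model is
# trained to predict eps from (z_t, t) so that the process can be run backward.
noisy, eps = add_noise(latents, t=999, rng=rng)
```

Generating a new record would then mean starting from noise, denoising step by step in the latent space, and decoding the result back to clinical features.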
The study of HIV is slowed by a hodgepodge of international regulations on sharing clinical data. The team’s aim is to broadly aid and stimulate the study of HIV, anticipating privacy concerns by creating artificial records that statistically resemble those of real people living with HIV without including copies of real records.
“Particularly in international settings where data sharing is becoming more complicated, the rate of discovery is being impeded by sensitivities around HIV and legitimate privacy concerns,” said HIV/AIDS researcher Bryan Shepherd, PhD, Professor of Biostatistics and Biomedical Informatics and corresponding author. “High-quality simulation attempts to address this difficulty.”
The team began with records from CCASAnet, a U.S. National Institutes of Health-funded consortium that harmonizes clinical care data from HIV clinics in Argentina, Brazil, Chile, Haiti, Honduras, Mexico and Peru. Data used by the team included up to 36 years of follow-up on some 49,600 people living with HIV. The team had to overcome challenges posed by the chronic nature of HIV, variability in record length, diverse clinical trajectories, and high missingness (gaps where measurements weren’t recorded).
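Two of those challenges, variable record length and missing measurements, are often handled by padding every record to a common length and keeping a separate mask of what was actually observed. The sketch below shows that general idea only; the record layout, the two features per visit, and the helper name `to_padded` are hypothetical and are not taken from the study.

```python
import numpy as np

# Hypothetical records: each person has a different number of visits, and
# None marks a measurement that was not recorded at that visit.
records = {
    "person_a": [[350.0, 4.1], [410.0, None], [None, 2.0]],
    "person_b": [[120.0, None]],
}

def to_padded(records, n_features):
    """Pad records to a common length and return (values, observed mask)."""
    max_len = max(len(visits) for visits in records.values())
    values = np.zeros((len(records), max_len, n_features))
    observed = np.zeros_like(values, dtype=bool)
    for i, visits in enumerate(records.values()):
        for j, visit in enumerate(visits):
            for k, x in enumerate(visit):
                if x is not None:
                    values[i, j, k] = x
                    observed[i, j, k] = True
    return values, observed

values, observed = to_padded(records, n_features=2)

# Share of measurements missing among visits that actually took place.
n_cells = sum(len(visits) * 2 for visits in records.values())
missing_rate = 1.0 - observed.sum() / n_cells
```

Keeping the mask separate from the values lets a model distinguish a measurement that was never taken from one that was recorded as zero.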
Prior synthetic-patient work has shown a utility-versus-privacy tradeoff, in which the best-performing generator leaks the most private information.
“MeLD breaks that pattern, topping nearly every utility and fidelity metric while keeping low privacy risk,” said lead author Zhuohui Liang, MS, a PhD candidate in Biostatistics.
The team benchmarked MeLD against five other state-of-the-art methods, testing how faithfully each one’s synthetic people reproduced the real cohort’s survival patterns, statistical properties, and disease-prediction signals.
They found:
- MeLD reproduced known mortality risk-factor relationships far more reliably than other methods tested.
- MeLD’s survival curves matched the real data most closely.
- MeLD generally led on the statistical-resemblance metrics.
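One common way to check how closely synthetic survival curves match real ones is to compute a Kaplan-Meier estimate for each cohort and measure the largest gap between the two curves. The sketch below does this on tiny made-up cohorts. It is an assumption about how such a comparison could be run, not the metric reported in the study, and the helper names are hypothetical.

```python
import numpy as np

def kaplan_meier(times, events):
    """Return distinct event times and the survival probability after each."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)             # still under follow-up at t
        deaths = np.sum((times == t) & events)   # deaths occurring at t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)

def survival_at(grid, event_times, surv):
    """Evaluate the Kaplan-Meier step function on a grid of time points."""
    idx = np.searchsorted(event_times, grid, side="right")
    return np.concatenate(([1.0], surv))[idx]

# Hypothetical follow-up in years; event 1 means death, 0 means censored.
real_t, real_e = [2, 3, 3, 5, 8], [1, 1, 0, 1, 0]
synth_t, synth_e = [1, 3, 4, 6, 8], [1, 0, 1, 1, 0]

grid = np.linspace(0, 8, 81)
real_curve = survival_at(grid, *kaplan_meier(real_t, real_e))
synth_curve = survival_at(grid, *kaplan_meier(synth_t, synth_e))

# Largest vertical distance between the two curves; smaller means closer.
gap = np.max(np.abs(real_curve - synth_curve))
```

A generator whose synthetic cohort yields a smaller gap reproduces the real cohort's survival pattern more faithfully under this particular measure.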
“While sharing observational datasets is essential to enabling new hypotheses leading to potential cures, typical de-identification methods tend to reduce data fidelity and research utility, and their privacy methods may or may not pass muster with international groups,” said corresponding author Bradley Malin, PhD, an expert on computational aspects of privacy, Professor of Biomedical Informatics, Computer Science, and Biostatistics and holder of the Accenture Chair. “Our study shows data simulation to be a powerful alternative.”
Others from Vanderbilt on the study include Zhuohang Li, PhD, Nicholas Jackson, Kevin Guo, Jessica Perkins, PhD, Amir Asiaee, PhD, and Stephany Duda, PhD. They were joined by researchers from CCASAnet institutions in Latin America and Switzerland. The study was supported by the National Institutes of Health (R01MH139379, U01AI069923, K99LM014428, P30AI110527).