A multicenter study reported April 6 in eBioMedicine tests large language models for detecting immune-related adverse events using models from OpenAI, headquartered in San Francisco. (iStock)

Drug safety signals often lurk in clinical notes and other text in electronic health records (EHRs). Finding them requires either costly manual chart abstraction or natural language processing with software tuned to a specific drug as documented at a specific center. This is the case with immune checkpoint inhibitors (ICIs), a type of cancer drug that first came to market in 2011. ICIs can cause a variety of immune-related adverse events, or irAEs, affecting the colon, liver, lungs, heart, nervous system, skin, and endocrine system.
Large language models (LLMs), an increasingly inescapable form of artificial intelligence (AI), have come under study as a way to speed identification of drug safety signals buried in text. A multicenter study reported April 6 in eBioMedicine tests LLMs for detecting irAEs using models from OpenAI, headquartered in San Francisco.
The investigators use so-called zero-shot learning, in which an LLM is given a single detailed prompt with no examples. The prompt developed by the team (“You are a clinical expert in identifying immune-related adverse events caused by immune checkpoint inhibitors …”) includes a list of six ICIs and dozens of their irAEs. The prompt is applied to randomly selected clinical notes of patients exposed to ICIs at Vanderbilt Health (100 patients) and the University of California, San Francisco (UCSF; 70 patients), and to notes from seven ICI trials (272 patients) sponsored by Roche, a pharmaceutical company based in Basel, Switzerland.
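To make the zero-shot idea concrete, here is a minimal Python sketch of how such a prompt might be assembled: instructions and lists only, with no labeled example notes. Only the opening role sentence comes from the quoted prompt; the function name `build_zero_shot_prompt`, the remaining wording, and the drug and event names in the usage lines are hypothetical illustrations, not the study's actual prompt or lists.

```python
# Opening sentence as quoted in the article; the rest of the study's prompt
# is not reproduced here, so everything below it is an assumed illustration.
ROLE_LINE = (
    "You are a clinical expert in identifying immune-related adverse events "
    "caused by immune checkpoint inhibitors."
)


def build_zero_shot_prompt(note_text, drugs, events):
    """Assemble one instruction-only prompt; no example notes are included."""
    lines = [
        ROLE_LINE,
        "Immune checkpoint inhibitors of interest: " + ", ".join(drugs) + ".",
        "Immune-related adverse events to look for: " + ", ".join(events) + ".",
        "List each adverse event from the list that the note documents, "
        "or answer 'none'.",
        "Clinical note:",
        note_text.strip(),
    ]
    return "\n".join(lines)


# Hypothetical usage: the drug and event names are placeholders for
# illustration, not the six ICIs or the irAE list used in the study.
prompt = build_zero_shot_prompt(
    "Patient on nivolumab with new diarrhea.",
    drugs=["nivolumab", "ipilimumab"],
    events=["colitis", "hepatitis", "pneumonitis"],
)
print(prompt)
```

The resulting string would then be sent to the model once per note. Because nothing in it is tuned to a particular center's documentation style, the same prompt can in principle be reused across sites.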
“Manual patient chart abstraction for monitoring the safety and efficacy of drugs already at market requires tremendous resources and puts a drag on the pace of discovery in precision medicine. And that’s especially true with immune checkpoint inhibitors, where the adverse events are so varied. If zero-shot learning with LLMs could help with these notes, it could significantly reduce time and costs for all concerned,” said the report’s corresponding author, Cosmin Bejan, PhD, assistant professor of Biomedical Informatics at Vanderbilt Health.
The team studied three LLMs, GPT-3.5, GPT-4, and GPT-4o, with the last providing the best performance. As the main performance measure the team used the F1 score, the harmonic mean of precision and recall, which ranges from 0 to 100% and is sensitive to both false positives and false negatives. An F1 score of 90% or more is considered excellent, and a predictive model with a score of 80% or above might qualify for driving automated clinical decision support.
For detection of irAEs at the patient level, average F1 scores from GPT-4o across Vanderbilt EHRs, UCSF EHRs, and Roche trial notes were 56%, 66%, and 62%, respectively. The models showed a systematic bias toward overpredicting irAEs. For detection of 17 irAEs at the level of single notes (with GPT-4o working on 667 notes from Vanderbilt), the average F1 score was 57%.
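For readers unfamiliar with the metric, the short Python sketch below computes precision, recall, and F1 from counts of true positives, false positives, and false negatives. The function name and the counts in the usage lines are invented for illustration and are not taken from the study; they simply show how a model that flags too many events loses precision and, with it, F1.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts.

    tp: events the model flagged that were real
    fp: events the model flagged that were not real (overprediction)
    fn: real events the model missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 40 correct detections, 40 false alarms, 10 misses.
p, r, f1 = precision_recall_f1(tp=40, fp=40, fn=10)
print(f"precision {p:.2f}, recall {r:.2f}, F1 {f1:.2f}")
# prints: precision 0.50, recall 0.80, F1 0.62
```

In this made-up case the model finds 80% of the real events, but half of what it flags is wrong, so F1 lands near 62%.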
“These results show that zero-shot learning with a powerful LLM is useful for detecting these adverse events,” Bejan said. “This performance does not rise to the level required for clinical decision support, but the method could be valuable for automated irAE extraction across multiple sites, potentially speeding discovery and enhancing the safety and effectiveness of cancer immunotherapies.”
Others from Vanderbilt on the study include Yaomin Xu, PhD, Eric Mukherjee, MD, PhD, Matthew Krantz, MD, Douglas Johnson, MD, MSCI, Elizabeth Phillips, MD, and Justin Balko, PhD. The study was supported in part by the National Institutes of Health under awards R01CA227481 and R01HL156021.
On a related note, in a research letter last December in JAMA Oncology, Mukherjee, Phillips and colleagues, using logistic regression with adverse event reports collected by the Food and Drug Administration, confirmed that ICIs were independently associated with increased risk of the dangerous skin reaction SJS/TEN (Stevens-Johnson syndrome/toxic epidermal necrolysis) and found that this increased risk sometimes occurs in association with patient exposure to human leukocyte antigen–restricted drugs.