Tech & Health

April 4, 2019

Report seeks to streamline EHR de-identification

by Paul Govern

Over the past few decades the electronic health record (EHR) has become an object of intensive study, opening new ground in biomedical research. Natural language sections of the EHR, such as physicians’ notes and health team messages, are a rich vein for research, but patient privacy considerations require that patient identifiers first be scrubbed from these notes and messages. Historically, this has been accomplished through large, complex software systems that are expensive to develop and maintain.

A recent paper by researchers at Vanderbilt University and Vanderbilt University Medical Center sets out a solution to streamline EHR de-identification. The report, by Muqun (Rachel) Li, PhD, Bradley Malin, PhD, and colleagues, received the Best Data Science Paper Award at the American Medical Informatics Association 2019 Informatics Summit, which concluded March 28 in San Francisco.

The team devised and tested a new supervised machine learning pipeline for natural language de-identification. Any such strategy requires training sets of EHR notes and messages in which any patient identifiers have been carefully tagged as such by human reviewers.

This annotation step is where the expense-reduction opportunity lies. The new pipeline applies a variation of machine learning, called active learning, to the key task of training-data selection. Instead of relying on passive learning from end to end, that is, iterative random selection of records for annotation, the new pipeline prioritizes which records to learn from next. The smarter the prioritization, the less human annotation is required and the lower the cost of de-identification.
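To make the contrast with random selection concrete, here is a minimal Python sketch of one common active learning heuristic, uncertainty sampling. It is an illustration only: the article does not say which prioritization strategy the team used, and the function names, record IDs and probabilities below are hypothetical. The sketch assumes a model that has already assigned each token in an unlabeled record a probability of being a patient identifier, and it sends the records the model is least sure about to human annotators first.

```python
import math


def entropy(p):
    """Binary entropy (in bits) of the probability that a token is an identifier."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # the model is certain, so there is no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def record_uncertainty(token_probs):
    """Average token-level entropy for one record (0.0 for an empty record)."""
    if not token_probs:
        return 0.0
    return sum(entropy(p) for p in token_probs) / len(token_probs)


def select_batch(unlabeled, batch_size):
    """Return the IDs of the most uncertain records, most uncertain first.

    `unlabeled` maps a record ID to the model's per-token identifier
    probabilities. Passive learning would instead draw IDs at random.
    """
    ranked = sorted(
        unlabeled,
        key=lambda record_id: record_uncertainty(unlabeled[record_id]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical pool of unlabeled notes with made-up model outputs.
pool = {
    "note_a": [0.01, 0.02, 0.99],  # model is confident about every token
    "note_b": [0.50, 0.45, 0.60],  # model is unsure about every token
    "note_c": [0.10, 0.90, 0.05],  # somewhere in between
}

next_to_annotate = select_batch(pool, batch_size=2)
print(next_to_annotate)
```

In a full loop, the selected records would be annotated by human reviewers, added to the training set, and the model retrained before the next batch is chosen.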

Malin is professor of Biomedical Informatics, Biostatistics and Computer Science and director of the Health Information Privacy Laboratory. Li, formerly a doctoral student in Malin’s lab, is now with the technology company Privacy Analytics, based in Ottawa, Canada. They were joined in the paper by researchers from Privacy Analytics and Google. The study began in 2017, while Li was still with Malin’s lab.