In 1906, English statistician Francis Galton happened to visit a livestock fair where fairgoers were invited to guess the dressed weight of an ox scheduled for imminent slaughter. Some 800 attendees took part, and Galton afterwards got hold of the contest data.
This episode, which Galton reported in Nature, has become the subject of popular retellings, such as in James Surowiecki’s 2004 book, “The Wisdom of Crowds.” At 1,197 pounds, the average of all the fairgoers’ guesses came within a single pound of the ox’s actual dressed weight.
A new study from Vanderbilt University Medical Center suggests that Galton’s finding of a century ago might have implications for dermatological research and clinical evaluation.
For many diseases involving the skin, research into causes and cures requires reliably isolating and quantifying the proportion of affected skin, one research subject after another, and the more subjects the better.
This is achieved with medical photography, a computer monitor and a mouse: a research dermatologist carefully drags out the boundaries of affected areas on screen, and with the areas of interest highlighted, software performs the final step of quantifying the proportion of affected skin.
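Conceptually, that final quantification step reduces to a pixel count over the demarcated region. The following is a minimal Python sketch, not the authors' software; the mask conventions and function name are assumptions for illustration.

```python
# Minimal sketch (not the study's software): once affected areas have been
# demarcated as a binary mask, the "proportion of affected skin" is a pixel ratio.
import numpy as np

def affected_proportion(lesion_mask: np.ndarray, skin_mask: np.ndarray) -> float:
    """Fraction of visible skin pixels marked as affected.

    lesion_mask -- boolean array, True where the dermatologist highlighted disease
    skin_mask   -- boolean array, True wherever skin (not background) is visible
    """
    skin_pixels = np.count_nonzero(skin_mask)
    if skin_pixels == 0:
        return 0.0
    affected = np.count_nonzero(lesion_mask & skin_mask)
    return affected / skin_pixels

# Toy 4x4 example: 12 visible skin pixels, 3 of them highlighted -> 0.25
skin = np.array([[1, 1, 1, 0],
                 [1, 1, 1, 0],
                 [1, 1, 1, 0],
                 [1, 1, 1, 0]], dtype=bool)
lesion = np.zeros_like(skin)
lesion[0, :3] = True
print(affected_proportion(lesion, skin))  # 0.25
```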
Massive sets of relevant medical photographs are available for research, filed away in hospitals and clinics, but “the time and expense involved in having experts endlessly pore over these images is a major impediment, and from one study or one expert to the next the consistency in the application of the relevant visual evaluation scales tends to be poor,” said Eric Tkaczyk, MD, PhD, assistant professor of Dermatology and Biomedical Engineering.
Artificial intelligence is poised to provide, at a fraction of the cost, speedy and consistent automated interpretation of such images. However, machine learning for these evaluations will require prodigious numbers of reliably annotated images, amassed as training sets.
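Concretely, such a training set is just a large collection of photographs paired with their annotation masks. The sketch below shows one way such pairs might be assembled from files on disk; the directory layout, file suffixes and function name are hypothetical, not taken from the study.

```python
# Hypothetical sketch of pairing photographs with annotation masks to form a
# training set for a segmentation model; naming conventions are assumed.
from pathlib import Path

def build_training_pairs(image_dir: str, mask_dir: str):
    """Return (photo, mask) path pairs for every image that has an annotation."""
    pairs = []
    for img_path in sorted(Path(image_dir).glob("*.png")):
        mask_path = Path(mask_dir) / (img_path.stem + "_mask.png")
        if mask_path.exists():  # only reliably annotated images are usable
            pairs.append((img_path, mask_path))
    return pairs

# e.g. pairs = build_training_pairs("photos/", "annotations/")
```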
“A solution for economically generating the needed training sets could streamline research into a host of diseases and conditions and benefit patient evaluation to boot. We wondered, particularly with today’s gig economy, what sorts of results might be achieved by giving non-experts a few pointers and letting them demarcate images in a web interface. How might pooled non-expert evaluations stack up against expert evaluation?” Tkaczyk asked.
He and Daniel Fabbri, PhD, assistant professor of Biomedical Informatics, and colleagues tested this notion in a new crowdsourcing study appearing in Skin Research & Technology. They examined crowd worker evaluation of chronic graft-versus-host disease (cGVHD), a sometimes fatal condition.
Skin is the most commonly affected organ in cGVHD, which is the leading long-term cause of morbidity and mortality (other than cancer relapse) following stem cell transplantation.
The study used 41 3D photographs of cGVHD patients. The visible burden of cGVHD in these images was first highlighted by a board-certified dermatologist with a particular interest in the disease.
Seven crowd workers, in this case medical students and nurses, were given two-dimensional projections of the 3D images and were asked to emulate the expert, based on a slide presentation about cGVHD and a small set of marked-up images as guiding examples.
The researchers evaluated the crowd's work pixel by pixel. After throwing out the extremes, the evaluations with the fewest or the most pixels highlighted, the pooled evaluations of four crowd workers reached a median accuracy of 76 percent across 410 images, measured as pixel-by-pixel agreement with the expert's markings.
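One plausible reading of that procedure, trim the crowd evaluations at the extremes, fuse the remainder per pixel, and score agreement against the expert mask, is sketched below in Python. The majority-vote pooling rule and tie handling here are assumptions for illustration, not necessarily the paper's exact method.

```python
# Hedged sketch of trimmed pooling and pixel-wise scoring: drop the crowd masks
# with the fewest and the most highlighted pixels, fuse the rest by per-pixel
# majority vote, then score agreement with the expert annotation.
import numpy as np

def pooled_accuracy(crowd_masks: list[np.ndarray], expert_mask: np.ndarray) -> float:
    # Sort workers by how many pixels they highlighted and trim one extreme from each end.
    ordered = sorted(crowd_masks, key=lambda m: int(np.count_nonzero(m)))
    trimmed = ordered[1:-1] if len(ordered) > 2 else ordered

    # Per-pixel majority vote over the remaining workers (ties count as "affected").
    votes = np.sum(np.stack(trimmed).astype(int), axis=0)
    pooled = votes * 2 >= len(trimmed)

    # Pixel-by-pixel accuracy against the expert mask.
    return float(np.mean(pooled == expert_mask.astype(bool)))
```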
“This places this group of crowd workers, as a collective, very much on a par with expert evaluation for cGVHD,” Tkaczyk said. “Our results establish that crowdsourcing could aid machine learning in this realm, which stands to benefit research and clinical evaluation of this disease.”
Joining Tkaczyk and Fabbri were Joseph Coco, MS, Jianing Wang, MS, Fuyao Chen, Cheng Ye, Madan Jagasia, MBBS, MSCI, and Benoit Dawant, PhD.
The study was supported by grants from the U.S. Department of Veterans Affairs (CX001785) and the National Institutes of Health (CA090652, CA203708).