DNA biobanks have revolutionized the study of the genetics of human disease, but verifying data quality across biobanks isn’t straightforward, and the field lacks systematic tools to test the accuracy of disease information.
In The American Journal of Human Genetics, Lisa Bastarache, MS, Josh Peterson, MD, MPH, and colleagues report replication experiments to assess the quality of biobank data. They show that data from four independent biobanks robustly replicates phenotype-genotype associations for hundreds of diseases. When this data was artificially corrupted, the replication rate plummeted, indicating that replication rates can be used to assess data quality.
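The logic of the corruption experiment can be illustrated with a small simulation. The sketch below is not the authors' pipeline. It assumes synthetic data, a simple two-proportion z-test, and a replication rule (p < 0.05 with the effect in the expected direction) chosen only for illustration. All function names, sample sizes, and carrier frequencies are hypothetical.

```python
import math
import random


def two_proportion_z(case_carriers, n_cases, control_carriers, n_controls):
    """Return (z, two-sided p) comparing carrier frequency in cases vs. controls."""
    p_case = case_carriers / n_cases
    p_control = control_carriers / n_controls
    pooled = (case_carriers + control_carriers) / (n_cases + n_controls)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls))
    if se == 0:
        return 0.0, 1.0
    z = (p_case - p_control) / se
    return z, math.erfc(abs(z) / math.sqrt(2))


def simulate_association(rng, n=1000, f_case=0.40, f_control=0.30):
    """Simulate one true association as a list of (is_case, is_carrier) pairs.

    The frequencies are made-up values, not taken from the study.
    """
    people = [(True, rng.random() < f_case) for _ in range(n)]
    people += [(False, rng.random() < f_control) for _ in range(n)]
    return people


def corrupt(people, fraction, rng):
    """Flip the case/control label for a random fraction of individuals."""
    return [
        ((not is_case) if rng.random() < fraction else is_case, is_carrier)
        for is_case, is_carrier in people
    ]


def replicates(people, alpha=0.05):
    """Assumed rule: significant, and carriers enriched among cases."""
    n_cases = sum(1 for is_case, _ in people if is_case)
    n_controls = len(people) - n_cases
    if n_cases == 0 or n_controls == 0:
        return False
    case_carriers = sum(1 for is_case, g in people if is_case and g)
    control_carriers = sum(1 for is_case, g in people if not is_case and g)
    z, p = two_proportion_z(case_carriers, n_cases, control_carriers, n_controls)
    return p < alpha and z > 0


def replication_rate(datasets, alpha=0.05):
    """Fraction of reference associations that replicate in the data."""
    return sum(replicates(d, alpha) for d in datasets) / len(datasets)


rng = random.Random(0)
datasets = [simulate_association(rng) for _ in range(50)]
clean_rate = replication_rate(datasets)
# Flipping half of the labels makes case/control status essentially random.
corrupted_rate = replication_rate([corrupt(d, 0.5, rng) for d in datasets])
print(f"clean: {clean_rate:.2f}  corrupted: {corrupted_rate:.2f}")
```

In this toy setting the replication rate falls sharply once the phenotype labels are scrambled, which is the pattern the study uses as a signal of data quality.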
The authors created a phenotype-genotype reference map to help researchers conduct their own replication experiments. Using the genome-wide association catalog, the team curated thousands of ancestry-based gene-disease associations deemed suitable for replication. The current map contains 5,879 genetic associations with 149 diseases in 13 disease categories. The authors show how replication can be used to assess data quality, compare different methods of defining phenotypes, and explore factors that influence the replicability of genetic associations.
Other study authors from the Department of Biomedical Informatics at Vanderbilt University Medical Center include Sarah Delozier, Jing He, MS, Adam Lewis, Robert Carroll, PhD, and Jacob Hughey, PhD. The National Institutes of Health (grants LM010685 and TR002243) supported the research.