New data mining method reveals cancer-driving genes

Prospecting for genes that might be implicated in cancer, a Vanderbilt University Medical Center research team has struck pay dirt.

Zhongming Zhao, Ph.D., Peilin Jia, Ph.D., and colleagues use novel computational methods to sift through online repositories of molecular data gathered by cancer researchers worldwide.

Peilin Jia, Ph.D., Zhongming Zhao, Ph.D., and colleagues are using novel computational methods to single out cancer-driving genes. (photo by Daniel Dubois)

In the journal Genome Biology, in a massive analysis of cancer mutation records, the team demonstrates data mining methods capable of singling out known cancer-driving genes from amidst thousands of other genes isolated from cancer tissue. More to the point, the authors use these tools to decisively brand hundreds of genes as suspected agents of cancer (“candidate cancer genes”), and they reveal several associations, hidden until now, between known cancer-driving genes and additional types of cancer.

These results, and further analyses planned by the team, can help focus laboratory and clinical research, speeding more complete understanding of the molecular basis of cancer.

According to Zhao, when cancer tissue and normal tissue from the same patient are compared, the cancer tissue is apt to contain hundreds or thousands more mutations in the coding regions of the genome alone — that is, in the genes. And while some of these mutations might be helping to drive the cancer, many others will simply be along for the ride.

“It’s not practical to test that many mutations in the lab or in clinical practice. So the idea is to somehow home in on mutations that are more apt to be clinically relevant. In that respect at least, we think these new methods are much better than previous methods,” Zhao said.

In many of our cells, the vagaries of DNA replication give rise to — among other things — DNA base insertions and deletions, together termed indels, and single base substitutions, called point mutations. Natural selection, here operating at the cellular level, is of course not indifferent to these accidents. In cancer these mutations often converge to form so-called hotspots in regions presumed to be somehow vital to the unchecked growth and success of the cancer.

The Vanderbilt team has gone in search of such hotspots. In the Cancer Gene Census, the most authoritative available count, the team saw that approximately one-third of the listed cancer-driving genes contained mutations of the sort eligible for hotspot analysis. Using two complementary methods, they found that 51 percent of these 183 genes bore significant hotspots.

Drawing on more than 840,000 mutation records from online repositories, the team proceeded to score some 18,284 eligible genes using a method called MSEA-clust (for mutation set enrichment analysis cluster). They found 947 genes with significant hotspots, including 82 known cancer genes.

This is an example of hypothesis-free, pan-cancer analysis, an approach used previously to measure differential gene expression across different types of cancer.

“An important aspect that distinguishes our method is that we identify not only candidate cancer genes, but also the precise regions of interest within these genes,” Zhao said.

And that new level of detail presumably can aid understanding of how particular gene products might go awry in cancer.

MSEA-clust distinguishes among individual mutations with regard to their downstream consequences. The method keeps track, for example, of deleterious and non-deleterious mutations, a step that’s missing from the sole previous attempt to systematically associate hotspots with cancer. And according to Jia, for this reason the earlier hotspot analysis is likely to have produced considerably more false positives. “Just not as elegant,” she said.
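For readers who want a feel for what a hotspot search involves, the following is a minimal sketch in Python. It is not the MSEA-clust algorithm itself, whose details are not given here. It assumes a much simpler stand-in: count the mutations in the densest fixed-width window along a gene, then ask how often randomly scattered mutations would cluster that tightly. The function names, the window width and the number of random trials are all illustrative assumptions.

```python
import random


def max_window_count(positions, window):
    """Largest number of mutations inside any stretch of `window` consecutive positions."""
    pts = sorted(positions)
    best, start = 0, 0
    for end in range(len(pts)):
        # Shrink the window from the left until it spans fewer than `window` positions.
        while pts[end] - pts[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best


def hotspot_p_value(positions, gene_length, window=10, n_perm=2000, seed=0):
    """Empirical p-value for the densest window, under a uniform-scatter null (an assumption)."""
    rng = random.Random(seed)
    observed = max_window_count(positions, window)
    hits = 0
    for _ in range(n_perm):
        # Place the same number of mutations uniformly at random along the gene.
        shuffled = [rng.randint(1, gene_length) for _ in positions]
        if max_window_count(shuffled, window) >= observed:
            hits += 1
    # Add-one correction keeps the estimate above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical gene of length 500 with eight mutations packed into positions 100-107.
clustered = [100, 101, 102, 103, 104, 105, 106, 107, 20, 400]
count, p = hotspot_p_value(clustered, gene_length=500)
print(count, p)
```

A sketch like this treats every mutation alike. The method described above goes further by tracking whether each mutation is deleterious, which this example does not attempt.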

With a second method called MSEA-domain, the Vanderbilt team looked for any disproportionately high occurrence of indels and point mutations in DNA corresponding to protein domains.

Protein domains are modular structures that pop up again and again in different configurations in different proteins.

“The idea is that domain regions are more apt to be functionally critical, and mutations that locate there may be more apt to confer abnormal cell growth,” Jia said.

They ran MSEA-domain on some 14,224 eligible genes and found 203 with significant hotspots, including 43 known cancer genes.
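The domain idea can be illustrated with a short Python sketch. It is not the MSEA-domain procedure itself. It assumes a simple one-sided binomial test: if mutations fell at random along a protein, the share landing in domains should match the share of the protein's length that the domains cover. The function names and the example coordinates are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """Probability of k or more successes in n trials with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def domain_enrichment(mutation_positions, domains, protein_length):
    """Count mutations inside domains and test for an excess.

    `domains` is a list of (start, end) pairs, 1-based and inclusive,
    assumed here not to overlap one another.
    """
    in_domain = sum(
        any(start <= pos <= end for start, end in domains)
        for pos in mutation_positions
    )
    covered = sum(end - start + 1 for start, end in domains)
    expected_fraction = covered / protein_length
    p_value = binomial_tail(in_domain, len(mutation_positions), expected_fraction)
    return in_domain, p_value


# Hypothetical protein of 1,000 residues with one domain covering 10 percent of it.
domains = [(101, 200)]
mutations = [110, 120, 125, 130, 150, 160, 170, 190, 500, 900]
hits, p = domain_enrichment(mutations, domains, protein_length=1000)
print(hits, p)
```

In this made-up case eight of ten mutations fall in a domain that spans a tenth of the protein, so the test returns a very small p-value. A real analysis would also need to correct for the thousands of genes tested.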

Other members of the team included Quan Wang, Ph.D., Qingxia Chen, Ph.D., M.S., William Pao, M.D., Ph.D., and doctoral candidate Katherine Hutchinson.

The study was supported by the National Institutes of Health (grants LM011177, CA68485, CA095103, CA098131), the American Cancer Society and the Joanna M. Nicolay Melanoma Foundation.