Cancer Data Science Pulse
Winning an HPN-DREAM8 Challenge: Competition and Collaboration Supporting Open Scientific Research and Development
While working with big genomics data can be very challenging, it can also be fun. And when your team wins a competition, it's positively exhilarating, as our NCIP Computational Genomics Research Group discovered while participating in the 8th annual set of challenges posed by The Dialogue on Reverse Engineering Assessment and Methods (DREAM) project.
DREAM challenges are designed to accelerate the creation of new predictive models in biomedicine. They can, for instance, catalyze the development of innovative methods for inferring cellular networks, thereby advancing network and systems biology. As the history of DREAM computational challenges has shown, they also encourage the creative re-use of open-source applications and therefore provide strong support for critical facets of the open-science movement. Sage Bionetworks provided the infrastructure that supported the competition, which required maintaining leaderboards that were updated weekly throughout the three-month period of iterative development, evaluation, and scoring.
This year's DREAM project was sponsored by the Heritage Provider Network (HPN) and the Division of Cancer Biology at NCI. Other organizations providing support included the Netherlands Cancer Institute; the Oregon Health and Science University; the M.D. Anderson Cancer Center; the National Heart, Lung, and Blood Institute; and the Alfred P. Sloan Foundation.
Our team included Ying Hu, Chunhua Yan, Chih-Hao Hsu, Qingrong Chen, Yu Liu, George Komatsoulis, and me. Beginning in June 2013, we took part in the first phase of the HPN-DREAM8 Breast Cancer Network Inference Challenge. We began by investigating all of the HPN-DREAM8 challenges to determine which challenge or challenges we were most likely to win. The challenges covered three facets of predictive biological modeling:
(1) Inferring causal signaling networks following the perturbation of network nodes. This challenge was subdivided into two parts: (a) using experimental proteomic data from breast cancer cell lines, and (b) using in-silico data
(2) Building dynamic models that predict short-term time-course trajectories of phosphorylated proteins that had been exposed to drug-induced perturbations. This challenge was similarly subdivided into the use of data derived from proteomic experiments or from in-silico simulations
(3) Visualizing high-dimensional molecular time-course data resulting from network perturbations
We chose to pursue the goal defined in Challenge 2b: predicting time-course trajectories of phosphorylated proteins using in-silico data. We were given a training set consisting of 20 phosphoproteins exposed to various stimulatory or inhibitory drugs. Developing an application to meet the challenge goal entailed several steps. First, we built a consensus network for each stimulus/inhibitor drug pair by selecting common node-edge links generated by three different algorithms. These were:
- G1DBN, dynamic Bayesian network using first-order conditional dependencies (DBN)
- GeneNet, graphical Gaussian models (GM)
- bnlearn, maximum-minimum hill-climbing (MMHC)
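The consensus step can be illustrated with a minimal Python sketch. It is not the team's code, which relied on the R packages listed above. It assumes each method's output has already been reduced to a set of directed (parent, child) links; the function name `consensus_edges`, the node labels, and the example links are all hypothetical. The default threshold of two votes mirrors the "at least two methods" rule described in this post.

```python
from collections import Counter


def consensus_edges(edge_sets, min_votes=2):
    """Return the links proposed by at least `min_votes` of the methods."""
    votes = Counter()
    for edges in edge_sets:
        # set() ensures each method votes at most once per link
        votes.update(set(edges))
    return {edge for edge, count in votes.items() if count >= min_votes}


# Hypothetical (parent, child) links from three inference methods
dbn_links = {("A", "B"), ("B", "C"), ("C", "D")}
gm_links = {("A", "B"), ("C", "D"), ("A", "D")}
mmhc_links = {("A", "B"), ("B", "C"), ("D", "A")}

consensus = consensus_edges([dbn_links, gm_links, mmhc_links], min_votes=2)
print(sorted(consensus))
```

Raising `min_votes` to 3 keeps only links that all three methods agree on, which trades recall for fewer false positives.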
Next, we applied the generalized linear model (glm) using the consensus networks and the time-course data to predict phosphoprotein trajectories under the influence of each drug. In the glm analysis, the dependent variables (Y) were child nodes in the network, and the independent variables (X) were parent nodes.
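As a rough illustration of the parent-to-child regression, the sketch below fits a linear model with an intercept by ordinary least squares, which is what a generalized linear model reduces to with a Gaussian family and identity link. Several things here are assumptions rather than details from the challenge entry: the choice of family and link, the use of parent values at the same time point as the child, the helper names, and the toy data. It uses only the Python standard library.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_child(parent_rows, child_values):
    """Least-squares fit of child = b0 + sum(b_k * parent_k)."""
    design = [[1.0] + list(row) for row in parent_rows]
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, child_values)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict_child(coefs, parent_rows):
    """Predict child values from parent values and fitted coefficients."""
    return [coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))
            for row in parent_rows]


# Hypothetical training data: two parent nodes observed at five time points
parents_train = [(0.0, 1.0), (1.0, 0.5), (2.0, 2.0), (3.0, 1.5), (4.0, 3.0)]
child_train = [1.0 + 2.0 * p1 - 0.5 * p2 for p1, p2 in parents_train]

coefs = fit_child(parents_train, child_train)
predicted = predict_child(coefs, [(5.0, 2.0)])
print(coefs, predicted)
```

In this picture, one such model would be fitted per child node, with the parent set taken from the consensus network for the relevant drug condition.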
Protein links were selected if they had been predicted by at least two methods and ranked among the top 30-40 links by rank score. The scores were calculated from the frequency with which each protein link appeared in each method, and the cutoff for the top links was estimated from the weighted score distribution. Finally, we fit the model, using the training data to estimate the unknown coefficients, and made predictions, using the values of the parents (X) to predict the values of the children (Y). For each phosphoprotein pair, predicted trajectories for all stimuli were compared to the trajectories of the test data, and root mean squared error (RMSE) scores were calculated.
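For readers unfamiliar with the scoring metric, here is a minimal Python sketch of RMSE between a predicted and an observed trajectory. The function name and the example values are hypothetical, and the official challenge scoring may have aggregated or normalized errors differently.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between two equal-length trajectories."""
    if len(predicted) != len(observed):
        raise ValueError("trajectories must have the same length")
    if not predicted:
        raise ValueError("trajectories must not be empty")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical trajectories for one phosphoprotein under one stimulus
predicted_trajectory = [0.0, 1.0, 2.0, 3.0]
observed_trajectory = [0.0, 1.0, 2.0, 5.0]

score = rmse(predicted_trajectory, observed_trajectory)
print(score)
```

A lower score means the predicted trajectory tracks the test data more closely; a perfect prediction scores zero.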
What made us succeed in winning Challenge 2b? The strength of our approach lay in creating a computational context in which the three algorithms (DBN, GM, and MMHC) operated synergistically to substantially reduce the false-positive rate, thereby improving overall predictive accuracy.
In addition to Challenge 2b, we entered OmicCircos, our in-house visualization tool for representing omics data in circular plots, in Challenge 3. OmicCircos is an open-source application [1] written in R that we had previously made available to the wider informatics community via Bioconductor. Although we didn't win Challenge 3, the application did make a strong showing.
The second phase of the Challenge, HPN-DREAM 8.5, has begun and is expected to conclude in the summer of 2014. Its goal is to encourage collaboration among the teams that participated in the first phase of DREAM8. Competition and collaboration are two sides of the same coin, the coin being an open scientific environment in which researchers and developers use informatics to advance biomedicine and exchange ideas with as few restraints as possible.
You can see photos of our team and others participating in the DREAM8 conference at https://dream8conference.shutterfly.com/. In one picture, we're shown being presented with a check that we couldn't accept due to federal regulations. Nonetheless, the public recognition was more than gratifying. The conference took place on November 8, 2013, in Toronto.
[1] OmicCircos: A simple-to-use R package for the circular visualization of multidimensional omics data. Published in Cancer Informatics, January 2014.