Cancer Data Science Pulse
Joint Design of Advanced Computing Solutions for Cancer (JDACS4C): The Right Collaboration at the Right Time to Accelerate Cancer Research
This is the first of a series of posts that discuss the pilot collaborative program "Joint Design of Advanced Computing Solutions for Cancer (JDACS4C)" being pursued by the National Cancer Institute (NCI) and the Department of Energy (DOE). Investigators from NCI and the Frederick National Laboratory for Cancer Research have been working collaboratively with experts in computational, data, and physical sciences from various national laboratories maintained by the DOE: principally Argonne, Los Alamos, Lawrence Livermore, and Oak Ridge. Their main goal is to collaboratively develop, demonstrate, and disseminate advanced computational capabilities to seek answers to driving scientific questions that increase our understanding of cancer biology, risk identification, pre-clinical screening, and treatment challenges. These efforts frame forward-looking approaches for integrating and analyzing large, heterogeneous data collections with advanced computational modeling and simulation that will accelerate cancer research. Subsequent posts will discuss first-year accomplishments in the context of computational methods and algorithms that have been developed to address the specific needs of scientists and clinicians working on research pilot projects at the cellular, molecular, and population levels. Together, these pilots are intended to pioneer new approaches to research and establish a foundation for the eventual translation of research findings into the clinic that can be adopted, transformed, and scaled to address new research questions. The broader cancer community, and most importantly patients, will benefit when the current, emerging exascale computing capabilities become more widely accessible and transition to mainstream use in the future. 
To exemplify this process, NCI and the DOE are using JDACS4C insights and accomplishments to inform the creation of larger informatics ecosystems in which the scientific community can leverage the most advanced forms of high-performance computing, including the CANcer Distributed Learning Environment (CANDLE) at NCI and the Collaboration of Oak Ridge, Argonne, and Livermore (CORAL) under the auspices of the DOE.
One of the primary goals of JDACS4C is to attain a greater understanding of cancer biology, diagnostics, prognostics, and treatment, utilizing increasingly available large-scale multimodal data analysis combined with advanced computational methods and algorithms targeting exascale computing technologies. These methods include data wrangling and mining, natural-language processing, machine or deep learning, predictive modeling, simulation, and uncertainty quantification as well as other means of ensuring reproducibility, all of which will lead to the formulation of better risk assessment and treatments for cancer patients and improvements in clinicians' ability to make the best possible treatment decisions. In short, the program's goal is to transform predictive oncology into a reality for everyone at risk of or experiencing any form of the disease.
The most advanced computers today operate well into the petascale range. Ascending from petascale to exascale represents a thousand-fold increase in computational capability: petascale environments can perform over a quadrillion (10^15) FLoating-point OPerations per Second (FLOPS), while exascale systems will be able to execute a quintillion (10^18) FLOPS. This form of high-performance computing can grow to these enormous scales by harnessing the power of hundreds to thousands of computer nodes working together to analyze and model complex biological systems.
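The thousand-fold figure follows directly from the two rates: 10^18 / 10^15 = 1,000. As a rough illustration only, the short Python sketch below applies that ratio to a hypothetical workload. The workload size (10^21 operations) and the helper name `runtime_seconds` are assumptions made for this example, and the calculation assumes perfect utilization with no data-movement or communication cost, which real systems do not achieve.

```python
# Illustrative arithmetic only; the workload size below is a made-up number.
PETAFLOPS = 10**15  # floating-point operations per second at petascale
EXAFLOPS = 10**18   # floating-point operations per second at exascale


def runtime_seconds(total_operations: float, flops: float) -> float:
    """Idealized runtime: assumes perfect utilization and no I/O or communication cost."""
    return total_operations / flops


speedup = EXAFLOPS // PETAFLOPS  # 1000

# Hypothetical workload of 10**21 floating-point operations.
workload = 10**21
peta_days = runtime_seconds(workload, PETAFLOPS) / 86_400  # seconds per day
exa_minutes = runtime_seconds(workload, EXAFLOPS) / 60

print(f"speedup: {speedup}x")  # speedup: 1000x
print(f"petascale: {peta_days:.1f} days, exascale: {exa_minutes:.1f} minutes")
# petascale: 11.6 days, exascale: 16.7 minutes
```

In this idealized picture, a job that occupies a petascale system for more than a week would finish in well under an hour at exascale. Actual gains depend on how well a given workload parallelizes across nodes.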
A much more profound understanding of fundamental cancer systems biology behavior at multiple scales, including molecular, cellular, multi-cellular, and tumor, will result from these increases in computational capacity and capabilities. Exascale computing will provide resources at the scale needed to refine models that simulate across scales, revealing new understanding for the impacts of protein modifications, genomic aberrations, epigenomic changes, complex structural dynamics, transport, and signaling on disease processes, including initiation, progression, and metastasis.
NCI and DOE are jointly pioneering advances in both informatics and information technology that simultaneously advance our understanding of cancer biology and its clinical applications. With both driving scientific questions and complex computing challenges, the collaborative principle employed is known as "co-design" or "co-development." It is a team-oriented, iterative approach in which advances in technical capability and biological knowledge mutually inform each other. The close synergy between the new biological insights and technical advances creates paths leading to further discoveries, questions, hypotheses, and computational solutions employing deep learning to construct predictive models that can be validated and reproduced.
Capitalizing upon the co-design principle, three pilot efforts were defined to explore the frontiers where exascale computing capabilities and computational approaches join cancer research priorities at the molecular scale, in cellular model development, and in cancer surveillance. Already, these co-design efforts are providing tremendous insight into the future for computational and predictive oncology, with each being the subject of future blog posts.
The "deep learning" approach to gaining insights from the flood of new data and information being generated is already having significant, even transformative, effects on cancer research. The CANcer Distributed Learning Environment (CANDLE) project, a DOE-supported exascale computing project being co-designed and developed within the collaboration, delivers deep-learning capabilities throughout the JDACS4C collaboration. CANDLE is now available to intramural investigators through Biowulf, the cluster of interconnected computing nodes that provides high-performance computing services across the National Institutes of Health (NIH). The code is also available via GitHub to any researchers able to set up their own instances of the environment.
The most obvious practical benefit of exascale computing will be the increase in computational capabilities relative to the largest systems used for scientific research operating today. These capabilities will prove invaluable for the cancer research community in its efforts to manage, analyze, and interpret the exceptionally large volumes of data required to develop predictive models of complex biological systems that are inherently stochastic (that is, random) in nature. As these capabilities and technologies translate to broader access, very large data collections that today require weeks and months to analyze, simulations that require months to run, and models that take years to develop and validate will instead require only a few hours, days, or weeks at most. These increases in capability and capacity will accelerate basic, translational, clinical, and population-level investigations to the point where achieving the central goal of the NCI Cancer Moonshot℠ (accomplishing ten years' worth of research in five years) will come within reach.
JDACS4C has been inspired and made possible by factors converging at this moment in history. There are unique opportunities for unprecedented advances in our understanding of cancer biology and oncology. Simultaneously, there are new capabilities in the areas of data management and analytics, natural-language processing, deep learning, predictive modeling and simulation, verification, validation, and uncertainty quantification. These converging factors include:
- Rapid increases in the wealth and diversity of big data collections that NCI and other institutions make available to the wider scientific community, e.g., the Genomic Data Commons. Presently dominated by molecular-level data generated by pan-omics approaches and high-throughput sequencing, these data sources are expanding to include phenotypic, clinical outcome, and epidemiological data, as well as new levels of in vivo imaging and proteogenomic data from an increasing breadth and diversity of samples.
- Advanced laboratory analysis technologies that allow empirical observations at increasingly fine levels of resolution. These allow observations that range from entire living systems, tissues, single cells, and subcellular components to atomic-level interactions among biomolecules. The latter has been achieved by cryo-electron microscopy, whose developers recently received a Nobel Prize for work that spanned more than 20 years.
- The long-term collaborative projects pursued by the DOE to radically scale up computational capabilities across disciplines. The DOE has an explicit mandate to develop exascale computing resources as part of the National Strategic Computing Initiative. A central element of DOE strategy is to develop cross-cutting technology to unite predictive simulation, deep learning, and big data analytics. The computational methods developed under JDACS4C will have a reciprocal impact on DOE's mission science.
- Changes in scientific culture. Most relevant to the success of JDACS4C is a growing acceptance of open-science principles, such as data sharing, and a greater willingness to form teams and collaborate across institutions and disciplines ranging from engineering, physics, mathematics, statistics, and computer science to the subspecialties of cancer biology, epidemiology, clinical oncology, and implementation science. Not coincidentally, these same disciplines are required to pursue systems-level approaches necessary to understand the complexities of cancer biology.
While increased speed in data management and analysis is the most immediately tangible aspect of extreme-scale computing, the most meaningful advances lie in the ability to handle computational complexity by building larger simulations to better capture the biological complexity and heterogeneity of cancer. High-performance capabilities rest on the foundation of massively parallel computing processes, which are required for predictive modeling and simulation and, ultimately, precision medicine. The effectiveness and validity of modeling and simulation depend upon statistical analysis and probabilistic thinking. Very large data collections increase statistical power, and analytical algorithms use ever-more-finely determined parameters drawn from experimental data. But, even with these many boundaries defined, models remain probabilistic, not exact, representations of complex biological phenomena.
Empirical observations will always remain ground truth as they represent observed clinical and scientific reality. What is changing is our view of the complexities of cancer that become apparent as data, domain knowledge, and simulation converge. In addition to rigorous methods of verification and validation, the JDACS4C teams are undertaking a cross-cutting effort of uncertainty quantification. Uncertainty quantification addresses critical aspects of using data and complex simulations to assign a confidence interval to the output of new deep learning techniques. Even with the inherent complexities of biology, new insights, new approaches, and new capabilities enable us to transform the approaches taken to accelerate and advance cancer research.
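To make the idea of attaching an interval to a model's output more concrete, here is a minimal Python sketch of one generic technique, a bootstrap ensemble. It is not the method used by the JDACS4C teams, which this post does not describe. A simple least-squares line stands in for a deep learning model, the data are synthetic, and all function names and parameters (`fit_line`, `bootstrap_interval`, `n_models`) are assumptions introduced for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def bootstrap_interval(xs, ys, x_new, n_models=500, level=0.95, seed=0):
    """Percentile interval for the model's prediction at x_new.

    Each "ensemble member" is the same model refit on a resample
    (with replacement) of the training data.
    """
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:  # skip degenerate resamples with a single x value
            continue
        slope, intercept = fit_line(bx, by)
        preds.append(slope * x_new + intercept)
    preds.sort()
    lo = preds[int((1 - level) / 2 * len(preds))]
    hi = preds[int((1 + level) / 2 * len(preds)) - 1]
    return lo, statistics.fmean(preds), hi


# Synthetic data (an assumption for this example): y = 2x + 1 plus Gaussian noise.
data_rng = random.Random(42)
xs = [float(i) for i in range(30)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0.0, 1.0) for x in xs]

lo, center, hi = bootstrap_interval(xs, ys, x_new=10.0)
print(f"prediction at x=10: {center:.2f} (95% interval {lo:.2f} to {hi:.2f})")
```

The interval here reflects only how much the fitted model changes when the training data are resampled. It does not capture measurement noise in a new observation or errors in the model's form, which is part of why uncertainty quantification is treated as a cross-cutting effort alongside verification and validation.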
This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of the National Institutes of Health.
This work was performed under the auspices of the U.S. Department of Energy by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC52-06NA25396, and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725. LLNL-MI-751204