Cancer Data Science Pulse
CANDLE: Scaling JDACS4C Machine Learning Algorithms to Unprecedented Magnitudes
This is the second of a series of posts that discuss the principles underlying the three-year collaborative program "Joint Design of Advanced Computing Solutions for Cancer (JDACS4C)." Investigators from the National Cancer Institute (NCI) and the Frederick National Laboratory for Cancer Research have been working collaboratively with experts in informatics and the physical sciences affiliated with several national laboratories supported by the Department of Energy (DOE): principally Argonne, Los Alamos, Lawrence Livermore, and Oak Ridge. Their main goal is to develop, demonstrate, and disseminate computational approaches designed to answer challenges in aggregating and analyzing heterogeneous data sets using machine or deep learning capabilities to model and simulate the biology of cancer, thereby bringing us closer to making predictive precision oncology a reality. Their solutions have been developed through three pilot studies addressing the specific computational needs of scientists working at the cellular, molecular, and population levels. Together, these pilots are intended to pioneer new approaches to research and clinical translation that can be adopted and scaled up. The broader cancer community and, most importantly, patients will benefit as today's emerging exascale computing capabilities become accessible tomorrow and mainstream in the future.
The CANcer Distributed Learning Environment (CANDLE) is the first exascale application that enables scaling deep learning algorithms to DOE supercomputers.
Launched in 2016 in response to a convergence of national initiatives including the National Strategic Computing Initiative, NCI Cancer Moonshot, and the Exascale Computing Initiative, the CANcer Distributed Learning Environment (CANDLE) is a multi-laboratory collaborative project led by Argonne National Laboratory involving scientists and computational experts from Lawrence Livermore National Laboratory, Oak Ridge National Laboratory, Los Alamos National Laboratory, and Frederick National Laboratory for Cancer Research. With a synergy commonly found throughout the JDACS4C collaboration, NCI and Frederick National Laboratory scientists involved in JDACS4C work closely with scientists at the DOE-supported national laboratories to develop the capabilities found in CANDLE. Because CANDLE is supported by the DOE Exascale Computing Project (ECP), the lessons and insights of the collaboration also extend to software engineers and computational scientists in private industry, including such companies as NVIDIA, Cray, IBM, AMD, and Intel.
Essentially, the foundation of CANDLE is a synthesis of the innovations in scalable deep neural network code being co-developed within the context of the challenges embodied in three JDACS4C pilot studies, all of which are based on pre-existing NCI petascale-level data resources and ongoing research initiatives:
- Cellular Level Pilot: Predictive computational model development to identify drug candidates and to predict patient responses to single agents or combinations of agents at a cellular-level resolution. The pilot employs a supervised deep learning methodology augmented by stochastic pathway modeling and experimental design.
- Molecular Level Pilot: Complex membrane-bound RAS and RAS-complex spatial and temporal dynamics at molecular and sub-molecular levels of resolution. The pilot employs an unsupervised deep learning methodology to simulate protein interactions in cancer-related biological pathways at multiple scales (essentially a systems-biology approach).
- Population Level Pilot: Population-scale information extraction and analysis of unstructured clinical/phenotype data and agent-based simulations. The pilot employs a semi-supervised deep learning methodology to predict disease recurrence and progression.
The breadth of the machine learning algorithms used in the pilots - unsupervised, semi-supervised, and supervised - has enabled developers to embed deep learning into every aspect of the DOE iterative co-development paradigm (from big data integration and analytics to modeling and simulation). The resulting "cancer learning system" joins data, prediction, and feedback in a learning loop that holds great potential to revolutionize how we understand the biology of cancer and how we prevent, detect, diagnose, and treat neoplastic diseases.
The DOE leads intend to scale up this learning system to achieve exascale capacity while running CANDLE on the department's most powerful upper-petascale computers, three of which are main elements of the DOE CORAL collaboration. The systems are the Argonne Theta system, at 11.69 petaflops; the Oak Ridge Summit system, with a performance level of >40 petaflops, and the TITAN system, with a peak performance level of >20 petaflops; the Lawrence Livermore/National Nuclear Security Administration Sierra system, with a peak performance level of 120-150 petaflops; and the National Energy Research Scientific Computing Center system, with a peak performance level of 30 petaflops. (Note: One petaflop equals one quadrillion calculations per second.) These pre-exascale systems are equipped with early instances of exascale technologies that DOE developers are using to optimize the CANDLE code and facilitate the transition to a true exascale system, which is expected by 2021.
The collaborative teams achieved several milestones during 2017, with many aimed at expanding deep or machine learning capabilities. In March, they delivered a variational autoencoder (i.e., a multilayer neural network for unsupervised learning), a classifier built on a multilayer perceptron, or neural network, using long short-term memory architecture, and a multilayer perceptron using local contrast normalization for logistic regression of drug response for the Cellular Level Pilot. The Molecular Level Pilot team built an autoencoder for unsupervised learning intended to detect molecular dynamics trajectory states and a recurrent neural network with a long short-term memory architecture to control molecular-dynamics simulation. And the Population Level Pilot developers devised a recurrent neural network with a long short-term memory architecture for analyzing the unstructured text of cancer pathology reports and a convolutional neural network to extract terms from such reports.
In July 2017, CANDLE version 0.1, designed to enable model exploration at scale, was released. The developers configured architectural components to carry out hyperparameter optimization and added deep neural networks and workflows as well as data management and visualization capabilities. Efficient hyperparameter optimization requires highly parallelized searching across potentially millions of nodes and then mapping the search onto the hardware. It is one of the central and most difficult challenges developers face when they attempt to attain exascale capabilities.
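To make the idea of a parallelized hyperparameter search concrete, here is a minimal sketch in Python using only the standard library. It is not CANDLE code. The function `validation_loss` is a hypothetical stand-in for training a network and scoring it, and the search space (a learning rate and a hidden-layer width) is assumed purely for illustration. The sketch uses random search, one of several possible strategies, and a thread pool as a small-scale stand-in for dispatching each trial to separate compute nodes.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def validation_loss(config):
    """Hypothetical stand-in for training a network and returning its validation loss.

    This toy surface has its minimum at learning_rate = 1e-3, hidden_units = 256.
    """
    lr_term = (math.log10(config["learning_rate"]) + 3.0) ** 2
    width_term = ((config["hidden_units"] - 256) / 256.0) ** 2
    return lr_term + width_term


def sample_config(rng):
    """Draw one random configuration from the (assumed) search space."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform over [1e-5, 1e-1]
        "hidden_units": rng.choice([64, 128, 256, 512, 1024]),
    }


def random_search(n_trials, n_workers, seed=0):
    """Evaluate n_trials random configurations concurrently.

    Returns (best_loss, best_config).
    """
    rng = random.Random(seed)
    configs = [sample_config(rng) for _ in range(n_trials)]
    # Each trial is independent, so the evaluations can be mapped onto workers.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        losses = list(pool.map(validation_loss, configs))
    best_index = min(range(n_trials), key=losses.__getitem__)
    return losses[best_index], configs[best_index]


best_loss, best_config = random_search(n_trials=200, n_workers=8)
print(best_loss, best_config)
```

Because every trial is independent, the search itself parallelizes easily. The hard part at scale, as noted above, is mapping many expensive trials onto the available hardware, which this sketch does not attempt.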
In August 2017, developers demonstrated a prototype deep neural network for extracting data elements such as biomarkers and metastasis from existing pathology reports in the NCI Surveillance, Epidemiology, and End Results (SEER) Program database. The prototype also includes examples of how to determine the optimum network architecture for maximizing precision, recall, and F-score figures of merit (FOMs). This achievement was rapidly followed in October 2017 by the delivery of a deep learning network architecture for encoding multiple molecular features (e.g., gene expression, microRNA, and proteomics), drug descriptors and fingerprints, and predictions of drug response - the panoply of data elements informing the pilots. The new deep neural network performed remarkably well: it integrated multiple feature categories and explained more than 92% of the variance in drug response, outperforming conventional machine learning models.
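For readers unfamiliar with these figures of merit, the following Python sketch shows how they are commonly computed. It is an illustration rather than the pilots' evaluation code. The function names and toy data are hypothetical. The source does not say which statistic underlies "variance explained", so the sketch assumes the usual coefficient of determination (R²), and it assumes the F1 form of the F-score.

```python
def precision_recall_f1(true_labels, predicted_labels, positive=1):
    """Return (precision, recall, F1) for one class treated as positive."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_observed = sum(observed) / len(observed)
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    return 1.0 - ss_residual / ss_total


# Toy extraction task: 1 means the data element is present in a report.
truth = [1, 1, 1, 0, 0, 1, 0, 0]
guess = [1, 0, 1, 1, 0, 1, 0, 0]
# 3 true positives, 1 false positive, 1 false negative: all three scores are 0.75.
precision, recall, f1 = precision_recall_f1(truth, guess)
print(precision, recall, f1)

# Toy drug-response values; R^2 is approximately 0.98 for these numbers.
response = [1.0, 2.0, 3.0, 4.0]
estimate = [1.1, 1.9, 3.2, 3.8]
score = r_squared(response, estimate)
print(score)
```

Under this reading, an R² above 0.92 means the model's residual error is less than 8% of the total variation in the measured drug response.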
Throughout the rapid development of the CANDLE platform, the researchers have conducted systematic outreach to the wider research community, sharing data, computing resources, and findings. They have made the public data sets they used immediately available without restriction through an FTP site. After development and testing, new code has been routinely deposited in a dedicated GitHub repository. Peer-reviewed journal articles have been published steadily, and more are under development. Conference slide presentations, posters, and training activities continue to be disseminated, and CANDLE workshops continue to gain popularity.
To ensure the wide accessibility of CANDLE, it was introduced, as soon as it became minimally functional, into Biowulf, the cluster of >90,000 interconnected computing nodes providing high-performance computing services to National Institutes of Health (NIH) investigators; the network is presently at petascale capacity. After further development, the Biowulf instance of CANDLE became generally available to the intramural community, with some laboratories in the NCI Center for Cancer Research aiming to leverage its capabilities and contributing to its ongoing development. Because of the iterative and collaborative nature of co-development, users of CANDLE in effect become co-creators of the computation environment. The opportunities for further collaborations continue to expand as a growing number of groups beyond DOE and NCI begin exploring the capabilities introduced into CANDLE.
Development for 2018 is already well underway. During this time, the emphasis will widen to include not only the ongoing refinement and integration of a variety of neural network architectures, but also the achievement of data and model parallelism across compute nodes. Part of the plan is to integrate the toolkit for the still-developing Livermore Big Artificial Neural Network (LBANN) software package into CANDLE. The LBANN software is focused on training neural network models of unprecedented scale using vast stores of data.
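As a rough illustration of what data parallelism means, the Python sketch below splits a batch across several simulated workers, has each compute a gradient on its own shard, and averages the results before a single shared update. It is not CANDLE or LBANN code. The tiny linear model, the function names, and the sequential loop standing in for separate compute nodes are all assumptions made for the example. Model parallelism, in which the network itself is split across nodes, is not shown.

```python
def gradient(weight, bias, xs, ys):
    """Gradient of mean squared error for the model y = weight * x + bias on one shard."""
    n = len(xs)
    grad_w = sum(2 * (weight * x + bias - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (weight * x + bias - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


def shard(items, n_workers):
    """Split a list into n_workers equal contiguous shards (length must divide evenly)."""
    size = len(items) // n_workers
    return [items[i * size:(i + 1) * size] for i in range(n_workers)]


def data_parallel_step(weight, bias, xs, ys, n_workers, learning_rate):
    """One synchronous step: per-worker gradients, averaged, then one shared update."""
    # On a real machine each shard would live on a different node; here we loop.
    grads = [
        gradient(weight, bias, shard_x, shard_y)
        for shard_x, shard_y in zip(shard(xs, n_workers), shard(ys, n_workers))
    ]
    # Averaging plays the role of the all-reduce communication step.
    grad_w = sum(g[0] for g in grads) / n_workers
    grad_b = sum(g[1] for g in grads) / n_workers
    return weight - learning_rate * grad_w, bias - learning_rate * grad_b


# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [2.0 * x + 1.0 for x in xs]

weight, bias = 0.0, 0.0
for _ in range(2000):
    weight, bias = data_parallel_step(weight, bias, xs, ys, n_workers=4, learning_rate=0.01)
print(round(weight, 3), round(bias, 3))
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, so the workers jointly take the same step a single machine would, while each touches only a fraction of the data.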
Finally, developers (life-science researchers, experts in informatics or the physical sciences, and statisticians) will continue to engage in demonstration projects aimed at improving and refining the performance of deep learning algorithms. Their activities will cover several areas:
- optimization for molecular dynamics simulation
- achievement of multi-task learning that extracts multiple targets (text and data) and maximization of learning capacity through optimizing network architecture
- addition of descriptions of drug targets thereby enhancing the cell-line properties set
- creation of a prototype deep neural network for extracting information from clinical records
- development of a deep neural network prototype that includes drug descriptors, drug fingerprints, and multiple molecular properties of tumors
Altogether, the development and dissemination of the CANDLE learning environment represent a credible and exciting step toward realizing a key recommendation of the Blue Ribbon Panel, which the NCI National Cancer Advisory Board appointed in response to the Cancer Moonshot challenge of achieving ten years' worth of research progress in five years: to create a national infrastructure for sharing and processing cancer data. Our ability to accelerate progress against cancer demands that researchers, clinicians, and patients across the country collaborate in sharing their collective data and knowledge about the disease. This broad infrastructure will enable researchers to more effectively mine cancer-related data to develop new strategies to prevent, diagnose, and treat cancers as well as to understand the fundamental nature of the disease.
 Enhanced Data Sharing Working Group Recommendation: The Cancer Data Ecosystem, accessed on June 5, 2018, https://www.cancer.gov/research/key-initiatives/moonshot-cancer-initiative/blue-ribbon-panel/enhanced-data-sharing-working-group-report.pdf