Cancer Data Science Pulse
Predictive Modeling for Pre-clinical Drug Screening: Improving Models Derived From Observational Studies Using Machine Learning and Simulation
This blog post, the fifth and last in our series, discusses the principles underlying the collaborative project "Joint Design of Advanced Computing Solutions for Cancer (JDACS4C)." Investigators from the National Cancer Institute (NCI) and the Frederick National Laboratory for Cancer Research have been working with experts in applied machine learning and computer science, using the world's largest computers, which are affiliated with several national laboratories stewarded by the U.S. Department of Energy (DOE): principally Argonne, Los Alamos, Lawrence Livermore, and Oak Ridge. Their main goal is to develop, demonstrate, and disseminate computational approaches that address the challenges of aggregating, integrating, and analyzing large, heterogeneous datasets, and to use the results to build predictive models that can be tested and refined through simulation and validated principally by comparison with experimental data. Such mathematical/statistical approaches complement the direct study of cells and animal models. When combined, they can result in hybrid models that better approximate the biology of human cancer and therefore have the potential to improve risk identification, pre-clinical drug screening, and treatment selection when translated into the clinic.
NCI has a long tradition of developing and characterizing cancer cell lines and animal models designed to recapitulate human disease as closely as possible, identifying new anti-cancer compounds for drug development, and translating that knowledge to better predict individual responses to specific therapeutic agents. These are primary responsibilities of the NCI Division of Cancer Treatment and Diagnosis (DCTD), whose director, Dr. James Doroshow, is the NCI co-lead for the Cellular Level Pilot.
In a complementary manner, the DOE has a rich history of pushing frontiers in computing and DOE Laboratories have some of the most significant high-performance computing resources available, including some of the fastest supercomputers in the world. DOE develops and deploys high-performance computing facilities that accelerate scientific discovery and advance its national security mission. DOE National Laboratories apply these leading supercomputing capabilities to develop new approaches and techniques that enable computational scientists to solve important large-scale problems in areas including bioscience, environmental science, math, and computer science. Professor Rick Stevens, associate laboratory director for Computing, Environment and Life Sciences at Argonne National Laboratory, is the DOE co-lead for the Cellular Level Pilot.
The Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) Cellular Level Pilot aims to improve the predictive capabilities of pre-clinical screening using advanced computation - in this case, combining observational data derived from experiments with cancer cell lines and from animal models with statistical analysis, deep and machine learning, modeling, and simulation. Necessary to achieve these aims is the integration of data derived from mechanistic biological models (cell lines and immunocompromised mouse models bearing patient-derived xenografts [PDXs]) with virtual models capable of scaling up machine learning approaches, predictive modeling, simulation, validation, and uncertainty quantification (UQ) in ways that take advantage of the strengths and compensate for the weaknesses of each approach.
On the institutional level, NCI scientists hope to expand the breadth of treatment options for precision oncology and support clinical decision-making, while DOE investigators plan to carry through the co-design of computational architectures that integrate learning systems with modeling and simulation. Both groups require high-performance computation at the near-exascale or exascale level to rapidly develop, simulate, and validate predictive pre-clinical models, identify requirements for future architecture, and design software to support an integrated, hybrid approach to biological and computational modeling.
In the first year, the pilot team accomplished several goals necessary for data integration and model validation. The raw material came from collections curated by the National Institutes of Health (NIH) and NCI and made available through multiple repositories, including the Library of Integrated Network-Based Cellular Signatures (LINCS) repository, the NCI ALMANAC Study, the NCI Patient Derived Models Repository (PDMR), and the Genomic Data Commons (GDC). Specifically, the team:
- developed pipelines for generating comparable expression profiles from diverse sources and platforms, including microarray-based gene-expression studies and next-generation sequencing-based multi-omics analyses of cell lines such as the NCI-60, patient tumor samples, and PDXs;
- clustered expression profiles across cell lines, cell-line xenografts, PDX models, and patient samples;
- developed methods to transfer models from cell-line data to cell-line xenograft data, and finally to PDX data;
- developed and stored abbreviated molecular signatures in a Model Response Library, which radically reduced the number of features used to formulate optimal modeling parameters while simultaneously minimizing the loss of predictive power; and
- performed trade-off studies comparing the number of assays and the number of samples to help determine relative importance in differing research contexts, again focusing on methods of data selection.
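Two of the steps listed above, making expression profiles from different platforms comparable and reducing them to an abbreviated signature, can be illustrated with a toy example. The post does not say which methods the pilot team used, so this is only a minimal sketch under stated assumptions: profiles are standardized gene by gene within each source (z-scores), and the signature is simply the genes whose pooled values correlate most strongly with a measured response. The function names, gene names, and all numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def zscore_by_source(profiles):
    """Standardize each gene within its own source (platform).

    profiles maps a source name to a list of samples; each sample maps
    gene name -> expression value. Assumption: per-source z-scoring is
    enough to put the platforms on a common scale.
    """
    out = {}
    for source, samples in profiles.items():
        genes = list(samples[0])
        stats = {}
        for g in genes:
            values = [s[g] for s in samples]
            sd = pstdev(values)
            # Guard against constant genes to avoid dividing by zero.
            stats[g] = (mean(values), sd if sd > 0 else 1.0)
        out[source] = [
            {g: (s[g] - stats[g][0]) / stats[g][1] for g in genes}
            for s in samples
        ]
    return out


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_signature(samples, responses, k):
    """Keep the k genes most correlated (in absolute value) with the response."""
    genes = list(samples[0])
    score = {g: abs(pearson([s[g] for s in samples], responses)) for g in genes}
    return sorted(genes, key=lambda g: score[g], reverse=True)[:k]


# Hypothetical toy data: the two platforms report on very different scales.
toy_profiles = {
    "microarray": [
        {"GENE_A": 120.0, "GENE_B": 300.0, "GENE_C": 50.0},
        {"GENE_A": 180.0, "GENE_B": 310.0, "GENE_C": 20.0},
        {"GENE_A": 240.0, "GENE_B": 290.0, "GENE_C": 35.0},
    ],
    "rnaseq": [
        {"GENE_A": 4.0, "GENE_B": 7.1, "GENE_C": 2.0},
        {"GENE_A": 5.0, "GENE_B": 7.0, "GENE_C": 3.5},
        {"GENE_A": 6.0, "GENE_B": 7.2, "GENE_C": 1.0},
    ],
}
# Hypothetical drug response for the six samples, in the pooled order below.
toy_response = [0.1, 0.5, 0.9, 0.1, 0.5, 0.9]

standardized = zscore_by_source(toy_profiles)
pooled = standardized["microarray"] + standardized["rnaseq"]
signature = select_signature(pooled, toy_response, k=1)
print(signature)
```

In this toy case only GENE_A tracks the response consistently on both platforms once the scales are aligned, so it is the single gene retained. Real pipelines would also have to deal with batch effects, missing genes, and far larger feature sets, none of which this sketch attempts.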
The Cancer Distributed Learning Environment (CANDLE) project has been a key component of the efforts in the Cellular Level Pilot, providing the technical capability and horsepower required to explore the combinations of data and models critical to the success of this effort. These activities mark a critical transition that moves beyond data collection, curation, and dissemination into determining appropriate uses of data types within specific studies and formulating new hypotheses in high-performance computing environments.
One of the more significant early findings by the research team is that hybrid models - models that encompass and integrate observational data with virtual data - are often more accurate than those based on either experimental findings or virtual data alone. However, it is also true that in vitro cell lines, PDX models, and in silico models using mathematics, statistics, and inference each inevitably have profound limits. These differences in behavior - among cells in culture, animal models, and virtual models in the context of the complex, essentially stochastic biology of human cancer - cannot be erased, but they can be narrowed. In the hybrid models under development, that distance is at least partially bridged by unsupervised machine learning, which improves classification accuracy using complex three-dimensional feature encoding.
Like all other modeling paradigms, hybrid modeling requires both validation and an explicit acknowledgment of uncertainty in the form of UQ, because complex, stochastic systems such as human cancer cannot be fully or exactly known. Data, especially data relevant to rare pathophysiological events, may be too sparse to lend sufficient predictive power to a model, even when petascale-level data collections are available. Moreover, data may reflect biases or simply be erroneous, despite quality controls. Further, errors can result from inadequacies in computational methodologies such as statistical analysis or machine learning algorithms.
Much of the researchers' more recent effort has been devoted to assessing various error-reduction methods. They argue that uncertainty reduction in itself should be a significant factor driving experimental design. UQ is critical to optimal experimental design because it sets limits on inherently probabilistic predictions. Moreover, it is a critical tool for improving modeling paradigms that support the introduction of mechanistic models into the machine learning framework.
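To make the idea of putting limits on a prediction concrete, here is a minimal sketch of one generic UQ technique, a bootstrap ensemble. It is not necessarily one of the methods the pilot team evaluated; the post does not name them. The sketch assumes a simple straight-line model in place of a deep learning model, and the dose and inhibition values are hypothetical. Many copies of the model are fitted to resampled data, and the spread of their predictions serves as a rough measure of confidence.

```python
import random
from statistics import mean, stdev


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate resample (all x identical): fall back to a flat line.
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def bootstrap_predict(xs, ys, x_new, n_models=200, seed=0):
    """Fit n_models lines on resampled data; return mean and spread at x_new."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(xs)
    preds = []
    for _ in range(n_models):
        # Draw n data points with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(a + b * x_new)
    return mean(preds), stdev(preds)


# Hypothetical toy data: dose (arbitrary units) vs. fraction of growth inhibited.
dose = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
inhibition = [0.05, 0.22, 0.35, 0.58, 0.71, 0.93]

# One query inside the range of the data, one far outside it.
inside_mean, inside_sd = bootstrap_predict(dose, inhibition, x_new=2.5)
outside_mean, outside_sd = bootstrap_predict(dose, inhibition, x_new=20.0)

print(f"dose 2.5: {inside_mean:.2f} +/- {inside_sd:.2f}")
print(f"dose 20:  {outside_mean:.2f} +/- {outside_sd:.2f}")
```

The spread is much wider for the dose far outside the observed range than for the one inside it, which is the kind of signal that can point to where additional experiments would reduce uncertainty the most.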
As one of its milestones, the pilot team is building a deep learning model that learns the relationships among the molecular properties of the tumor, the cell lines, and the structural properties of pairs of drugs to quantitatively predict tumor growth inhibition. To improve the model more efficiently, they have also worked on incorporating two methods for UQ so that they can better understand how confident they can be in their predictions. As UQ methodologies mature, they will be incorporated into the frameworks of Collaboration of Oak Ridge, Argonne, and Livermore (CORAL) and CANDLE and disseminated to the wider research community.
The table below summarizes the NCI/DOE Cellular Level Pilot team's aims and accomplishments during the first calendar year of JDACS4C.