NCI-DOE Collaboration AI/ML Resources

The NCI-Department of Energy (DOE) Collaboration fosters collaborative biomedical research by providing cutting-edge computational models, algorithms, data sets, software, and other resources, which are openly available to the broader research community. Our goal is to help you access, implement, and extend the high-impact resources developed and released by the NCI-DOE Collaboration.

On this page, you can find a comprehensive listing of the NCI-DOE Collaboration’s artificial intelligence and machine learning (AI/ML) resources, along with links to corresponding publications. We update this page as new resources are released. You can find these resources on GitHub and through the NCI Predictive Oncology Model and Data Clearinghouse.  

You can also view a full list of publications associated with the AI/ML resources.

AI/ML Resources

Key for openly available software, data sets, or models:
  • Pre-review: not independently validated
  • Released: independently reviewed and validated
  • Enhanced: added features to make it easier to use or adapt


Modeling Outcomes Using Surveillance Data and Scalable AI for Cancer (MOSSAIC)

MOSSAIC (an extension of Pilot 3—Population Information Integration, Analysis, and Modeling for Precision Surveillance) applies natural language processing (NLP) and deep learning algorithms to population-based cancer data collected by NCI's Surveillance, Epidemiology, and End Results (SEER) program. MOSSAIC advances computational and informatics solutions to support SEER and lays the foundation for an integrative data-driven approach to modeling cancer outcomes at scale and in real time.

You can find descriptions of and access the MOSSAIC (and Pilot 3) AI/ML resources and related publications below.

AI/ML Capability Status Type Description Impact Publication reference number(s)
HiSAN Released Model (trained**) Hierarchical self-attention network for information extraction from cancer pathology reports. Allows automatic information extraction from free-form pathology report texts. More accurate than MT-CNN. 92
MT-CNN Released Model (trained**) A convolutional neural network for natural language processing and information extraction from free-form texts. Allows automatic information extraction from free-form pathology report texts. Faster than HiSAN. 1, 3, 8, 9, 34, 44, 70, 92, 93
ML Ready Pathology Reports Released Data Set Machine learning ready pathology reports with the associated site and histology labels downloaded from the Genomic Data Commons. Enables users to have a pathology report data set to use with many of the other capabilities.  
Active Learning for NLP Systems Released Software Offers an active learning framework for natural language processing of pathology reports. Enables rapid annotation of pathology reports via machine learning.  
SYNDATA Released Software Suite of statistical and machine learning methods to generate discrete/categorical synthetic data. Can produce real-like synthetic clinical data when access to real patient data is limited. 69
Multitask Deep Neural Network (DNN) Released Model (trained**) Multitask DNN for information extraction of text. Allows automatic extraction of cancer-relevant information from free-text pathology reports. 1
Recurrent Neural Network, Long Short-Term Memory (RNN-LSTM) Released Model (trained**) LSTM model to generate synthetic biomedical text of desired clinical context. Creates examples of unstructured text with a specific label from a given corpus that can be used for training machine learning or deep learning models on clinical text.  

*No coefficients (parameter values) established. Trained models will be added as they become available.

**Trained model is defined by combining untrained model + data + weights.
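
The Active Learning for NLP Systems entry above describes choosing which pathology reports to annotate with the help of a model. As a rough illustration of that general idea only, the sketch below ranks unlabeled reports by the entropy of a classifier's predicted class probabilities and picks the most uncertain ones. This is generic uncertainty sampling, not necessarily the strategy used by the released software; the report IDs, probabilities, and function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predictions, batch_size):
    """Return the IDs of the `batch_size` reports the model is least certain about."""
    ranked = sorted(predictions, key=lambda rid: entropy(predictions[rid]), reverse=True)
    return ranked[:batch_size]


# Hypothetical class probabilities for four unlabeled reports (illustrative values only).
predictions = {
    "report_a": [0.98, 0.01, 0.01],
    "report_b": [0.40, 0.35, 0.25],
    "report_c": [0.70, 0.20, 0.10],
    "report_d": [0.34, 0.33, 0.33],
}

chosen = select_for_annotation(predictions, batch_size=2)
print(chosen)
```

In a full active learning loop, the selected reports would be labeled by an annotator, added to the training set, and the model retrained before the next round of selection.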

AI-Driven Multiscale Investigation of the RAS/RAF Activation Lifecycle (ADMIRRAL)

ADMIRRAL (an extension of Pilot 2—Improving Outcomes for RAS-related Cancers) aims to develop a mechanistic understanding of RAS-RAF-driven cancer initiation and growth. Combining machine learning, high performance computing, and experimentation, the project will delineate large-scale domain rearrangement (with molecular resolution) of the RAS-RAF complex and describe the activation of RAF kinase.

You can find descriptions of and access the ADMIRRAL (and Pilot 2) AI/ML resources and related publications below.

AI/ML Capability Status Type Description Impact Publication reference number(s)
MuMMI Released Software Multiscale Machine-Learned Modeling Infrastructure (MuMMI): Developed to study KRAS and RAF kinase protein dynamics, interactions, and mechanisms in the context of a realistic membrane at length and time scales relevant for gaining biological insights. MuMMI orchestrates massively parallel, multiscale simulations using an ML-driven sampling framework, enabling hundreds of thousands of simulations that explore sufficient space to yield statistically significant measurements and new hypotheses. Produces data such as the KRas4B: Campaign 1 Trajectory data set for use in models. 56, 110
KRas4B: Campaign 1 Trajectory Released Data Set Trajectory data from the simulations of KRas4B in membranes. Two data sets are available: Splash Run 2 and Splash Run 4. Utilities to analyze the data sets are not currently available. Membrane interactions of the globular domain and the hypervariable region of KRas4B define its unique diffusion behavior. 60
Crystal structure of KRAS bound with RAF1 RBDCRD Released Data Set Crystal structures of wild-type and oncogenic mutants of KRAS complexed with the RAS-binding domain (RBD) and the membrane-interacting cysteine-rich domain (CRD) from the N-terminal regulatory region of RAF1 are elucidated. Three structures related to Pilot 2 are listed: 6XI7, 6XHB, 6VJJ. This novel structure enables drug discovery of inhibitors against this complex.  
Crystal structure of RBDCRD alone or bound to membrane mimetic Released Data Set Crystal structures of RBDCRD alone or bound to membrane mimetic. Three structures related to Pilot 2 are listed: 6VC8, 6VJJ, 5TB5. Detailed structure allows more accurate modeling of protein-membrane interactions.
MemSurfer Released Software Computes and analyzes membrane surfaces found in a wide variety of large-scale molecular simulations. MemSurfer works independent of the type of simulation, directly on the 3D point coordinates. Enables assessment of lipid membrane curvature and density; allows counting of normal lipids and area per lipid. Also provides a simple-to-use Python API to perform other types of analysis. 29, 50
DynIm Released Software This is the first tool to perform “dynamic” sampling where the input distribution can change over time and the sampling adapts itself to the new distribution. Enables machine learning-based adaptive multiscale simulations for cancer biology.  
P2B2 - Autoencoder Released Model (trained**) A neural network model that reduces its inputs to a smaller set of features and subsequently builds the features back up from the minimally-sized "latent space" while attempting to accurately recreate the inputs - a dimensionality reduction algorithm. Used to generate a tractable set of features from a larger input data set that can then be fed into additional models for a variety of purposes.  

*No coefficients (parameter values) established. Trained models will be added as they become available.

**Trained model is defined by combining untrained model + data + weights.
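
The P2B2 - Autoencoder entry above describes reducing inputs to a small latent space and rebuilding them from it. The toy sketch below illustrates that idea with a linear autoencoder that compresses 2D points to one latent feature and reconstructs them, trained by plain gradient descent. It is not the released P2B2 model; the data, starting weights, learning rate, and function name are assumptions chosen for illustration.

```python
def train_linear_autoencoder(data, steps=1000, lr=0.05):
    """Train a 2 -> 1 -> 2 linear autoencoder with plain gradient descent."""
    w = [0.5, 0.1]  # encoder weights (fixed start so the run is reproducible)
    v = [0.1, 0.5]  # decoder weights
    n = len(data)
    losses = []
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_v = [0.0, 0.0]
        loss = 0.0
        for x in data:
            z = w[0] * x[0] + w[1] * x[1]           # encode: one latent feature
            r = [v[0] * z - x[0], v[1] * z - x[1]]  # reconstruction residual
            loss += r[0] ** 2 + r[1] ** 2
            back = v[0] * r[0] + v[1] * r[1]        # residual passed back through the decoder
            for i in range(2):
                grad_v[i] += 2 * r[i] * z
                grad_w[i] += 2 * back * x[i]
        losses.append(loss / n)  # mean squared reconstruction error before this update
        for i in range(2):
            w[i] -= lr * grad_w[i] / n
            v[i] -= lr * grad_v[i] / n
    return w, v, losses


# Illustrative data: 2D points that lie on a line, so one latent feature suffices.
data = [(t / 10, 2 * t / 10) for t in range(-10, 11)]
w, v, losses = train_linear_autoencoder(data)
print(f"initial error: {losses[0]:.4f}, final error: {losses[-1]:.2e}")
```

The latent value `z` is the reduced feature that could be passed to a downstream model; real autoencoders use nonlinear layers and many more dimensions.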

Innovative Methodologies and New Data for Predictive Oncology Model Evaluation (IMPROVE)

IMPROVE (an extension of Pilot 1—Predictive Modeling for Pre-Clinical Screening) develops approaches and a framework for comparing and evaluating deep learning drug response prediction models. The IMPROVE project team, together with the broader scientific community, discovers methods to identify deep learning model attributes—and new data—that contribute to prediction performance and reproducibility to improve future models.

We will post new AI/ML resources as we release them.

Predictive Modeling for Pre-Clinical Screening: Pilot 1

Pilot 1 focused on developing predictive models of drug responses in pre-clinical cancer screening to improve and expedite the selection and development of new targeted therapies for patients with cancer.

While this pilot concluded in 2020, you can access all the AI/ML resources from Pilot 1 in the table below.

AI/ML Capability Status Type Description Impact Publication reference number(s)
Uno: Unified drug response predictor Released Model (trained**) Shows how to train and use a neural network model to predict tumor dose response across multiple data sources. Initial data provided from: CCLE, CTRP, gCSI, GDSC, NCI60 single drug response, and ALMANAC drug pair response. Enables drug discovery, drug response prediction from cell lines.  
Combo: Combination drug response predictor Released Model (trained**) Predicts combinations of drug responses under different experimental configurations. Enables predictions of drug responses under different experimental configurations. 23
Cancer Drug Response Prediction Released Data Set Provides dataframes (e.g., gene expression and drug response data, molecular descriptors) and supporting metadata used by the Combo, P1B3, and Uno models in the Pilot 1 project. Makes it possible to systematically model tumor drug response with deep learning models more suited for large-scale data.
Enhanced COXEN Released Software Enhanced Co-Expression Extrapolation (COXEN) gene selection method extends the original COXEN method to select genes that are predictive of the efficacies of multiple drugs for building general drug response prediction models that are not specific to a particular drug. Enables building of anti-cancer drug response prediction models using selected genes and drugs. 80
IGTD: Imaging Generator for Tabular Data Released Software Transforms tabular data into images by assigning features to pixel positions so that similar features are close to each other in the image. Convolutional neural networks (CNNs) can be built based on the image representations for prediction tasks. 100
LC: Learning Curves Released Software Learning curves is an empirical method that allows evaluation of a supervised learning model to determine if it can be further improved with more training data. May help to decide whether it would be worthwhile to collect more data and provide a framework for assessing the data scaling behavior of these predictors. 86
TC1: Tissue type classifier Released Model (trained**) Classifies tumor type based on sequence data. Augments existing data quality control methods.
NT3: Normal-tumor pair classifier Released Model (trained**) Classifies tumor type; augments existing data quality control methods. Offers a 1D-convolutional network for classifying RNA-seq gene expression profiles into normal or tumor tissue categories. 57

P1B1: Gene expression autoencoder Released Model (trained**) Given a sample of gene expression data, builds a sparse autoencoder that can compress the expression profile into a low-dimensional vector. Offers an autoencoder to collapse high-dimensional expression profiles into low-dimensional vectors without significant loss of information.
P1B2: Mutation classifier Released Model (trained**) Given patient somatic SNPs, builds a deep learning network that can classify the cancer type. Offers a means for classifying sparse data.
P1B3: Single Drug Response Predictor Released Model (trained**) Given drug screening results on NCI60 cell lines, builds a deep learning network that can predict the growth percentage from cell line gene expression data, drug concentration, and drug descriptors. Enables prediction of growth percentage of a cell line treated with a new drug.

ANS: Autoencoder Node Saliency Released Software Identifies the saliency of hidden nodes in autoencoders by ranking hidden nodes in the latent layer according to their capability of performing a learning task. Explains the unsupervised learning process in autoencoders. 33
CLRNA: Semi-Supervised Feature Learning with Center Loss Released Software Semi-supervised, autoencoder-based, machine learning procedure, which learns a smaller set of gene expression features that are robust to batch effects using background information on a cell line or tissue’s tumor type. Dimension reduction of gene expression data using a deep learning algorithm – enables learning about more generalized gene expression features for drug response.  

*No coefficients (parameter values) established. Trained models will be added as they become available.

**Trained model is defined by combining untrained model + data + weights.
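
The LC: Learning Curves entry above concerns judging whether more training data would improve a model. One common way to reason about this, sketched below, is to fit a power law, error = a * n^(-b), to validation errors measured at several training set sizes and extrapolate to a larger size. The power-law form, the function names, and the synthetic error values are assumptions for illustration; the released LC software may fit and evaluate learning curves differently.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * n**(-b), done as a line fit in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_error(a, b, n):
    """Extrapolated error at training set size n under the fitted power law."""
    return a * n ** (-b)


# Synthetic errors generated from a known power law (not real measurements).
sizes = [1000, 2000, 4000, 8000, 16000]
errors = [2.0 * n ** -0.5 for n in sizes]

a, b = fit_power_law(sizes, errors)
projected = predict_error(a, b, 32000)
print(f"a={a:.3f}, b={b:.3f}, projected error at n=32000: {projected:.4f}")
```

A small fitted exponent `b` suggests that collecting more data would bring little improvement, while a larger `b` suggests the model is still data-limited.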

Accelerating Therapeutics for Opportunities in Medicine (ATOM)

ATOM accelerates drug discovery through integrated AI, high performance computing, and biomedical data. ATOM employs active learning to identify and optimize new compounds to satisfy multiple pharmaceutical parameters concurrently. In addition, ATOM delivers open models, data, and software for use by the community to shorten the time to discovery and optimization of molecules, including new treatments, probe molecules, and imaging agents.

You can find descriptions of and access the ATOM AI/ML resources and related publications below. For additional information, visit the ATOM website.

AI/ML Capability Status Type Description Impact Publication reference number(s)
ATOM Modeling PipeLine (AMPL) Released Software Offers an open source, modular, extensible software pipeline for building and sharing models to advance in silico drug discovery. Extends the functionality of DeepChem and supports an array of machine learning and molecular featurization tools. AMPL benchmarks on a wide range of parameters are currently available for several pharmaceutical data sets. 66, 73, 90

NCI-DOE Collaboration Infrastructure

CANcer Distributed Learning Environment (CANDLE)

Co-developed by DOE and NCI’s Frederick National Laboratory for Cancer Research, with support from the DOE’s Exascale Computing Project, CANDLE is an open source software platform that provides deep learning methodologies for accelerating cancer research. CANDLE is available on GitHub and on the NIH Biowulf cluster.

AI/ML Capability Status Type Description Impact Publication reference number(s)
CANDLE Software Stack Enhanced Software Improves machine/deep learning models by performing hyperparameter optimization. Enables hyperparameter optimization on machine/deep learning models.  

The Predictive Oncology Model and Data Clearinghouse (MoDaC)

MoDaC is a portal that hosts predictive oncology data sets and computational models. MoDaC allows you to search, download, and use NCI-DOE Collaboration computational resources.

AI/ML Capability Status Type Description Impact Publication reference number(s)
MoDaC: Predictive Oncology Model and Data Clearinghouse Released Software Platform Offers a public-facing repository to enable sharing of NCI-DOE Collaboration data sets with the cancer research community. Provides a web-based interface for NCI-DOE researchers to upload large, annotated data sets, which then can be searched by metadata and downloaded. The web application leverages the Data Services API core in the backend to provide access to an S3 object store. Salient features include:
  • Generic, expandable data hierarchy and metadata structure
  • Metadata-based searches of files and collections
  • Multi-level data access policy for open (without user registration), registered, or controlled access
  • Ability to keep data sets private or restricted (group-level access) until ready for sharing (useful for pre-publication data)
  • Support for data transfers to/from Globus and AWS S3 endpoints