NCI-DOE Collaboration Capabilities

The NCI-Department of Energy (DOE) partnership enables future research by making computational tools, algorithms, data sets, and other capabilities resulting from this collaboration available to the broader research community. NCI has established a Capability Transfer Team to help researchers understand, access, implement, and extend the capabilities offered under this initiative. To explore ways you can collaborate, contact our team at

The Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program is a focal point of the strategic, interagency collaboration between NCI and DOE to simultaneously accelerate developments in precision oncology and advanced scientific computing.

Based on a multidisciplinary team science approach, JDACS4C’s three research pilots were co-designed by NCI and DOE, align with several existing NCI and DOE programs, and are jointly led by NCI- and DOE-supported scientists. These teams include scientists from NCI and the Frederick National Laboratory for Cancer Research (FNLCR), and experts from DOE national laboratories, principally Argonne, Lawrence Livermore, Los Alamos, and Oak Ridge.

Below is a list of NCI-DOE Collaboration capabilities available for public use. This list will be updated as new capabilities are released. For those capabilities not yet transferred, the original location is provided and will be updated with the final location when available. 

A full listing of the latest publications is also available.


Capabilities as of July 22, 2021:

  • Pre-release: software, data set, or model has been completed by a research group and, in their opinion, is ready to be seen by the public.
  • Released: software, data set, or model has been moved from the research group site to a public site by the Capability Transfer Team of FNLCR, defined as Mirrored, Validated, and Released.
  • Enhanced: software, data set, or model that has been transferred and has additional components that 1) increase visibility and activity, 2) allow it to be adapted for broader use, or 3) enable it to be used by the extramural community.


Pilot 1 Predictive Modeling for Pre-Clinical Screening

The Predictive Modeling for Pre-Clinical Screening Pilot aims to develop predictive capabilities of drug responses in pre-clinical models of cancer to improve and expedite the selection and development of new targeted therapies for patients with cancer.

| Capability | Status | Type | Description | Impact | Publication reference number(s) |
| Uno: Unified drug response predictor | Released | Model (trained**) | Shows how to train and use a neural network model to predict tumor dose response across multiple data sources. Initial data are provided from CCLE, CTRP, gCSI, GDSC, NCI60 single drug response, and ALMANAC drug pair response. | Enables drug discovery and drug response prediction from cell lines. | |
| Combo: Combination drug response predictor | Released | Model (trained**) | Predicts combinations of drug responses under different experimental configurations. | Enables predictions of drug responses under different experimental configurations. | 17 |
| Cancer Drug Response Prediction | Released | Data set | Provides dataframes (e.g., gene expression and drug response data, molecular descriptors) and supporting metadata used by the Combo, P1B3, and Uno models in the Pilot 1 project. | Makes it possible to systematically model tumor drug response with deep learning models more suited for large-scale data. | |
| Enhanced COXEN | Released | Software | The Enhanced Co-Expression Extrapolation (COXEN) gene selection method extends the original COXEN method to select genes that are predictive of the efficacies of multiple drugs, for building general drug response prediction models that are not specific to a particular drug. | Enables building of anti-cancer drug response prediction models using selected genes and drugs. | 57 |
| IGTD: Imaging Generator for Tabular Data | Released | Software | Transforms tabular data into images by assigning features to pixel positions so that similar features are close to each other in the image. | Convolutional neural networks (CNNs) can be built based on the image representations for prediction tasks. | 72 |
| LC: Learning Curves | Pre-release | Software | Learning curves are an empirical method for evaluating whether a supervised learning model can be further improved with more training data. | May help decide whether collecting more data is worthwhile, and provides a framework for assessing the data scaling behavior of predictors. | 60 |
| TC1: Tissue type classifier | Released | Model (trained**) | Classifies tumor type based on sequence data. | Augments existing data quality control methods. | |
| NT3: Normal-tumor pair classifier | Released | Model (trained**) | Classifies tumor type; augments existing data quality control methods. | Offers a 1D-convolutional network for classifying RNA-seq gene expression profiles into normal or tumor tissue categories. | 38 |
| P1B1: Gene expression autoencoder | Released | Model (trained**) | Given a sample of gene expression data, builds a sparse autoencoder that can compress the expression profile into a low-dimensional vector. | Offers an autoencoder to collapse high-dimensional expression profiles into low-dimensional vectors without significant loss of information. | |
| P1B2: Mutation classifier | Released | Model (trained**) | Given a patient's somatic SNPs, builds a deep learning network that can classify the cancer type. | Offers a means for classifying sparse data. | |
| P1B3: Single Drug Response Predictor | Released | Model (trained**) | Given drug screening results on NCI60 cell lines, builds a deep learning network that can predict the growth percentage from cell line gene expression data, drug concentration, and drug descriptors. | Enables prediction of growth percentage of a cell line treated with a new drug. | |
| ANS: Autoencoder Node Saliency | Released | Software | Identifies the saliency of hidden nodes in autoencoders by ranking the hidden nodes in the latent layer according to their capability of performing a learning task. | Explains the unsupervised learning process in autoencoders. | 23 |

*No coefficients (parameter values) established. Trained models will be added as they become available.

**Trained model is defined by combining untrained model + data + weights.
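
The learning-curve method described in the LC entry above can be illustrated with a short sketch. It is not the LC software itself: the synthetic data, the least-squares model, and the function name learning_curve are assumptions made only for illustration. The idea is to fit the same model on progressively larger training subsets and track the error on a fixed validation set.

```python
import numpy as np

def learning_curve(x_train, y_train, x_val, y_val, sizes):
    """Return (training size, validation MSE) pairs for a least-squares model."""
    curve = []
    for n in sizes:
        # Fit ordinary least squares on the first n training samples only.
        coef, *_ = np.linalg.lstsq(x_train[:n], y_train[:n], rcond=None)
        val_mse = float(np.mean((x_val @ coef - y_val) ** 2))
        curve.append((n, val_mse))
    return curve

# Synthetic regression data (hypothetical stand-in for a drug response data set).
rng = np.random.default_rng(0)
n_features = 20
true_coef = rng.normal(size=n_features)
x = rng.normal(size=(1200, n_features))
y = x @ true_coef + rng.normal(scale=1.0, size=1200)

x_train, y_train = x[:800], y[:800]
x_val, y_val = x[800:], y[800:]

curve = learning_curve(x_train, y_train, x_val, y_val,
                       sizes=[25, 50, 100, 200, 400, 800])
for n, mse in curve:
    print(f"n={n:4d}  validation MSE={mse:.3f}")
```

If the validation error is still falling at the largest training size, collecting more data is likely to help; if it has flattened, the model or the features are the more probable bottleneck.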

Pilot 2 Improving Outcomes for RAS-Related Cancers

Improving Outcomes for RAS-related Cancers aims to deliver a validated multiscale model of RAS biology on a cell membrane by combining the experimental capabilities at the FNLCR with the computational resources of the National Nuclear Security Administration (NNSA), a semi-autonomous DOE agency. The principal challenge in modeling this system is the diverse length and time scales involved.

| Capability | Status | Type | Description | Impact | Publication reference number(s) |
| MuMMI | Pre-release | Data set | Multiscale Machine-Learned Modeling Infrastructure that supports very large, multiscale simulations of molecular dynamics interactions between proteins (or their domains) and each other or cell membranes. The software is currently being packaged. | Produces data such as the KRas4B Campaign 1 Trajectory data for use in models. | 39 |
| KRas4B: Campaign 1 Trajectory | Pre-release | Data set | Trajectory data from the simulations of KRas4B in membranes. The process of making this data publicly available is underway. | Membrane interactions of the globular domain and the hypervariable region of KRas4B define its unique diffusion behavior. | 42 |
| Crystal structure of KRAS bound with RAF1 RBDCRD | Released | Data set | Crystal structures of wild-type and oncogenic mutants of KRAS complexed with the RAS-binding domain (RBD) and the membrane-interacting cysteine-rich domain (CRD) from the N-terminal regulatory region of RAF1. Three structures related to Pilot 2 are listed: 6XI7, 6XHB, 6VJJ. | This novel structure enables the discovery of inhibitors against this complex. | |
| Crystal structure of RBDCRD alone or bound to membrane mimetic | Released | Data set | Crystal structures of RBDCRD alone or bound to membrane mimetic. Three structures related to Pilot 2 are listed: 6VC8, 6VJJ, 5TB5. | Detailed structure allows more accurate modeling of protein-membrane interactions. | |
| MemSurfer | Pre-release | Software | Computes and analyzes membrane surfaces found in a wide variety of large-scale molecular simulations. MemSurfer works independently of the type of simulation, directly on the 3D point coordinates. | Enables assessment of lipid membrane curvature and density; allows counting of normal lipids and area per lipid. Also provides a simple-to-use Python API to perform other types of analysis. | |
| DynIm | Released | Software | The first tool to perform “dynamic” sampling, in which the input distribution can change over time and the sampling adapts to the new distribution. | Enables machine learning-based adaptive multiscale simulations for cancer biology. | |
| P2B2 - Autoencoder | Released | Model (trained**) | A dimensionality reduction algorithm: a neural network model that reduces its inputs to a smaller set of features and then rebuilds the inputs from this minimally sized "latent space" as accurately as possible. | Used to generate a tractable set of features from a larger input data set that can then be fed into additional models for a variety of purposes. | |
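
The encode-then-reconstruct idea behind the P2B2 autoencoder can be sketched in a few lines. This is not the P2B2 model: as a simplifying assumption, it replaces the neural network with a linear autoencoder whose weights are obtained in closed form by principal component analysis (via SVD), and all function names and data are hypothetical.

```python
import numpy as np

def fit_linear_autoencoder(x, latent_dim):
    """Fit a linear encoder/decoder pair using PCA (closed form, no training loop)."""
    mean = x.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    components = vt[:latent_dim]
    return mean, components

def encode(x, mean, components):
    """Reduce the inputs to a smaller set of latent features."""
    return (x - mean) @ components.T

def decode(z, mean, components):
    """Rebuild an approximation of the inputs from the latent features."""
    return z @ components + mean

# Synthetic data: 20 observed features driven by 3 hidden factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
x = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

mean, components = fit_linear_autoencoder(x, latent_dim=3)
z = encode(x, mean, components)
x_hat = decode(z, mean, components)
mse = float(np.mean((x - x_hat) ** 2))
print(f"latent shape: {z.shape}, reconstruction MSE: {mse:.6f}")
```

A neural autoencoder generalizes this by making the encoder and decoder nonlinear and learning their weights by gradient descent, but the role of the latent features as a compact input for downstream models is the same.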

Pilot 3 Population Information Integration, Analysis, and Modeling for Precision Surveillance

Population Information Integration, Analysis, and Modeling for Precision Surveillance aims to leverage high-performance computing and artificial intelligence to meet the emerging needs of cancer surveillance. Pilot 3 also seeks to develop a fully integrated, data-driven modeling and simulation framework to enable meaningful translation of large-scale Surveillance, Epidemiology, and End Results (SEER) data.

| Capability | Status | Type | Description | Impact | Publication reference number(s) |
| HiSAN | Released | Model (trained**) | Hierarchical self-attention network for information extraction from cancer pathology reports. | Allows automatic information extraction from free-form pathology report texts. More accurate than MT-CNN. | 66 |
| MT-CNN | Released | Model (trained**) | A convolutional neural network for natural language processing and information extraction from free-form texts. | Allows automatic information extraction from free-form pathology report texts. Faster than HiSAN. | 1, 3, 7, 8, 19, 24, 30, 52, 66, 67 |
| ML Ready Pathology Reports | Released | Data set | Machine learning ready pathology reports with the associated site and histology labels downloaded from the Genomic Data Commons. | Enables users to have a pathology report data set to use with many of the other capabilities. | |
| Active Learning for NLP Systems | Pre-release | Software | Offers an active learning framework for natural language processing of pathology reports. | Enables rapid annotation of pathology reports via machine learning. | |

*No coefficients (parameter values) established. Trained models will be added as they become available.

**Trained model is defined by combining untrained model + data + weights.

Accelerating Therapeutics for Opportunities in Medicine (ATOM)

The ATOM Consortium is a public-private partnership whose mission is to transform drug discovery by accelerating the development of more effective therapies for patients.

| Capability | Status | Type | Description | Impact | Publication reference number(s) |
| ATOM Modeling PipeLine (AMPL) | Released | Software | Offers an open source, modular, extensible software pipeline for building and sharing models to advance in silico drug discovery. Extends the functionality of DeepChem and supports an array of machine learning and molecular featurization tools. | AMPL benchmarks on a wide range of parameters are currently available for several pharmaceutical data sets. | 48, 54, 64 |

CANcer Distributed Learning Environment (CANDLE)

CANDLE is an open source, collaboratively developed software platform that provides deep learning methodologies. Driven by scientific challenges in cancer research, as defined by JDACS4C pilot efforts, CANDLE capabilities build on advanced computing support from DOE’s Exascale Computing Project (ECP). CANDLE is deployed on NIH's Biowulf supercomputer.

| Capability | Status | Type | Description | Impact | Publication reference number(s) |
| CANDLE Software Stack | Enhanced | Software | Improves machine/deep learning models by performing hyperparameter optimization. | Enables hyperparameter optimization on machine/deep learning models. | |
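
Hyperparameter optimization, the task the CANDLE Software Stack entry above refers to, can be illustrated in its simplest form as an exhaustive grid search. The sketch below does not use CANDLE or its API: the function grid_search, the toy objective, and the search space are hypothetical, and in practice the objective would be a validation loss obtained by training a model.

```python
import itertools

def grid_search(objective, space):
    """Evaluate every combination in the search space and return the best one."""
    names = list(space)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        # Keep the combination with the lowest objective value seen so far.
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_objective(params):
    """Hypothetical stand-in for a validation loss; lowest at lr=0.01, 64 units."""
    lr_term = (params["learning_rate"] - 0.01) ** 2
    size_term = 1e-6 * (params["hidden_units"] - 64) ** 2
    return lr_term + size_term

space = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "hidden_units": [16, 32, 64, 128],
}

best_params, best_score = grid_search(toy_objective, space)
print(best_params, best_score)
```

Grid search scales poorly as the number of hyperparameters grows, which is one reason large deep learning searches are usually distributed across many compute nodes.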

The Predictive Oncology Model and Data Clearinghouse (MoDaC)

MoDaC is a data-sharing repository developed to transition resources to the broader research community. These resources include data sets and software models from computational capabilities developed within NCI and in collaboration with programs such as JDACS4C and ATOM. Annotated data sets stored in the repository are publicly available and can be searched against their metadata and downloaded for analysis.

| Capability | Status | Type | Description | Impact | Publication reference number(s) |
| MoDaC: Predictive Oncology Model and Data Clearinghouse | Released | Software Platform | Offers a public-facing repository to enable sharing of JDACS4C data sets with the cancer research community. Provides a web-based interface for NCI-DOE researchers to upload large, annotated data sets, which can then be searched by metadata and downloaded. The web application uses the Data Services API core in the backend to provide access to an S3 object store. | Enables storage and sharing of annotated data sets. | |

Salient features of MoDaC include:

  • Generic, expandable data hierarchy and metadata structure.
  • Metadata-based searches of files and collections.
  • Multi-level data access policy for open (without user registration), registered, or controlled access.
  • Ability to keep data sets private or restricted (group-level access) until ready for sharing (useful for pre-publication data).
  • Support for data transfers to/from Globus and AWS S3 endpoints.