Cancer Data Science Pulse

Cloud Resources: Cancer Genomics Cloud Helps Power Data Discovery and Analysis to Advance Cancer Research

This blog post continues our in-depth look into NCI’s Cloud Resources. Here we examine the Cancer Genomics Cloud, a tool from Seven Bridges that is helping researchers to access disparate data sets efficiently and effectively. To read more about all the NCI Cloud Resources available today, visit the previous blog, “NCI’s Cloud Resources Help Tame Today’s Data Windfall.”

The exponential growth and continued diversity of complex data sets pose an ongoing challenge for cancer researchers in the data science field. In fact, researchers often point to this hurdle as one reason for avoiding the use of existing data sets for secondary research. Combining different data sets takes significant time and resources. Other obstacles include difficulty discovering, accessing, and sharing data, as well as a lack of computational power.

NCI sought to address these challenges by building the Cancer Research Data Commons (CRDC), a cloud-based ecosystem designed to facilitate the access, analysis, and sharing of cancer data across the cancer research community. The NCI Cloud Resources are an integral part of this ecosystem. These analytic engines help power the CRDC by giving users the computational strength to make meaningful discoveries in cancer research.

One of these engines is the Cancer Genomics Cloud (CGC), developed by Seven Bridges. Allowing researchers to conduct cancer data analysis more efficiently within the cloud-based platform, the CGC pulls together: 

  • large cancer data sets, such as The Cancer Genome Atlas (TCGA), the Clinical Proteomic Tumor Analysis Consortium (CPTAC), and others;
  • more than 600 bioinformatics tools and best-practice multi-omics analysis workflows; and
  • the computational capabilities to perform large-scale analyses.

Figure of Cancer Genomics Cloud features: (1) easy data management; (2) secure collaboration and managed billing; (3) flexible and fully reproducible methods; (4) optimized bioinformatics algorithms; (5) scalable computation; (6) extensible and developer-friendly tools.

Figure 1: The Cancer Genomics Cloud is a cloud-based platform that brings together access to data, tools, and computational power to improve the ease of analyzing cancer data. 

Altogether, the CGC gives researchers immediate access to more than three petabytes of multi-dimensional data.

Built for Functionality

The CGC platform is built for researchers regardless of their cloud computing skills. A user-friendly portal allows researchers to browse, query, and filter data sets. Researchers can also bring their own data to the CGC to combine with publicly available data. The platform can be used with data stored in either the Amazon or Google clouds, so users can run computations in the location where the data “lives,” eliminating the need to download large data sets. Each new user receives $300 in credits to try out the cloud capabilities. In addition, users (particularly students and postdocs) can submit a Collaborative Project proposal to request up to $10,000 in credits for research questions in new and compelling areas.

Additionally, the platform is designed to support the latest in security and FAIR (Findable, Accessible, Interoperable, and Reusable) principles. Services also support other global technical standards, such as the specifications developed by the Global Alliance for Genomics and Health (GA4GH) for the Data Repository Service (DRS), Workflow Execution Service (WES), and Tool Registry Service (TRS). The platform uses Common Workflow Language (CWL) to port tools to the CGC, allowing users to access and analyze highly distributed data in a standardized, reproducible, and efficient manner. This CWL functionality also allows users to bring their own analysis tools to the platform and to use them within a private workspace.
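To give a feel for what "porting a tool with CWL" involves, here is a minimal sketch of a CWL tool description, built as a Python dictionary and printed as JSON (CWL documents are usually written in YAML, and JSON is a valid subset). The wrapped command (`wc -l`), the input and output names, and the output file name are illustrative assumptions, not an app from the CGC.

```python
import json

# Hypothetical example: describe the Unix `wc -l` command as a
# CWL CommandLineTool. All names below are illustrative only.
line_count_tool = {
    "cwlVersion": "v1.2",
    "class": "CommandLineTool",
    "baseCommand": ["wc", "-l"],
    "inputs": {
        # One file input, passed as the first positional argument.
        "input_file": {
            "type": "File",
            "inputBinding": {"position": 1},
        }
    },
    # Capture standard output into a named file ...
    "stdout": "line_count.txt",
    # ... and expose that captured file as the tool's output.
    "outputs": {
        "line_count": {"type": "stdout"}
    },
}


def to_cwl_json(tool):
    """Serialize a CWL description to a JSON string."""
    return json.dumps(tool, indent=2)


if __name__ == "__main__":
    print(to_cwl_json(line_count_tool))
```

Because the description names the command, its inputs, and its outputs explicitly, the same document can in principle be run by any CWL-compliant engine, which is what makes the approach portable and reproducible.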

Workflows at Your Fingertips

Screenshot of the CGC Public Apps page (the third navigation item on the left). The page shows an info card for each of the 616 publicly available apps. Three app cards are fully shown: (1) Alignment Metrics QC (toolkit version: SBGTools 1), a pipeline that provides statistics for judging alignment quality, labeled quality-control and SAM/BAM-processing; (2) Bismark Analysis (toolkit version: Bismark 0.19.0), a workflow for analyzing DNA methylation, a type of epigenetic modification, labeled Epigenetics and Methylation; (3) BROAD Best Practices RNA-Seq Variant Callin…, a workflow representing the GATK Best Practices for SNP and INDEL calling on RNA-Seq data, labeled Transcriptomics and Variant Calling. Each card has a Copy button and a Run button.

The Public Apps Gallery on the CGC contains hundreds of tools and workflows.

Users seeking common, best-practice analysis methods can find them within the CGC. Its Public Apps Gallery contains hundreds of pre-built tools and workflows that have been cloud-optimized by the Seven Bridges Bioinformatics Team. The Public Apps cover a wide range of research areas, including RNA-seq, mutation or variant analysis, single-cell analysis, proteomics, epigenetics, and imaging. Each tool includes a detailed explanation and benchmarking data for estimating time and cost. In addition, the Workflow Editor, a tool within the CGC, allows researchers to wrap their custom workflows in CWL to tailor analysis to their exact needs. Other CGC resources include extensive online documentation, training resources, and technical support from a team of more than 200 expert scientists, bioinformaticians, and engineers, to help new and advanced users alike.

Screenshot of workflow editor interface. At top, there are tabs for “My Projects” and “Public Apps”. In this left-hand section, you can search for public apps, or browse and select from the list below. To the right is a visual editor window. At the top of the window there are three tabs: “App Info”, “Visual Editor”, and “Code.” The “Visual Editor” view is displayed showing a series of circles connected in the following order: 1. Gene-cell count matrices (with file icon) 2. Load Single-Cell Data (with code icon), 3. Quality Control (with code icon), 4. Normalisation Transformation and PCA (with code icon), 5. Clustering and Biomarker Identification (with code icon), and 4 circles in no order with file icons stating, “Output Seurat object”, “Clustering results table”, “Report”, and “Biomarker plots.”

The workflow editor makes it easy to wrap a tool in CWL for portability to the cloud. 
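A step graph like the one in the editor corresponds to a CWL Workflow document in which each step names the tool it runs and where its inputs come from. The sketch below is an assumption-laden illustration, not the workflow from the screenshot: the step names, file names, and port names are hypothetical. It also includes a small helper that checks that every connection points at a real workflow input or step output.

```python
import json

# Hypothetical two-step workflow; names are illustrative only.
workflow = {
    "cwlVersion": "v1.2",
    "class": "Workflow",
    "inputs": {"count_matrix": "File"},
    "outputs": {
        "report": {"type": "File", "outputSource": "clustering/report"}
    },
    "steps": {
        "quality_control": {
            "run": "quality_control.cwl",
            "in": {"matrix": "count_matrix"},
            "out": ["filtered_matrix"],
        },
        "clustering": {
            "run": "clustering.cwl",
            # "step_name/output_name" wires one step's output to the next.
            "in": {"matrix": "quality_control/filtered_matrix"},
            "out": ["report"],
        },
    },
}


def unresolved_sources(wf):
    """Return sources that match no workflow input and no step output."""
    known = set(wf["inputs"])
    for step_name, step in wf["steps"].items():
        known.update(f"{step_name}/{out}" for out in step["out"])
    wanted = [src for step in wf["steps"].values() for src in step["in"].values()]
    wanted += [out["outputSource"] for out in wf["outputs"].values()]
    return [src for src in wanted if src not in known]


if __name__ == "__main__":
    print(json.dumps(workflow, indent=2))
    print("Unresolved connections:", unresolved_sources(workflow))
```

A visual editor performs this kind of wiring for the user; the check here only covers the simple string form of connections used in this sketch.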

CGC Case Report

Since its launch in 2016, the CGC has helped thousands of users in their research. To date, more than 6,000 users have run almost 2 million tasks, adding up to thousands of years of compute time. More recently, the CGC was integrated with data repositories within the CRDC, including the Genomic Data Commons, the Proteomic Data Commons, the Integrated Canine Data Commons, and the Cancer Data Service, to support research questions that have become increasingly multi-omic. Researchers can examine multiple data types side by side, enabling new insights into cancer progression.

As one example, McKerrow and colleagues at New York University recently examined the role of long interspersed nuclear element-1 (LINE-1), an important driver and diagnostic marker of disease, and its connection to cancer progression. Through the CGC’s cloud-based access to data and computation, the investigators were able to examine the DNA, RNA, and protein expression profiles of corresponding samples in both TCGA and CPTAC. As a result, they were able to show a correlation between LINE-1 activity and tumor progression in multiple tumor types.

Many researchers have also used the CGC to develop and distribute new tools and methodologies. A recent paper, published by Julia Salzman’s lab at Stanford University, showcases a new algorithm for examining how regions of genes (splice junctions) are excised out of the primary RNA transcript prior to synthesis of the specific protein. Using single-cell data from the CGC, the authors developed a new method that assigns statistical confidence to splice junctions from a spliced aligner to improve precision in single-cell sequencing. Their work demonstrated that the SICILIAN (SIngle Cell precIse spLice estImAtioN) method improves splice junction detection and is applicable to several data types. SICILIAN also helped reveal new regulated splicing patterns in primary human and non-human samples that weren’t evident using previous methodologies.

These are just two ways the platform is being used to promote research into the genes underlying cancer. For additional information on the CGC and these case reports, see the citations below:

Scientific Program Manager, Seven Bridges
Technical Writer, Seven Bridges