Cancer Data Science Pulse

NCI’s Cloud Resources Help Tame Today’s Data Windfall

Extraordinary improvements in data-generation technologies and plummeting costs over the past decade have led to a veritable explosion in the amount and variety of data sets available today. Leveraging these data to probe the biology of cancer presents unprecedented opportunities to fuel scientific breakthroughs in cancer research.

Yet this golden age of data generation also creates considerable challenges, as data sets stretch the limits of the traditional data-sharing model. Until now, data were siloed in separate repositories or stored on each researcher’s desktop, which meant that data had to be downloaded to be useful. Unfortunately, as data sets grow ever larger, transfer times, computing power requirements, and data-management burdens have grown with them, putting researchers, particularly those at organizations without robust data infrastructures, at a disadvantage.

Many government agencies are working to solve these issues. Their goal is to make these data sets available on publicly accessible cloud platforms, along with powerful and highly scalable computing resources.

"Image comparing two approaches to data sharing. 
On the left side, the diagram shows the “Traditional approach, bring data to researchers.” Four server icons within a blue box have arrows pointing out toward four different computers, each with a different person and tools. On the right side, the diagram shows the “Cloud-centric approach, bring researchers to data.” Four people have arrows pointing into a blue box that contains four servers and four computers with tools."

This diagram illustrates the key difference between the traditional and cloud-centric approaches to data sharing and access.

Any researcher with appropriate credentials can access and analyze data in the cloud without ever having to transfer a single byte of information. Researchers can tap into as much computing power as they need and break free of the infrastructure constraints imposed by their individual institutions. Through the cloud, researchers can federate multiple data sets, representing a rich tapestry of biological tissues and cellular states, and perform multi-omic analyses that have the potential to generate novel insights. Moreover, they can work collaboratively, sharing data, tools, and knowledge across institutional and international borders.

Introducing the NCI Cloud Resources

To truly advance data sharing and analysis, this cloud-based model needs resources that can be used by people with a wide variety of skills, and not just those with a background in computing and engineering.

NCI’s Cancer Research Data Commons (CRDC) aims to address this need through cloud-based, intuitive interfaces offered by three NCI Cloud Resources: the Cancer Genomics Cloud, powered by Seven Bridges; the Broad Institute’s FireCloud, powered by Terra; and the ISB Cancer Gateway in the Cloud.

Through the CRDC, researchers have the data and computational power they need to better understand how cancer develops, how it progresses, and how it might be more effectively treated.

Text in the graphic reads "NCI Cloud Resource. Broad Institute FireCloud, ISB Cancer Gateway in the Cloud, Seven Bridges Cancer Genomics Cloud. Scalable Compute, Analysis Tools, Secure Workspaces."

As illustrated in this graphic, the three NCI Cloud Resources provide researchers, clinicians, and data scientists access to the data within NCI's Cancer Research Data Commons and computational tools.

Each Cloud Resource is an independent platform built on top of one or more commercial clouds such as Amazon Web Services (AWS) and Google Cloud. Each provides a unique combination of built-in tools and interfaces designed to empower researchers with all levels of experience, from bench scientists to bioinformaticians.

Working alongside the CRDC data repositories, the Cloud Resources enable researchers to access and analyze a deep catalog of data sets amounting to more than 3 petabytes of data, including those from The Cancer Genome Atlas Program (TCGA), The Cancer Imaging Archive (TCIA), and the Clinical Proteomic Tumor Analysis Consortium (CPTAC). Researchers can also bring their own data to the cloud, analyze their findings in combination with other CRDC-hosted data sets, and leverage a variety of fully compatible data sources contributed by other organizations.
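To put that 3-petabyte figure in perspective, here is a rough back-of-the-envelope sketch in Python of how long a full download would take under the traditional model. The link speeds are illustrative assumptions rather than figures from NCI, and the calculation assumes a perfectly sustained connection with no protocol overhead, so real transfers would likely take longer.

```python
def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to move size_bytes over a link running at full, sustained speed."""
    seconds = size_bytes * 8 / link_bits_per_second  # 8 bits per byte
    return seconds / 86_400  # 86,400 seconds per day


PETABYTE = 10**15  # decimal petabyte, in bytes
catalog_bytes = 3 * PETABYTE  # the "more than 3 petabytes" cited above

# Hypothetical institutional link speeds, chosen only for illustration.
for label, rate in [("1 Gbps", 1e9), ("10 Gbps", 1e10)]:
    print(f"{label}: about {transfer_days(catalog_bytes, rate):.0f} days")
```

Under these assumptions, the estimate comes to roughly 278 days at 1 Gbps and 28 days at 10 Gbps, which is why analyzing the data where it already lives is so attractive.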

Screenshots of the three NCI Cloud Resources Data Portals respectively: the Cancer Genomics Cloud with text reading "Learn from cancer omics data Faster" and images of four circles with the text "The CGC History", "Access to TCGA Data", "Bring your Tools", "Bring your Private Data"; the Broad Institute’s FireCloud, text reading "Welcome to FireCloud. FireCloud is a NCI Cloud Resource project powered by Terra for biomedical researchers to access data, run analysis tools, and collaborate. Find how-to's, documentation, video tutorials, and discussion forums. Already a FireCloud User? Learn what's new. Learn more about the CRDC and other NCI Cloud Resources. View Workspaces, View Examples, Browse Data"; and the ISB Cancer Gateway in the Cloud with text reading "A Resource of the NCI CRDC ISB-CGC Cancer Gateway in the Cloud. Access, Explore and Analyze Large-Scale Cancer Data Through the Google Cloud. BigQuery Table Search, Cancer Data File Browser, Chromosomal Aberrations and Gene Fusions DB."

Screenshots of the three NCI Cloud Resources Data Portals respectively: the Cancer Genomics Cloud; the Broad Institute’s FireCloud; and the ISB Cancer Gateway in the Cloud.

Acting as the analytical "powerhouse" of the CRDC, the Cloud Resources enable researchers to analyze data in a variety of ways, from running compute-intensive automated workflows on massive amounts of data, to exploring, visualizing, and analyzing data with popular data science and bioinformatics applications, such as Jupyter Notebooks, RStudio, and Galaxy.

We invite you to take advantage of a treasure trove of community-contributed workflows and notebooks, or to bring your own workflows into the system and share code and tools with collaborators.

The Real Impact of NCI Cloud Resources

There is a lot more to say about the Cloud Resources than we can cover in a single blog post. Rather than post a laundry list of features, we will highlight the core strengths of each of the Cloud Resources in a series of additional blog posts. Each post will focus on a single platform and include real-world success stories of cancer researchers and computational scientists using the Cloud Resources.

In the meantime, please visit the Cloud Resources on the CRDC website for more details.

Deena Bleich
Bioinformatician, Institute for Systems Biology’s Cancer Gateway in the Cloud
Annie Kuan
Senior Project Manager, Data Sciences Platform, Broad Institute of MIT and Harvard
David Pot, Ph.D.
Co-Investigator, Institute for Systems Biology’s Cancer Gateway in the Cloud
Manisha Ray, Ph.D.
Scientific Program Manager, Seven Bridges
Sai Lakshmi Subramanian
Program Manager, Cancer Genomics Cloud, Seven Bridges
Geraldine Van der Auwera, Ph.D.
Director, Outreach and Communications, Data Sciences Platform, Broad Institute of MIT and Harvard

How could I say to you for best way I cannot found it; but I would say simply thank you very much; deeply!!
I will study harder from data and dater.
Thank you again to all.