Cancer Data Science Pulse
FY 2017: NCI Looks to the Clouds
As the FY 2017 Annual Plan and Budget Proposal describes, some of the key components of the infrastructure and programs that comprise the National Cancer Program include:
- 69 NCI-designated Cancer Centers across the country,
- NCI's National Clinical Trials Network,
- NCI Community Oncology Research Program,
- and the NCI-supported programs that are training today's cancer research workforce and the workforce of the future.
It also includes what has become an absolutely essential element of cancer research: a robust bioinformatics network.
Defining the Challenge
Generally speaking, biomedical informatics encompasses the information technology involved in collecting, managing, analyzing, and visualizing biomedical research data.
These data come from many different places and research settings, and can include outcomes data from clinical trials and results of imaging and next-generation sequencing studies. But many researchers simply can't use much of the data that is being produced. The data are stored in proprietary databases at individual institutions or, as in the case of The Cancer Genome Atlas (TCGA), there is so much data that researchers at many institutions lack the infrastructure and tools required to download, store, or analyze it.
The challenge, then, is how to expand access to these data to more researchers, to enable research that will provide new insights into cancer biology, develop better computational and experimental disease models, and improve the treatment of individual patients.
Big Plans for Using Big Data
Driven by this challenge, two of NCI's highest bioinformatics priorities are the Cancer Genomic Data Commons (GDC) and the Cancer Genomics Cloud Pilots. These are complementary programs that I believe will go a long way towards democratizing access to the data being produced by NCI-supported researchers working in the lab and the clinic. Underlying NCI's approach to these programs are the guiding principles of FAIR data publishing: valuable scientific data should be Findable, Accessible, Interoperable, and Reusable.
The GDC will be a data repository, housing data from TCGA and other NCI-supported efforts, including TARGET (a program similar to TCGA that focuses on pediatric cancers), and will enable data submissions from individual research groups. The GDC will also provide tools to query and download these data. The data in the GDC, which will launch mid-2016, will be harmonized and standards-based, providing consistency and facilitating interoperability. The GDC has developed a flexible data model that allows submitters of data to define up front what metadata is required to accompany their submissions. The GDC can ensure that the data is shareable and interoperable based on these definitions. In addition, the GDC uses the NCI Metathesaurus for data element definitions wherever possible.
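To make the idea of defining required metadata up front more concrete, here is a minimal sketch in Python. It is an illustration only: the entity types, field names, and the `missing_metadata` function are hypothetical and are not taken from the GDC's actual data model or tools.

```python
# Hypothetical illustration only: these entity types and field names are
# assumptions made for this sketch, not the GDC's actual data model.
REQUIRED_FIELDS = {
    "case": {"submitter_id", "project_id", "primary_site"},
    "sample": {"submitter_id", "case_id", "sample_type"},
}


def missing_metadata(entity_type, record):
    """Return a sorted list of required fields that are absent or empty."""
    required = REQUIRED_FIELDS[entity_type]
    return sorted(field for field in required if not record.get(field))


# A submission that omits one required field is flagged before it is accepted.
record = {"submitter_id": "case-001", "project_id": "EXAMPLE-PROJECT"}
print(missing_metadata("case", record))
```

Declaring the requirements as data, rather than burying them in code, is one way a repository could check every submission against the same definitions.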
The Cloud Pilots, which will be available early in 2016, will provide the tools/pipelines and compute capacity to analyze these data. Three organizations, selected from a broader group of highly competitive applicants, were awarded contracts in late 2014 to establish individual cloud infrastructures that will provide storage and analytics for the current set of TCGA data. Each group is collaborating with the Global Alliance for Genomics and Health (GA4GH) to define open APIs (Application Programming Interfaces) that will allow researchers to analyze TCGA data using computational tools that are incorporated into the cloud infrastructure. Researchers will also have the ability to deploy their own tools and algorithms directly into the clouds and establish workspaces where they can save their analyses and results. Each of the Cloud Pilots will have different tools and interfaces, and the research community will have the opportunity to evaluate their utility.
The Cloud Pilots will be a proving ground of sorts, allowing us to learn valuable lessons about the optimal design and functionality of a cloud infrastructure that supports access to and analysis of enormous amounts of data. This will be extremely important, because once the GDC is up and running, between TARGET and TCGA alone it will house almost 5 petabytes of data, the equivalent of 5 million gigabytes!
With much more data from various NCI-funded genomic projects yet to come, and plans to open up the GDC to the cancer research community to deposit their data, we expect the amount of data in the GDC to swell to the hundreds of petabytes.
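For a sense of scale, the conversion from petabytes to gigabytes can be worked out directly. The short sketch below assumes decimal (SI) units, where one gigabyte is 10^9 bytes and one petabyte is 10^15 bytes; the function name is just for illustration.

```python
# Assumes decimal (SI) units: 1 GB = 10**9 bytes, 1 PB = 10**15 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_PB = 10**15


def petabytes_to_gigabytes(petabytes):
    """Convert a whole number of petabytes to gigabytes."""
    return petabytes * BYTES_PER_PB // BYTES_PER_GB


print(petabytes_to_gigabytes(5))    # 5000000 (5 million gigabytes)
print(petabytes_to_gigabytes(100))  # 100000000 (100 million gigabytes)
```

In other words, each petabyte is a million gigabytes, so a repository holding hundreds of petabytes would store hundreds of millions of gigabytes.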
The priorities for biomedical informatics are clear: data needs to be stored in a manner that is accessible and reusable, and the infrastructure must support the diverse means of scientific inquiry the research community utilizes to make new discoveries. The NCI investment in biomedical informatics is one we believe will pay large dividends, leading to more rapid research advances and improvements in patient care.
Look for future blog posts to learn more!