Cancer Data Science Pulse
Towards a Cancer Research Data Commons
I recently joined NCI to help support strategic data sharing and informatics projects within the Center for Biomedical Informatics and Information Technology (CBIIT). Having worked on information management at another Institute for five years and on the trans-NIH Big Data to Knowledge (BD2K) initiative since its inception, I see this as an exciting opportunity to continue contributing to data science across the biomedical community. I have seen first-hand that the ability to access, analyze, and share research data is imperative if we are to accelerate progress in the prevention and treatment of diseases like cancer. Recent announcements of the Precision Medicine Initiative (PMI) and the Beau Biden Cancer Moonshot (Moonshot) further emphasized the importance of these goals and created additional drivers towards meeting them.
The Moonshot was announced in 2016 with the objective of accelerating cancer research and breaking down impediments to progress in the development of new treatments. The initiative specifically called out the need to enhance data access and facilitate collaborations among all stakeholders, from basic researchers to clinicians to patients. A Blue Ribbon Panel (BRP), formed to make recommendations for how this ambitious goal could be achieved, recommended development of an infrastructure to support the components of a National Cancer Data Ecosystem. This knowledge network would "collect, share, and interconnect a broad array of large datasets so that researchers, clinicians, and patients will be able to both contribute and analyze data, facilitating discovery that will ultimately improve patient care and outcomes." The Ecosystem will also support the goals of the PMI and the "All of Us" research initiative, which include responsible data sharing, access, and use, to develop individualized treatments for each patient and improve overall outcomes. NCI will play an important role in supporting development of such an ecosystem, providing components that allow for access to and sharing of consistent and harmonized cancer research data.
To that end, NCI has launched a series of initiatives to create an NCI Cancer Research Data Commons (CRDC), which will become a key contribution to the ecosystem described by the BRP. The Cancer Genomics Cloud Pilots, which are transitioning into ongoing NCI Cloud Resources, enable access to and analysis of large-scale cancer genomic data. The Cloud Resources have been the subject of several posts on this blog (listed below), and are one component of the CRDC. In parallel, the Genomic Data Commons (GDC) was launched last year, providing secure access to harmonized genomic data from NCI-sponsored programs. NCI is leveraging investment in both of these programs to create a Data Commons Framework, comprising the key functionality of the Cloud Resources and the secure data access of the GDC. The Framework will provide components required to stand up and maintain a Data Commons "node" - one branch of a Commons that stores a set of related data - including:
- secure user authentication and authorization
- metadata validation tools
- an approach allowing development of consistent domain-specific data models
- an API and container environment for tools and pipelines
- access to elastic compute resources
- workspaces for storing data, tools, and results, and for collaboration among researchers.
Data will be mirrored in multiple commercial clouds, for redundancy, stability, and scalability. Additionally, the Cloud Resources will continue to provide environments to analyze these data using hosted or user-provided analysis tools and pipelines. The vision of the NCI Cancer Research Data Commons is one that contains multiple nodes, with researchers, tool developers, clinicians, and patients contributing and accessing tools and data.
Development of the NCI Cancer Research Data Commons presents several challenges that we'll need to be mindful of if we are to be successful.
First, the scale of what we are trying to do is big. Cancer research is vast and complex, the variety of tools and approaches for analyzing the data is wide-ranging, and the amount of biomedical data being generated is staggering. Add to that the lack of consistency and standardization of the data, and it would be easy to bite off more than we can chew, or get lost in the details. Our approach is to focus on leveraging what we have already developed in the GDC and the Cloud Pilots, and incrementally add data, tools, and infrastructure to solidify those into a platform for moving forward.
Second, the variety of data types presents a similar issue - what kind of data should we make available first, and where should it be located? With that in mind, NCI is working on several new Commons nodes, modeled on the GDC and using the new Framework - specifically, Imaging, Proteomics, and Immuno-oncology. Each of these data types is at a different level of maturity in terms of data standards, which provides an excellent opportunity to apply the Framework to new domains and test its effectiveness so we can iterate for future nodes.
All of us at NCI are cognizant of the many different activities in the cancer research community which are related to our concept of a Data Commons. The Framework represents one component of a larger data ecosystem and therefore needs to be flexible, open, and "plug and play" as much as possible, so that usability and interoperability emerge as key features of the infrastructure. NCI will be holding workshops with the biomedical community in the near future to engage in a dialog around planning for a Data Commons and to move toward collaborative solutions. If you have thoughts and opinions to share now, I welcome your feedback at the link below. One thing has become abundantly clear in the world of cancer research - the ability to share data readily across institutional and domain boundaries is an absolute necessity. I am certain that as we come together as a community, we will progress towards making that a reality. The Cancer Research Data Commons is one step in that direction.