Cancer Data Science Pulse
Introducing the Data Commons Framework
The Center for Biomedical Informatics and Information Technology Communications Team interviewed Dr. Robert L. Grossman of the University of Chicago Center for Data Intensive Science about the Data Commons Framework, a component of the NCI Cancer Research Data Commons.
National Cancer Institute Center for Biomedical Informatics and Information Technology: What is the Data Commons Framework and how is it related to the Cancer Research Data Commons?
Dr. Robert L. Grossman: The Data Commons Framework (DCF) is a set of software services that makes it easier to develop and operate the data commons, data clouds, knowledge bases, and other resources for managing, analyzing, and sharing cancer research data that make up the Cancer Research Data Commons (CRDC), and to make those resources interoperate.
One of the recommendations of the Cancer Moonshot Blue Ribbon Panel was to "build a national cancer data ecosystem" so that all participants "across the cancer research and care continuum" could "contribute, access, combine and analyze diverse data." The CRDC is an important first step towards a cancer data ecosystem.
Two important components of the CRDC are the NCI Genomic Data Commons (GDC), which distributes over 2.5 petabytes (PB) of cancer genomics data, and the NCI Cloud Resources, which provide cloud-based computational resources to analyze data from the GDC and other repositories. The GDC and Cloud Resources were both launched in 2016, and the DCF is based upon the experiences of the developers and users of these projects over the past two years.
There are three Cloud Resources, each with an active community of users: Broad's FireCloud, Seven Bridges' Cancer Genomics Cloud, and the Institute for Systems Biology's Cancer Genomics Cloud. The first use of the DCF is to integrate the Cloud Resources and GDC so that the Cloud Resources always have the latest version of the GDC data and share a common authentication and authorization framework.
In the future, the CRDC will also support resources (called CRDC "nodes") from different domains, such as proteomics and imaging. The DCF will enable the GDC, Cloud Resources, and future CRDC resources to all interoperate.
NCI CBIIT: What kinds of services does the DCF provide?
RLG: DCF services will be made available in stages. Currently, authentication (AuthN) and authorization (AuthZ) services are available, as are services for assigning data objects globally unique IDs (GUIDs) and for storing and accessing data objects in private and public clouds based upon their IDs. Over the next year, services for metadata validation and for working with domain-specific, extensible data models will also become available, as will APIs for executing workflows and workspaces to support collaborative projects. Together, the DCF services support making data Findable, Accessible, Interoperable, and Reusable (FAIR).
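To make the idea of GUID-based storage and access concrete, here is a minimal sketch of an object index that assigns each data object a GUID and resolves that GUID to one or more cloud locations. It is an illustration only: the `ObjectIndex` class, its method names, and the bucket URLs are hypothetical and are not the DCF's actual API.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ObjectRecord:
    """Index entry for one data object: a GUID plus every place it is stored."""
    guid: str
    size: int
    locations: list[str] = field(default_factory=list)


class ObjectIndex:
    """Toy in-memory stand-in for a GUID service (hypothetical, not the DCF API)."""

    def __init__(self) -> None:
        self._records: dict[str, ObjectRecord] = {}

    def register(self, size: int, locations: list[str]) -> str:
        """Assign a new GUID to a data object and remember where it lives."""
        guid = str(uuid.uuid4())
        self._records[guid] = ObjectRecord(guid, size, list(locations))
        return guid

    def add_location(self, guid: str, url: str) -> None:
        """Record an additional copy of the same object in another cloud."""
        self._records[guid].locations.append(url)

    def resolve(self, guid: str, scheme: Optional[str] = None) -> list[str]:
        """Return all known locations, optionally limited to one URL scheme."""
        urls = self._records[guid].locations
        if scheme is None:
            return list(urls)
        return [url for url in urls if url.startswith(scheme + "://")]
```

A caller would register an object once, add copies as they are replicated, and then ask for whichever copy is closest to its compute, for example `index.resolve(guid, "gs")`. Because users refer to the object only by its GUID, the copies can move between clouds without breaking references.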
NCI CBIIT: What powers the DCF?
RLG: The DCF is built using the Gen3 platform that is being developed by the Center for Data Intensive Science at the University of Chicago. The Gen3 data platform organizes data into projects and divides project data into two types: project data objects and project core data. Data objects are assigned GUIDs, can be stored in one or more private and public clouds, and are accessed using Gen3 DCF services. Project core data can be structured with data models and enriched with controlled vocabularies and ontologies. Gen3 includes AuthN and AuthZ services so that controlled-access data can be included in nodes and so that collaborative team science can be supported. Gen3 DCF services also include the ability to define extensible data models, import data using the data model, and query data against the data model. Currently, the Gen3 authentication, authorization, and digital ID services are integrated into the DCF that supports the CRDC. Next year, Gen3 services for working with extensible data models, ontologies, workspaces, and pipeline execution will also be integrated into the DCF.
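The define, import, and query cycle for project core data can be sketched in a few lines. The example below is a simplified illustration, not Gen3 code: the model layout, the `case` and `sample` node types, the field names, and the vocabulary values are all assumptions made up for this sketch.

```python
from __future__ import annotations

# Hypothetical data model: node type -> required fields and controlled vocabularies.
DATA_MODEL = {
    "case": {
        "required": {"submitter_id", "primary_site"},
        "enums": {"primary_site": {"Breast", "Lung", "Colon"}},
    },
}


def extend_model(model, node_type, required, enums=None):
    """Return a copy of the model with one more node type (the extensible part)."""
    extended = dict(model)
    extended[node_type] = {"required": set(required), "enums": dict(enums or {})}
    return extended


def validate(model, node_type, record):
    """Return a list of problems with the record; an empty list means it is valid."""
    if node_type not in model:
        return [f"unknown node type: {node_type}"]
    spec = model[node_type]
    missing = sorted(spec["required"] - set(record))
    errors = [f"missing field: {name}" for name in missing]
    for name, allowed in spec["enums"].items():
        if name in record and record[name] not in allowed:
            errors.append(f"{name}={record[name]!r} is not in the controlled vocabulary")
    return errors


class CoreDataStore:
    """Toy store that only accepts records that conform to the data model."""

    def __init__(self, model):
        self.model = model
        self.records = []

    def import_record(self, node_type, record):
        """Validate against the model, then store; reject invalid records."""
        errors = validate(self.model, node_type, record)
        if errors:
            raise ValueError("; ".join(errors))
        self.records.append((node_type, dict(record)))

    def query(self, node_type, **filters):
        """Return stored records of one node type whose fields equal the filters."""
        return [
            record
            for stored_type, record in self.records
            if stored_type == node_type
            and all(record.get(key) == value for key, value in filters.items())
        ]
```

Validating on import is what lets later queries rely on the model: every stored `case` is known to have a `primary_site` drawn from the controlled vocabulary, so a query such as `store.query("case", primary_site="Lung")` does not need to handle free-text variants.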
NCI CBIIT: How do researchers or developers use the DCF?
RLG: There are several ways to start using the DCF. First, if a CRDC resource exposes an API, researchers and developers can use the DCF authentication and authorization services together with the data commons API to build an application that is powered by data from the data commons. This is possible with the GDC today and will be possible with other data commons as they are added to the CRDC. In particular, it becomes possible to develop applications that access data from multiple CRDC nodes. Second, researchers and developers who are building a CRDC node or resource can use the DCF services to simplify its development and to integrate it with the CRDC.
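As a rough sketch of the first pattern, the code below builds authenticated requests against the APIs of several nodes using a single access token. Everything specific here is an assumption for illustration: the `/files` path, the `project` and `size` query parameters, the bearer-token header, and the example base URLs are hypothetical and do not describe the actual GDC or DCF interfaces.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_query_request(base_url, access_token, project_id, size=10):
    """Build (but do not send) an authenticated GET request for one node.

    The endpoint path, parameter names, and bearer-token scheme are assumptions.
    """
    params = urlencode({"project": project_id, "size": size})
    return Request(
        f"{base_url.rstrip('/')}/files?{params}",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
    )


def fetch_from_nodes(base_urls, access_token, project_id):
    """Send the same query to several nodes and collect the JSON replies by node."""
    results = {}
    for base_url in base_urls:
        request = build_query_request(base_url, access_token, project_id)
        with urlopen(request, timeout=30) as response:
            results[base_url] = json.load(response)
    return results
```

The point of the sketch is the shape of the application rather than the details: because the nodes share a common authentication and authorization framework, one token obtained once can be reused across every node the application queries.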
NCI CBIIT: Tell us a little about the software architecture of the DCF.
RLG: The DCF is designed to provide core services for what is sometimes called a narrow middle architecture for a data commons or data ecosystem. (See Figure 2.) The idea with this approach, which is the same end-to-end design principle used to build the internet, is to fix and standardize a small number of core services (the DCF core services), while supporting different approaches for importing, curating, and integrating data (one "end" of the system) and for exploring, analyzing, and sharing data (the other "end" of the system). With this approach, there can be innovation on both ends of the system without changing the core services, making it easier to evolve the system over time. For more about narrow middle architectures, see my blog post, "Progress Toward Cancer Data Ecosystems."
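The narrow middle idea can be illustrated with a small sketch in which the core interface stays fixed while the code on either end varies freely. This is a toy model under stated assumptions, not the DCF design: the two-method `CoreServices` interface and all function names are hypothetical.

```python
from typing import Dict, Iterable, List, Protocol


class CoreServices(Protocol):
    """The fixed narrow middle: a deliberately small, stable interface."""

    def put(self, guid: str, payload: bytes) -> None: ...

    def get(self, guid: str) -> bytes: ...


class InMemoryCore:
    """One possible implementation of the core; both ends see only put and get."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, guid: str, payload: bytes) -> None:
        self._objects[guid] = payload

    def get(self, guid: str) -> bytes:
        return self._objects[guid]


# One end: different ways to import data, all writing through the same core.
def import_lines(core: CoreServices, prefix: str, text: str) -> List[str]:
    """Store each line of a text block as its own object."""
    guids = []
    for number, line in enumerate(text.splitlines()):
        guid = f"{prefix}-{number}"
        core.put(guid, line.encode("utf-8"))
        guids.append(guid)
    return guids


def import_mapping(core: CoreServices, items: Dict[str, str]) -> List[str]:
    """Store objects whose identifiers were assigned elsewhere."""
    for guid, value in items.items():
        core.put(guid, value.encode("utf-8"))
    return list(items)


# The other end: different analyses, all reading through the same core.
def total_bytes(core: CoreServices, guids: Iterable[str]) -> int:
    return sum(len(core.get(guid)) for guid in guids)


def longest(core: CoreServices, guids: Iterable[str]) -> str:
    return max(guids, key=lambda guid: len(core.get(guid)))
```

New importers or new analyses can be added without touching `CoreServices`, and the core implementation can be replaced without touching either end, which is the property that makes the system easier to evolve.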
NCI CBIIT: What are some of the other initiatives that are building data commons and how will the CRDC interact with them?
RLG: Many groups are now building data commons. The National Institutes of Health (NIH) as a whole has a project called the Data Commons Pilot Phase Consortium. Its teams are working together to build data commons for the Genotype-Tissue Expression (GTEx) and Trans-Omics for Precision Medicine (TOPMed) datasets and for the model organism databases, as well as services defined by seven key capabilities that should enable data commons to interoperate. In addition, some Institutes are building their own data commons or data commons pilots, including the National Heart, Lung, and Blood Institute (NHLBI).
As part of the Data Commons Pilot Phase Consortium, our team will help identify ways to make the DCF services compatible with their counterparts in these other commons, so that the commons can interoperate. Over time, as a consensus emerges for services around data models and the associated semantic services, services for exploring and analyzing clinical and other structured data across multiple commons will also emerge. This is currently an area of active research and development.
For more information on the NCI Cancer Research Data Commons, read Dr. Allen Dearry's blog post, "Towards a Cancer Research Data Commons."