Cancer Data Science Pulse
Breaking Down Barriers to Sharing Cancer Data—The NIH Generalist Repository Ecosystem Initiative
Advances in biomedical science hinge on the sharing of scientific results, materials, and methods—both to disseminate new findings and to provide materials to other scientists so they can build on these important works. As technology has progressed, so too has our ability to generate data. Likewise, software and computational workflows for data analysis are being developed faster than ever before, increasing our abilities and opportunities for sharing far more than just specific results.
Traditionally, most data-sharing and archiving platforms have been domain-specific, created to fulfill a specific need. Although relatively new to the cancer research data ecosystem, NCI’s Cancer Research Data Commons (CRDC) is one example of a cloud-based infrastructure that includes domain-specific repositories and knowledgebases with a wide variety of cancer-related data, including genomics, proteomics, imaging, comparative oncology, and more. Through these data resources and more than a thousand tools and workflows, the CRDC is committed to ensuring greater data interoperability and accessibility.
Expanding the NCI Data Ecosystem
Situations like these have led to the creation of “generalist repositories” (GRs). GRs store and preserve a wide variety of data types and research outputs and usually accept data regardless of the type, format, content, or disciplinary focus. These repositories include sizeable amounts of cancer data, supported by NCI as well as a broad range of other funders. These data offer tremendous value to the cancer research community, but to date, these GRs haven’t been an integral part of NIH’s data repository ecosystem.
Supporting a seamless data ecosystem is an important goal for NIH, as defined in the Strategic Plan for Data Science. Such an ecosystem will help ensure that data and other digital objects resulting from NIH-funded research can be easily stored and shared with the research community. Toward this end, and to learn more about GRs and how they fit into the broader data-sharing landscape, NIH launched a pilot project in July 2019. In early 2020, NIH organized a community workshop focusing on the role of GRs in enhancing data discoverability and reuse. In addition, NIH conducted an independent assessment of the GR landscape to determine a place for GRs within the NIH data ecosystem.
At present, NIH is working with six GRs to develop new approaches for finding, accessing, and sharing digital assets. By developing collaborative approaches to data management and sharing, these GRs will be eligible for inclusion in the NIH data ecosystem. The six repositories are as follows:
GRs are expected to implement a common set of cohesive and consistent capabilities that comply with NIH’s desirable repository characteristics (see “Supplemental information notice to the NIH Policy for Data Management and Sharing”). Those capabilities include providing metrics and social infrastructure as well as conducting outreach and training for the research community. A secondary aim is to raise general awareness of the FAIR (Findable, Accessible, Interoperable, Reusable) principles and to help researchers build the skills needed to adopt and implement them.
Changing the Culture
The Generalist Repository Ecosystem Initiative (GREI) is one of several efforts aimed at improving data discoverability and reuse. The new NIH Data Management and Sharing Policy, published in October 2020, will go into effect on January 25, 2023. This new policy underscores the importance of making effective data management and sharing practices a routine part of scientific discovery. It applies to all NIH-funded research, regardless of the requested budget amounts or funding mechanisms. One of the key aspects of writing a good data management and sharing plan pertains to the selection and use of established data repositories with desirable characteristics.
In that respect, NCI is continuing to build on the CRDC and the cancer data ecosystem as a whole to promote broad and equitable data sharing throughout the cancer research community. The CRDC not only houses NCI’s high-value data, but also includes infrastructure resources such as the Cancer Data Aggregator, which allows researchers to query across the CRDC for integrative analysis. In addition, the Data Standards Services is working to ensure interoperability across the CRDC to promote new and better ways of combining and analyzing multi-modal cancer data. We want to truly democratize access to cancer data to further accelerate research, not only within NCI, but throughout the scientific community.
It’s vital that we continue our dialog if we are to truly change the culture of data management and sharing to accelerate data-driven research.