Cancer Data Science Pulse

Breaking Down Barriers to Sharing Cancer Data—The NIH Generalist Repository Ecosystem Initiative

Advances in biomedical science hinge on the sharing of scientific results, materials, and methods—both to disseminate new findings and to provide materials to other scientists so they can build on these important works. As technology has progressed, so too has our ability to generate data. Likewise, software and computational workflows for data analysis are being developed faster than ever before, increasing our abilities and opportunities for sharing far more than just specific results.

Traditionally, most data sharing and archiving platforms have been domain-specific, created to fulfill a specific need. Although relatively new to the cancer research data ecosystem, NCI’s Cancer Research Data Commons (CRDC) is one example of a cloud-based infrastructure that includes domain-specific repositories and knowledgebases with a wide variety of cancer-related data, including genomics, proteomics, imaging, comparative oncology, and more. Through these data resources and more than a thousand tools and workflows, the CRDC is committed to ensuring greater data interoperability and accessibility.

Expanding the NCI Data Ecosystem

In an ideal world, most if not all NCI-funded cancer research data would be housed in NCI’s CRDC and, thus, would follow established FAIR data standards (i.e., data that are Findable, Accessible, Interoperable, and Re-usable) to further streamline research.
But what happens to cancer data that aren’t part of a domain-specific home like CRDC? There may be data sets that don’t fit neatly within a single domain or that need to be deposited in a repository designated by a specific publication or funder.
Situations like these have led to the creation of “generalist repositories” (GRs). GRs store and preserve a wide variety of data types and research outputs and usually accept data regardless of the type, format, content, or disciplinary focus. These repositories include sizeable amounts of cancer data, supported by NCI as well as a broad range of other funders. These data offer tremendous value to the cancer research community, but to date, these GRs haven’t been an integral part of NIH’s data repository ecosystem.
Supporting a seamless data ecosystem is an important goal for NIH, as defined in the Strategic Plan for Data Science. Such an ecosystem will help ensure that data and other digital objects resulting from NIH-funded research can be easily stored and shared with the research community. Toward this end, and to learn more about GRs and how they fit into the broader data-sharing landscape, NIH launched a pilot project in July 2019. In early 2020, NIH organized a community workshop focusing on the role of GRs in enhancing data discoverability and reuse. In addition, NIH conducted an independent assessment of the GR landscape to determine a place for GRs within the NIH data ecosystem.

All these activities paved the way for exploring specific GR features and how these resources are being used by the NIH community, serving as a catalyst for broader discussions on the role of GRs as a whole. Outcomes from these activities (summarized in a National Library of Medicine blog post) led NIH to develop the Generalist Repository Ecosystem Initiative (GREI). The GREI is intended to supplement NIH’s domain-specific data repositories, especially when researchers are unable to find a specialized repository applicable to their research object.

A key GREI goal is to incentivize established GRs to work together. By fostering “co-opetition” among GRs, NIH hopes to give NIH-funded researchers more opportunities to use one or more GRs to share data that adhere to FAIR principles, thus promoting discoverability of data that are robust, reproducible, and re-usable.
At present, NIH is working with six GRs to develop new approaches for finding, accessing, and sharing digital assets. By developing collaborative approaches to data management and sharing, these GRs will be eligible for inclusion in the NIH data ecosystem. The six repositories are as follows:

GRs are expected to implement a common set of cohesive and consistent capabilities that comply with NIH’s desirable repository characteristics (see Supplemental information notice to the NIH Policy for Data Management and Sharing). Those associated capabilities include providing metrics and social infrastructure as well as conducting outreach and training for the research community. A secondary aim is to raise general awareness of FAIR principles and to help researchers build the skills to adopt and implement them.

What does this mean for cancer researchers? In short, we want to make it easier for researchers to share their data, use others’ data, and avoid data duplication. Ultimately, the intended outcomes include improving reproducibility and the discoverability of NIH-funded data, and driving new scientific discovery and innovation by facilitating the re-use of all NIH-funded research data.

Changing the Culture

GREI is one of several efforts aimed at improving data discoverability and re-use. The new NIH Data Management and Sharing Policy, published in October 2020, will go into effect on January 25, 2023. This new policy underscores the importance of making effective data management and sharing practices a routine part of scientific discovery. It applies to all NIH-funded research, regardless of the requested budget amounts or funding mechanisms. One of the key aspects of writing a good data management and sharing plan pertains to the selection and use of established data repositories with desirable characteristics.

In that respect, NCI is continuing to build on the CRDC and the cancer data ecosystem as a whole to promote broad and equitable data sharing throughout the cancer research community. The CRDC not only houses NCI’s high-value data, but also includes infrastructure resources such as the Cancer Data Aggregator, which allows researchers to query across the CRDC for integrative analysis. In addition, the Data Standards Services is working to ensure interoperability across the CRDC to promote new and better ways of combining and analyzing multi-modal cancer data. We want to truly democratize access to cancer data to further accelerate research, not only within NCI, but throughout the scientific community.

Ultimately, we hope to change the culture around data management and sharing in the research community. This won’t be easy and will take time, as noted in a recent 2-day National Academies of Sciences, Engineering, and Medicine (NASEM) workshop. (The NASEM archived video is now available.) Following FAIR principles is one component, but data stewardship, patient privacy, data security, costs, and limited resources are all key concerns. Undoubtedly, many challenges are ahead as we move to a more equitable and accessible global data ecosystem. Overcoming these challenges will maximize the benefits of biomedical research and accelerate data-driven cancer discovery.

GREI is an important step in expanding the data ecosystem, opening additional venues for data sharing. The conceptualization of this initiative is a direct result of input from the research community and other stakeholders. Undoubtedly, in time, more repositories will be added to the existing ecosystem. (Indeed, two trans-NIH funding opportunities address this ever-growing need for biomedical repositories and their knowledgebases.)
It’s vital that we continue our dialog if we are to truly change the culture of data management and sharing to accelerate data-driven research.

Join us for the GREI Collaborative Webinar Series on Data Sharing in Generalist Repositories, a series of presentations and panel discussions about available repository resources and best practices for sharing NIH-funded research.
 
Chief, Scientific Policy and Program Branch, NCI Office of Data Sharing
Ishwar Chandramouliswaran, M.S., M.B.A.
Program Director, NIH Office of Data Science Strategy