Sharing Data: The Basics
What is Data Sharing?
NIH expects you to make scientific data as widely and freely available as possible to facilitate reuse, while safeguarding patient privacy and protecting confidential and proprietary data. Data sharing holds immense value for scientific research, and it can advance your career by bringing recognition and credit for your work.
Why is Data Sharing Important for Cancer Research?
Sharing is particularly important for unique data that cannot be readily replicated or are difficult to generate. NCI sees data sharing as vital for scientific progress, aiding in research validity and data accessibility, promoting data combination and reuse, and ultimately accelerating the pace of biomedical discoveries.
What Do I Need to Know?
Fundamental Tips for Effective Data Management and Sharing
Here are some tips for practicing good data management so that your data are ready to share efficiently when the time comes:
- Organize your data so they can be readily accessed by you, your colleagues, and anyone who may need to use your data in your absence.
- Document the data type and format used when generating data. Be aware of data types and formats relevant to your research area.
- Example data types generated in cancer research include genomics and other omic data; imaging data; epidemiology/population-related data; pre-clinical data; biochemical data; immunological data; and clinical data.
- Save data in a standardized format. Each data type may have different file formats. The NCI Genomic Data Commons (GDC) provides a list of file formats and templates for molecular characterization data types. When possible, save files in a non-proprietary format (such as .txt or .jpeg) rather than a proprietary one (such as .xls or .doc). This gives those who use your data flexibility, because they can work with the data independently of any particular software platform. For information about data formats and standards, consult resources at your institution, such as the library, other investigators, and shared core facilities, or consult external resources, including the GDC, scientific journals, other data repositories, and international standards bodies.
- Create informative file names so data users can understand the content and data type. File names should be specific enough that they do not clash with future or unrelated files. Avoid spaces and special characters.
- Store your data in a safe and secure location, like a server or backup system. Your institution may offer free access to commercial cloud-based backup systems. Avoid using flash drives or desktops/laptops, as these are not easily shareable and can be damaged or lost.
- Record your metadata in a timely manner so anyone interpreting the data can reuse and re-analyze your data with ease. Metadata can include experimental methods or procedures, data labels, variable definitions, and any other information necessary to understand and reproduce the conditions in which your data were generated.
- Plan ahead! Considering data management from the start makes it easier to sustain your efforts throughout the life of the research project.
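As a rough illustration of the file-naming and metadata tips above, here is a minimal Python sketch. The function names (make_file_name, write_metadata), the name pattern (project, data type, sample, date), and the example metadata fields are all hypothetical choices made for this example, not an NIH or NCI requirement. Adapt them to the conventions of your lab and repository.

```python
import json
import re
from datetime import date
from pathlib import Path


def make_file_name(project: str, data_type: str, sample_id: str,
                   collected: date, extension: str) -> str:
    """Build a descriptive file name with no spaces or special characters.

    The part order (project, data type, sample, date) is an assumed
    convention for this sketch only.
    """
    parts = [project, data_type, sample_id, collected.strftime("%Y%m%d")]
    # Replace each run of characters other than letters and digits with one
    # hyphen, then trim hyphens left at either end of a part.
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "-", p.strip()).strip("-") for p in parts]
    return "_".join(cleaned) + "." + extension.lstrip(".").lower()


def write_metadata(data_file: Path, metadata: dict) -> Path:
    """Save metadata as a plain-text JSON file next to the data file."""
    sidecar = data_file.with_name(data_file.stem + "_metadata.json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar


if __name__ == "__main__":
    name = make_file_name("Lung Study", "RNA-seq", "sample 07",
                          date(2024, 3, 5), ".TXT")
    print(name)  # Lung-Study_RNA-seq_sample-07_20240305.txt
```

Calling write_metadata(Path(name), {...}) with your experimental methods, data labels, and variable definitions would place a JSON file with a matching name beside the data file, so the two travel together when you share them.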
Data Sharing Expectations
Expectations around data sharing have evolved, and the culture is moving toward broad sharing of scientific data generated by research activities as the standard. You may have already observed or been involved in sharing data among individual collaborators or within large collaborative groups or consortia. However, sharing broadly with the larger scientific community and the public is important because it ensures the maximum benefit for all involved.
In short, keep these definitions in mind:
- Collaborator Sharing: Sharing between investigators, either upon publication or by request to an author. This benefits only the individuals involved.
- Consortium Sharing: Sharing within large collaborative groups (e.g., collaborative networks/programs). This benefits only a focused group.
- Broad Sharing: Sharing with larger research communities, institutions, and the broader public. This helps the community and ensures fair and equitable data access.
Data Sharing In Practice
What data do I have to share?
You must share all scientific data necessary to reproduce your findings, which can include:
- primary data sets (i.e., generated by original work),
- secondary data sets (i.e., generated by re-use of primary data sets),
- qualitative data (e.g., from social and behavioral data sets), and
- data from fundamental basic science techniques to validate and replicate research findings (e.g., western blots, electrophoresis gels, flow cytometry).
For a list of examples that are not considered scientific data by NIH, see “Research Covered Under the Data Management & Sharing Policy.”
Your funding opportunities may have additional expectations for what and how data should be shared.
How do I share?
You should share your data in a public and accessible repository. For certain programs and data types, NIH/NCI policy may specify designated data repositories for use.
Here's what you should consider when selecting a repository:
- Data Type: You should select the repository that is most appropriate for your data type and discipline. If your data set or project includes multiple data types or includes a data type not accepted by data type-specific repositories, you can submit to generalist repositories.
- Data Security: When sharing your data in a public repository, consider factors such as protecting and assuring the confidentiality and privacy of all participants, as well as the size and complexity of the data set.
- Data Access: The two general categories of data shared in repositories are:
- Public access data—Data made publicly available to everyone without access restrictions. NIH examples include Gene Expression Omnibus and GenBank.
- Controlled access data—Data made available for secondary research only after investigators have obtained approval to use the requested data for a particular project. Access to controlled data in the Database of Genotypes and Phenotypes will be granted by an NIH Data Access Committee. Consult this instructional video and tips document to see how to make a request.
- FAIR Data Standards: Share your data in public repositories that adhere to the FAIR (Findable, Accessible, Interoperable, Reusable) data principles. Some repositories provide a unique persistent digital identifier for a submitted data set, such as a DOI, so others can easily find your data set.
- Data Preservation and Availability: You should consider relevant requirements and expectations (e.g., repository, award, journal, and institutional requirements) as guidance for how long scientific data must be preserved and made available. Please keep all of these factors in mind when selecting a repository to store your data and make them accessible for others to use.
When do I share?
You should share your data as soon as reasonably possible!
However, you will need to coordinate with your principal investigator, institution, and the NIH program officer who oversees your grant funding.
For example, the Data Management and Sharing policy states that scientific data should be shared by the earlier of two points in your research:
- when you publish, or
- when your funding ends (specifically, the funding that supported data generation for your research project).
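To make the "earlier of two points" rule concrete, here is a small Python sketch. The function name sharing_deadline and the example dates are hypothetical, and the logic is a simplification: your actual timeline depends on the applicable policies and on coordination with your principal investigator, institution, and program officer.

```python
from datetime import date
from typing import Optional


def sharing_deadline(publication: Optional[date],
                     funding_end: Optional[date]) -> Optional[date]:
    """Return the earlier of the two dates, ignoring any that are unknown.

    Returns None when neither date is known yet.
    """
    known = [d for d in (publication, funding_end) if d is not None]
    return min(known) if known else None


if __name__ == "__main__":
    # Publication comes before the end of funding, so it sets the date.
    print(sharing_deadline(date(2025, 6, 1), date(2026, 3, 31)))  # 2025-06-01
    # No publication planned yet, so the end of funding sets the date.
    print(sharing_deadline(None, date(2026, 3, 31)))  # 2026-03-31
```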
Review the data sharing policies that might impact your timeline.
NCI Data Sharing Resources and Initiatives
Now that you have a sense of the basics, use the following resources to discover more about the topic and understand NCI’s investment in this stage of the data science lifecycle.
Resources and Tools
- National Cancer Plan: Discover how maximizing data utility is one of the eight goals of NCI’s comprehensive framework. Data sharing is central to NCI’s mission to lead, conduct, and support cancer research nationwide to advance scientific knowledge and improve lives.
- Data Sharing: Check out this section of our website for sharing-oriented information on policies, genomic data preparation, and more!
- NCI Bioinformatics Training and Education Program Seminar: Watch a recording and learn how to keep your data FAIR.
Blogs
- Breaking Down Barriers to Sharing Cancer Data—The NIH Generalist Repository Ecosystem Initiative: Discover how NIH is working to make generalist repositories (GRs) part of the data sharing ecosystem. The goal is to minimize sharing barriers while still taking advantage of GR convenience and usability.
- Data Sharing Advocacy—How a Cancer Survivor Seeks to Enhance Data Sharing to Better the Patient Experience: Read this personal testimony from Mr. Steve Friedman—a cancer survivor and NCI employee—who has witnessed firsthand the power of data science and sharing tools.
- Semantics Primer: Get the basics on cancer research “semantic” terminology, its influence on data interoperability, and why that’s so critical.
- Semantics Series—A Deep Dive Into Common Data Elements: Continue to learn about semantic terminology. In this blog, you’ll learn what a CDE is and why researchers need them.
- Your Guide to the 2023 NIH Data Management and Sharing Policy: Read it if you’re an NCI-funded investigator. You’ll learn what has changed from the 2003 policy, what you need to do, and where you can find help.
Projects
Projects and programs that strive for efficient and effective data sharing include:
- NCI’s ITCR Program, which offers training courses via the ITCR Training Network. The course “Ethical Data Handling for Cancer Research” has tips on data privacy, security, sharing, and ethics.
Additional Data Sharing Resources
- Visit the NIH Scientific Data Sharing Website for a full list of NIH-supported repositories.
- Watch training modules on enhancing data reproducibility from NIH.
- Applying for NIH funding? See what you need to include in your DMS Plan.
- Attend the NIH Data Sharing and Reuse Seminar Series to see how other researchers are finding ways to reuse their data or generate new findings from other data sets.
- Explore the NIH Common Data Elements (CDEs) Repository where you can browse NIH-endorsed CDEs and Forms for standardizing your data. Live and on-demand trainings on CDEs and how to search this repository are also available.
- Read this blog on health data standards from NIH’s National Library of Medicine.
- Ready to start your project? Get an overview of the data science lifecycle and what you should do in each stage.
- Want to learn the basic skills for cancer data science? Check out our basic skills video course.
- Need answers to data science questions? Visit our Training Guide Library.