Generating and Collecting Data: The Basics

What is Data Generation and Collection, and What’s the Difference Between the Two?

Generating data means producing data: your research is underway, and you’re actively producing data to address your research question. Collecting data means putting the data you’ve gathered into a format that allows you to analyze it and share your findings with others. (To learn more about these later stages of the data science lifecycle, refer to the “Learn About Cancer Data Science” webpage.)

Why is Generating and Collecting Data Important for Cancer Research?

In cancer research, we use data to look for new and better ways of diagnosing, preventing, treating, and tracking cancer. Researchers around the world are generating and collecting petabytes of genomic, proteomic, imaging, clinical, epidemiological, and other data.

Cancer Data Registries

It’s important to know that, by law, facilities collecting data on new cancer cases need to report those cases to a central cancer registry, such as a state registry. When submitting to a cancer registry, you must meet specific requirements to capture important cancer data for every case. Your state cancer data registry will require information that might include histology findings, primary tumor site, and more. Registries also require data to be reported within a certain timeframe, such as within 95 days of diagnosis.

Cancer registry data collection generally includes the following:

Reporting sources: Various sources—including hospitals, pathology laboratories, radiation centers, outpatient facilities, and physicians’ offices—report cancer cases to the registry.
Collecting data elements: Registries collect a standardized set of data elements, such as patient demographics (e.g., age, sex, race/ethnicity), tumor information (e.g., site, stage, grade, subtypes of cancer), treatment details (e.g., surgery, chemotherapy, radiation therapy), and follow-up data (e.g., survival).
Abstracting information: Trained registrars review patients’ medical records and abstract relevant information into the registry database. To ensure consistency, the registrars use coding systems, such as the International Classification of Diseases for Oncology and the Surveillance, Epidemiology, and End Results (SEER) Program Coding and Staging Manual.
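As a rough illustration of the reporting requirements described above, the sketch below models a case record with a few of the data elements mentioned and checks which required elements are still empty and when the reporting window closes. The class name, field names, and the choice of required elements are hypothetical, and real registries define many more elements and their own rules. The 95-day window is the example figure quoted in the text, not a universal requirement.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class CaseReport:
    """Hypothetical, heavily simplified case record (illustration only)."""
    patient_id: str
    diagnosis_date: date
    primary_site: Optional[str] = None
    histology: Optional[str] = None


def missing_elements(report, required=("primary_site", "histology")):
    """List the required elements that are still empty on this record."""
    return [name for name in required if not getattr(report, name)]


def reporting_deadline(report, window_days):
    """Last date on which the case falls within the registry's reporting window."""
    return report.diagnosis_date + timedelta(days=window_days)


# Made-up example record; the window length comes from your registry's rules.
report = CaseReport("P001", date(2023, 1, 10), primary_site="breast")
print("Missing elements:", missing_elements(report))
print("Report by:", reporting_deadline(report, window_days=95))
```

Passing the window length as a parameter, rather than hard-coding it, keeps the check usable across registries with different timeframes.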

What Do I Need to Know?

Fundamental Tips for Effective Data Generation and Collection

Start with a plan. You’re more likely to succeed and avoid unexpected costs and delays if you document how you will collect and use your data.
- Examine your research question: What do you want to know? What data do you need to address the question?
- Consider contextual variables that could influence your research question. For example, when researching a patient’s response to a particular medication, don’t forget the importance of the person’s ethnicity, race, gender, age, family history, and geographic location.
- Reduce the potential for bias by selecting a wide variety of study subjects and ensuring diverse representation.
Meet the necessary legal and ethical requirements.
- Adhere to all privacy and data confidentiality requirements. Informed consent should specifically address how you will collect and use data, including language that encourages secondary research use.
- Determine if a data sharing agreement is needed.
- Determine where your data will be stored and be sure to meet any state cancer data registry requirements.
Select a format and make sure you adhere to it.
- Your format should apply to your current study but also be congruous with other existing data sets. By setting up your study for sharing right from the start, you won’t need to spend as much time in the data cleaning stages.
- Assign appropriate descriptors to variables based on accepted semantics. Selecting descriptors that are well known and accepted in the field means your metadata will be a good match with other data sets, making data discovery easier and more efficient.
- Determine how you will structure your data to make sorting, filtering, and analyzing most efficient. For example, decide what headings you will use (e.g., patient ID, wave or time variable, measure).
- Identify who will input data and make sure everyone responsible for inputting data uses the same format.
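One way to hold everyone to the same format is to agree on the column headings up front and reject any file that deviates from them. The sketch below assumes a simple "long" layout built from the example headings above (patient ID, wave, measure) plus a value column. The column names, function names, and sample rows are all made up for illustration.

```python
import csv
import io

# Hypothetical shared layout: one row per patient, per time point, per measure.
EXPECTED_COLUMNS = ["patient_id", "wave", "measure", "value"]


def read_long_format(text):
    """Parse CSV text, rejecting it if the header differs from the agreed layout."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected header: {reader.fieldnames}")
    rows = []
    for row in reader:
        row["wave"] = int(row["wave"])  # store the time variable as an integer
        rows.append(row)
    return rows


def rows_for(rows, patient_id=None, measure=None):
    """Filter rows by patient and/or measure, sorted by patient and then wave."""
    selected = [
        r for r in rows
        if (patient_id is None or r["patient_id"] == patient_id)
        and (measure is None or r["measure"] == measure)
    ]
    return sorted(selected, key=lambda r: (r["patient_id"], r["wave"]))


# Made-up sample data in the agreed layout.
SAMPLE = """patient_id,wave,measure,value
P002,1,tumor_size_mm,31
P001,2,tumor_size_mm,18
P001,1,tumor_size_mm,22
P001,1,weight_kg,70
"""

rows = read_long_format(SAMPLE)
for r in rows_for(rows, measure="tumor_size_mm"):
    print(r["patient_id"], r["wave"], r["value"])
```

With one row per observation, sorting and filtering reduce to simple operations on a few fixed columns, which is the efficiency the tip above is aiming for.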
Avoid common pitfalls. You can correct many errors through the process of data cleaning (see related how-to article, “Cleaning Data: The Basics”). With careful planning, you can prevent some of the problems below, or at least address them sooner, before they impact your research:
- Missing data fields
- Duplicated data
- More than one entry per field
- Misnamed categories
- Incorrectly labeled data
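Several of the pitfalls listed above can be caught with a simple automated check at data entry time. The sketch below is a minimal example, assuming records are stored as dictionaries of strings; the field names, the allowed category labels, and the use of a semicolon to detect multiple entries are hypothetical choices. Incorrectly labeled data (a valid label attached to the wrong record) generally cannot be detected this way and still needs review against the source.

```python
# Hypothetical required fields and category labels, for illustration only.
REQUIRED_FIELDS = ["patient_id", "primary_site", "sex"]
ALLOWED_SEX = {"female", "male", "unknown"}


def find_problems(records):
    """Return (record_index, problem) tuples for common data-entry pitfalls."""
    problems = []
    seen = set()
    for i, rec in enumerate(records):
        for field in REQUIRED_FIELDS:
            value = (rec.get(field) or "").strip()
            if not value:
                problems.append((i, f"missing {field}"))
            elif ";" in value:
                problems.append((i, f"more than one entry in {field}"))
        sex = (rec.get("sex") or "").strip()
        if sex and ";" not in sex and sex not in ALLOWED_SEX:
            problems.append((i, f"unrecognized category for sex: {sex!r}"))
        # A record is a duplicate if every field matches an earlier record.
        key = tuple(sorted(rec.items()))
        if key in seen:
            problems.append((i, "duplicate of an earlier record"))
        seen.add(key)
    return problems


# Made-up records showing one instance of each detectable problem.
records = [
    {"patient_id": "P001", "primary_site": "breast", "sex": "female"},
    {"patient_id": "P002", "primary_site": "", "sex": "male"},
    {"patient_id": "P003", "primary_site": "lung; liver", "sex": "F"},
    {"patient_id": "P001", "primary_site": "breast", "sex": "female"},
]

for index, problem in find_problems(records):
    print(f"record {index}: {problem}")
```

Running a check like this as data are entered, rather than during cleaning, surfaces problems while the source records are still at hand.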

NCI Data Generation and Collection Resources and Initiatives

Now that you have a sense of the basics, use the following resources to discover more about the topic and understand NCI’s investment in this stage of the data science lifecycle.

Resources and Tools

NCI Thesaurus (NCIt): This reference is a widely recognized standard for biomedical coding. Use it to assign codes for your variables.
Repositories: The NCI Data Catalog lists data collections produced by major NCI initiatives and other widely used data sets.

All research funded by NIH needs to adhere to the latest Data Management and Sharing Policy, which went into effect in January 2023.

Blogs

Semantics Series: A Deep Dive Into Common Data Elements: Learn about the importance of using accurate descriptors in data science.

Projects

NCI’s SEER Program has a training site with modules to help with collecting and recording cancer data. SEER also offers resources with links to reference materials and organizations that can help with coding and registering cancer cases.
The Cancer Research Data Commons is a cloud-based collection of data sets, accessible through its data commons or cloud resources, which also make thousands of analytical tools available. It offers a wide range of support to researchers, including tutorials, user guides, and office hours, to help them learn to use it.
NCI’s Informatics Technology for Cancer Research (ITCR) Program has training courses available via the ITCR Training Network.

Publications

NCI Imaging Data Commons. Cancer Research, 2021. | Learn about the Imaging Data Commons, one of the data commons in NCI’s Cancer Research Data Commons, along with tools and resources for managing imaging data.
Implementing the FAIR Data Principles in Precision Oncology: Review of Supporting Initiatives. Briefings in Bioinformatics, 2020. | See this systematic review of initiatives that follow FAIR data principles, as well as best practices for supporting data interoperability and reusability.
Insights from Adopting a Data Commons Approach for Large-scale Observational Cohort Studies: The California Teachers Study. Cancer Epidemiology, Biomarkers & Prevention, 2020. | Learn about a scalable, cloud-based infrastructure for managing data, security/access, metadata, and analytical tools.
Reliable Analysis of Clinical Tumor-Only Whole-Exome Sequencing Data. JCO Clinical Cancer Informatics, 2020. | Learn about this open-source option for analyzing comprehensive allele-specific copy number alterations and classifying single nucleotide variants from tumor-only whole exome sequencing.
Data Lakes, Clouds, and Commons: A Review of Platforms for Analyzing and Sharing Genomic Data. Trends in Genetics, 2019. | See samples of software platforms for managing, analyzing, and sharing genomic data, as well as the role of data ecosystems and data lakes.
Recommendations for the Collection and Use of Multiplexed Functional Data for Clinical Variant Interpretation. Genome Medicine, 2019. | Learn about the existing guidelines and recommendations for bridging the gap between multiplexed functional data and existing clinical variant-interpretation frameworks. See how to combine multiplexed functional data with other sources of evidence for more meaningful clinical interpretation.
restfulSE: A Semantically Rich Interface for Cloud-scale Genomics with Bioconductor. F1000 Research, 2019. | Learn about Bioconductor and how to use this tool to target queries for large remote genomic data resources.
Data Harmonization for a Molecularly Driven Health System. Cell, 2018. | See this commentary for more information on the critical role of data harmonization and how data commons can help support the usability, interoperability, and quality of cancer data.
Reengineering Workflow for Curation of DICOM Datasets. Journal of Digital Imaging, 2018. | Learn more about the “Posda Tools” and how researchers used them in a unique way to facilitate rapid analysis of DICOM-formatted data.
Using Semantic Web Technologies to Enable Cancer Genomics Discovery at Petabyte Scale. Cancer Informatics, 2018. | See how to use the Semantic Web-based Data Browser, a tool that allows users to visually build and execute ontology-driven queries.

Ready to start your project? Get an overview of the data science lifecycle and what you should do in each stage.
Want to learn the basic skills for cancer data science? Check out our basic skills video course.
Need answers to data science questions? Visit our Training Guide Library.

Updated: Oct 20, 2023