Generating and Collecting Data: The Basics

What is Data Generation and Collection, and What’s the Difference Between the Two?

Generating data means producing data: your research is underway, and you’re actively producing data to address your research question. Collecting data means putting the data you’ve gathered into a format that allows you to analyze it and share your findings with others. (To learn more about these later stages of the data science lifecycle, refer to the “Learn About Cancer Data Science” webpage.)

Why is Generating and Collecting Data Important for Cancer Research?

In cancer research, we use data to look for new and better ways of diagnosing, preventing, treating, and tracking cancer. Researchers around the world are generating and collecting petabytes of data, including genomic, proteomic, imaging, clinical, and epidemiological data.

Cancer Data Registries

It’s important to know that, by law, facilities collecting data on new cancer cases need to report those cases to a central cancer registry, such as a state registry. When submitting to a cancer registry, you must meet specific requirements to capture important cancer data for every case. Your state cancer data registry will require information that might include histology findings, primary tumor site, and more. Registries also require data to be reported within a certain timeframe, such as within 180 days of diagnosis.

Cancer registry data collection generally includes the following:

  1. Reporting sources: Various sources—including hospitals, pathology laboratories, radiation centers, outpatient facilities, and physicians’ offices—report cancer cases to the registry.
  2. Collecting data elements: Registries collect a standardized set of data elements, such as patient demographics (e.g., age, sex, race/ethnicity), tumor information (e.g., site, stage, grade, subtypes of cancer), treatment details (e.g., surgery, chemotherapy, radiation therapy), and follow-up data (e.g., survival).
  3. Abstracting information: Trained registrars review patients’ medical records and abstract relevant information into the registry database. To ensure consistency, the registrars use coding systems, such as the International Classification of Diseases for Oncology and the Surveillance, Epidemiology, and End Results (SEER) Program Coding and Staging Manual.
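
As a rough illustration of the data elements listed above, the sketch below models one abstracted case as a Python record. The class name, field names, and the 30-day window in the usage example are hypothetical placeholders, not a registry standard. A real registry defines its own required fields, uses coded values (for example, from the International Classification of Diseases for Oncology) rather than free text, and sets its own reporting window.

```python
from dataclasses import dataclass, field, fields
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class RegistryCase:
    """One abstracted cancer case (hypothetical field names, for illustration)."""
    # Patient demographics
    age_at_diagnosis: int
    sex: str
    race_ethnicity: str
    # Tumor information (plain-text labels here; registries use coded values)
    primary_site: str
    histology: str
    stage: str
    grade: str
    diagnosis_date: date
    # Treatment details
    treatments: List[str] = field(default_factory=list)
    # Follow-up data
    vital_status: Optional[str] = None

    def missing_elements(self) -> List[str]:
        """Names of data elements that are still empty."""
        return [f.name for f in fields(self)
                if getattr(self, f.name) in ("", None, [])]

    def report_due_by(self, reporting_window_days: int) -> date:
        """Latest reporting date for a given window after diagnosis."""
        return self.diagnosis_date + timedelta(days=reporting_window_days)


case = RegistryCase(
    age_at_diagnosis=61, sex="female", race_ethnicity="",
    primary_site="breast", histology="ductal carcinoma",
    stage="II", grade="2", diagnosis_date=date(2024, 1, 15),
    treatments=["surgery"],
)
print(case.missing_elements())   # ['race_ethnicity', 'vital_status']
print(case.report_due_by(30))    # 2024-02-14 (30 days is a placeholder window)
```

Listing the empty elements before submission is one simple way to catch incomplete records while the medical record is still at hand.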

What Do I Need to Know?

Fundamental Tips for Effective Data Generation and Collection

  • Start with a plan. You’re more likely to succeed, and to avoid unexpected costs and delays, if you document how you will collect and use your data.
    • Examine your research question: What do you want to know? What data do you need to address the question?
    • Consider contextual variables that could influence your research question. For example, when researching a patient’s response to a particular medication, don’t forget the importance of the person’s ethnicity, race, gender, age, family history, and geographic location.
    • Reduce the potential for bias by selecting a diverse, representative group of study subjects.
  • Meet the necessary legal and ethical requirements.
    • Adhere to all privacy and data confidentiality requirements. Informed consent should specifically address how you will collect and use data, including any secondary research use.
    • Determine if a data sharing agreement is needed.
    • Determine where your data will be stored and be sure to meet any state cancer data registry requirements.
  • Select a format and make sure you adhere to it.
    • Your format should suit your current study but also be compatible with other existing data sets. By setting up your study for sharing right from the start, you won’t need to spend as much time in the data cleaning stages.
    • Assign appropriate descriptors to variables based on accepted semantics. Selecting descriptors that are well known and accepted in the field means your metadata will be a good match with other data sets, making data discovery easier and more efficient.
    • Determine how you will structure your data to make sorting, filtering, and analyzing most efficient. For example, decide what headings you will use (e.g., patient ID, wave or time variable, measure).
    • Identify who will input data and make sure everyone responsible for inputting data uses the same format.
  • Avoid common pitfalls. You can correct many errors through the process of data cleaning (see the related how-to article, “Cleaning Data: The Basics”). With careful planning, you can prevent some of the problems below, or at least address them sooner, before they impact your research:
    • Missing data fields
    • Duplicated data
    • More than one entry per field
    • Misnamed categories
    • Incorrectly labeled data
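
Several of the pitfalls above can be caught with a simple automated check at data entry time. The sketch below is a minimal example, assuming rows are stored as dictionaries with the hypothetical headings `patient_id`, `wave`, `measure`, and `value`, that a semicolon marks multiple entries in one field, and that the allowed measure names are agreed on in advance. Incorrectly labeled data (a valid label on the wrong value) usually cannot be detected this way and still needs manual review.

```python
# Hypothetical headings and category names, used only for illustration.
REQUIRED_FIELDS = ["patient_id", "wave", "measure", "value"]
ALLOWED_MEASURES = {"tumor_size_mm", "weight_kg"}


def find_problems(rows):
    """Return (row_number, problem) pairs for a list of row dictionaries."""
    problems = []
    seen = set()
    for number, row in enumerate(rows, start=1):
        for name in REQUIRED_FIELDS:
            value = (row.get(name) or "").strip()
            if not value:
                problems.append((number, f"missing {name}"))
            elif ";" in value:
                # Assumes ";" is how people squeeze two entries into one field.
                problems.append((number, f"more than one entry in {name}"))
        measure = (row.get("measure") or "").strip()
        if measure and measure not in ALLOWED_MEASURES:
            problems.append((number, f"unrecognized measure: {measure}"))
        # A repeated patient/wave/measure combination is treated as a duplicate.
        key = (row.get("patient_id"), row.get("wave"), row.get("measure"))
        if key in seen:
            problems.append((number, "duplicate of an earlier row"))
        seen.add(key)
    return problems


rows = [
    {"patient_id": "P001", "wave": "1", "measure": "tumor_size_mm", "value": "12"},
    {"patient_id": "P001", "wave": "1", "measure": "tumor_size_mm", "value": "12"},
    {"patient_id": "P002", "wave": "1", "measure": "Tumor Size", "value": "9"},
    {"patient_id": "P003", "wave": "1", "measure": "weight_kg", "value": ""},
    {"patient_id": "P004", "wave": "1", "measure": "weight_kg", "value": "70; 71"},
]

for number, problem in find_problems(rows):
    print(f"row {number}: {problem}")
```

Running a check like this each time data is entered, rather than once at the end of the study, keeps the later cleaning stage shorter.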

NCI Data Generation and Collection Resources and Initiatives

Now that you have a sense of the basics, use the following resources to discover more about the topic and understand NCI’s investment in this stage of the data science lifecycle.

Resources and Tools

  • NCI Thesaurus (NCIt): This reference is a widely recognized standard for biomedical coding. Use it to assign codes for your variables.
  • Repositories: The NCI Data Catalog lists data collections produced by major NCI initiatives and other widely used data sets.

All research funded by NIH needs to adhere to the latest Data Management and Sharing Policy, which went into effect in January 2023.
