Cleaning Data: The Basics

What is Data Cleaning?

At its most basic level, data cleaning is the process of fixing or removing data that’s inaccurate, duplicated, or outside the scope of your research question.

Some errors might be hard to avoid. You may have made a mistake during data entry, or you might have a corrupt file. You may find the format is wrong for combining multiple data sets or different sources. Or you may have metadata that’s mislabeled.

Before beginning to clean your data, it’s a good idea to keep a copy of the raw data set. If you make an error during the cleaning stage, you can always go back to the original, and you won’t lose important information.
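
One way to keep that untouched copy is sketched below. It is a minimal illustration using only Python’s standard library. The function names `backup_raw` and `sha256_of` and the folder layout are assumptions made for this example, not part of any NCI tool.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup_raw(raw_path: Path, backup_dir: Path) -> Path:
    """Copy the raw file into backup_dir and confirm the copy is identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / raw_path.name
    shutil.copy2(raw_path, target)  # copy2 also preserves file timestamps
    if sha256_of(raw_path) != sha256_of(target):
        raise IOError(f"Backup of {raw_path} does not match the original")
    return target
```

Calling `backup_raw(Path("study.csv"), Path("raw_backup"))` (hypothetical paths) before any cleaning step gives you a verified copy to return to. Comparing checksums is a design choice here: it catches a truncated or corrupted copy right away rather than when you need the backup.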

When working with data, remember the three Cs:

  • Complete: Avoid missing data. You can use default records as stand-ins for incomplete data sets, recode data using a different format, or fill in missing values using a statistical tool. Be sure to use metadata that’s appropriate for the data type and topic.
  • Consistent: Ensure that the data collected at the beginning of the study matches the data collected at the end of the study (in both semantics and scope).
  • Correct: Look for outliers and duplicates. Duplicate records can lead to incorrect calculations and skew statistical results, so be sure to delete them. You can identify outliers using statistics (e.g., z scores or box plots). Before removing any outliers, consider the significance of these data and how removal could affect downstream analytics. Sometimes outliers are deceptive, but sometimes they offer insightful information.

Following these three Cs will help you when it comes time to aggregate data and will make filtering, selecting, and calculating more efficient.
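
The three checks above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended pipeline: the records, field names, the choice of the median as the fill value, and the z score threshold are all invented for the example.

```python
from statistics import mean, median, stdev

# Hypothetical records: field names and values are invented for illustration.
records = [
    {"patient_id": "P01", "tumor_size_mm": 10.0},
    {"patient_id": "P02", "tumor_size_mm": 11.0},
    {"patient_id": "P03", "tumor_size_mm": 12.0},
    {"patient_id": "P04", "tumor_size_mm": None},   # missing value
    {"patient_id": "P05", "tumor_size_mm": 10.0},
    {"patient_id": "P06", "tumor_size_mm": 12.0},
    {"patient_id": "P07", "tumor_size_mm": 11.0},
    {"patient_id": "P08", "tumor_size_mm": 90.0},   # possible outlier
    {"patient_id": "P02", "tumor_size_mm": 11.0},   # exact duplicate of P02
]


def inconsistent_rows(rows):
    """Consistent: return rows whose field names differ from the first row's."""
    expected = set(rows[0])
    return [r for r in rows if set(r) != expected]


def drop_duplicates(rows):
    """Correct: keep only the first occurrence of each identical record."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def fill_missing(rows, field):
    """Complete: replace None in `field` with the median of the observed values."""
    fill = median(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def flag_outliers(rows, field, threshold):
    """Correct: return (not remove) rows whose absolute z score exceeds threshold."""
    values = [r[field] for r in rows]
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [r for r in rows if abs(r[field] - mu) / sigma > threshold]


unique = drop_duplicates(records)
completed = fill_missing(unique, "tumor_size_mm")
# A low threshold is used because z scores cannot get large in very small samples.
outliers = flag_outliers(completed, "tumor_size_mm", threshold=2.0)
print([r["patient_id"] for r in outliers])  # prints ['P08']
```

Note that `flag_outliers` only reports candidates. Whether to keep or remove a flagged record remains a judgment about what the value means for your analysis.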

Why Do We Need Clean Data for Cancer Research?

Accurate data supports sound decision-making, helping you address your research question and allowing you to avoid misleading findings and costly mistakes.

What Do I Need to Know?

Quality data takes effort. Below are some typical areas that can cause problems:

  • Mismatched or incomplete metadata. One of the most common problems occurs when researchers assign the wrong code. You may also find that codes change over time with the release of new versions. NCI Thesaurus can help you assign the correct codes. For more on the importance of semantics in data science, see the blogs, “Semantics Primer,” and “Semantics Series: A Deep Dive Into Common Data Elements.”
  • Inconsistent formatting. Review your formatting and watch carefully for data entry errors, checking that each entry exactly matches your research. Check your columns to make certain you’ve used the same descriptors consistently. You can drop any columns that aren’t immediately relevant to your research question, and you can split columns as needed (depending on the software program that you’re using). Be sure to keep one entry per cell. You can flag any entries that need more attention (such as checking a patient’s medication history or confirming a date). You can always go back to those problem areas and resolve them when you have more information.
  • Watch for bias. Data bias is another area that can result in misleading conclusions. Personal or societal biases can creep into research, even without your knowledge. It’s difficult to de-bias data during data cleaning. It’s better to think about the research questions you’ll ask and look for ways to offset bias before you collect the data. For example, you might want to recruit a range of study subjects by retooling your informed consent forms and broadening your outreach. You also might need to make adjustments to mitigate algorithm and data collection biases.
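
The formatting advice above (consistent descriptors, one entry per cell, flagging entries for later review) can be sketched as follows. The column names, the label mapping, and the date format are assumptions chosen for this example. Adapt them to your own data dictionary.

```python
from datetime import datetime

# Hypothetical mapping of free-text descriptors to one canonical label.
SEX_LABELS = {"f": "female", "female": "female", "m": "male", "male": "male"}


def normalize_label(value, mapping):
    """Return the canonical label, or None when the entry is not recognized."""
    return mapping.get(value.strip().lower())


def parse_date(value, fmt="%Y-%m-%d"):
    """Return a date, or None when the text is not a valid date in `fmt`."""
    try:
        return datetime.strptime(value.strip(), fmt).date()
    except ValueError:
        return None


def split_blood_pressure(value):
    """Split a combined 'systolic/diastolic' cell into two integer values."""
    systolic, diastolic = value.split("/")
    return int(systolic), int(diastolic)


def clean_row(row):
    """Return a cleaned copy of the row with a list of fields flagged for review."""
    flags = []
    sex = normalize_label(row["sex"], SEX_LABELS)
    if sex is None:
        flags.append("sex")
    visit = parse_date(row["visit_date"])
    if visit is None:
        flags.append("visit_date")
    systolic, diastolic = split_blood_pressure(row["blood_pressure"])
    return {
        "patient_id": row["patient_id"],
        "sex": sex,
        "visit_date": visit,
        "systolic": systolic,      # one entry per cell
        "diastolic": diastolic,
        "needs_review": flags,     # resolve these when more information is available
    }


example = clean_row({
    "patient_id": "P01",
    "sex": " F ",
    "visit_date": "2023-02-30",   # not a real calendar date, so it gets flagged
    "blood_pressure": "120/80",
})
```

Here the unparseable date is flagged rather than guessed at, which keeps the problem visible until someone can confirm the correct value.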

Repository Matters

You can maximize your data’s discoverability and re-use by uploading your files to a general or specialty data repository. Repositories serve as archives for data. They may have different data requirements. Some generalist collections allow you to upload a variety of formats and data types whereas specialty collections have very specific guidelines.

After you submit your data to a registry, the repository staff will do the following:

  1. Check your data for errors, inconsistencies, or missing information. Quality control includes regular checks for data completeness, accuracy, and adherence to coding standards.
  2. Validate your data. This may include registrars cross-checking data with multiple sources and/or verifying specific details with healthcare providers.
  3. Ensure your data are correctly linked. Data may be linked with other databases, such as vital records, to gather additional information and ensure comprehensive data capture for each case.
  4. Remove certain patient information. Personal identifiers, which link data to a specific person, are typically removed from the data to protect patient privacy. This is done before the data are sent to a repository for broader distribution.
  5. Check that your data fits the repository’s system. Registries follow standardized coding systems and reporting guidelines to ensure consistency across different regions and over time, allowing for meaningful comparisons and analysis.

Setting up your data correctly from the start can help you avoid delays in formatting when it comes time to deposit your data, especially if your research is NIH funded. NIH’s Data Management and Sharing Policy requires making effective data management and sharing practices a routine part of scientific discovery.

Privacy is Vital

If you’re working with genetic data, imaging data, or other data that includes personal information, you must take steps to ensure patient privacy. The Health Insurance Portability and Accountability Act (HIPAA) requires that you remove patients’ personal information.
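
As a rough illustration of removing personal identifiers from tabular records, the sketch below drops a set of identifier fields and replaces the patient ID with a keyed hash. Everything in it is an assumption made for the example: the field list, the `study_id` name, and the sample record are hypothetical, and running code like this does not by itself make a data set HIPAA compliant. Check with your institution’s privacy office about which fields must be removed.

```python
import hashlib
import hmac

# Hypothetical list of direct identifiers; a real project needs its own reviewed list.
IDENTIFIER_FIELDS = {"name", "date_of_birth", "street_address", "phone", "email"}


def pseudonym(patient_id: str, secret_key: bytes) -> str:
    """Derive a stable study ID from the patient ID with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def deidentify(row: dict, secret_key: bytes) -> dict:
    """Drop identifier fields and swap the patient ID for a pseudonym."""
    kept = {
        k: v for k, v in row.items()
        if k not in IDENTIFIER_FIELDS and k != "patient_id"
    }
    kept["study_id"] = pseudonym(row["patient_id"], secret_key)
    return kept
```

The keyed hash means the same patient always maps to the same `study_id`, so records can still be linked within the study, while anyone without the key cannot easily recompute the mapping. The key itself must be stored separately from the shared data.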

The Informatics Technology for Cancer Research (ITCR) Program has a course, “Ethical Data Handling for Cancer Research,” that you can take to better understand important ethical principles of data management from a privacy, security, usability, and discoverability perspective.

Documentation is Key

Tracking how you cleaned your data can help save time in the future, reminding you of the types of errors you encountered and the approaches you used to fix those errors. It’s also good to document how you managed outliers.
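
A cleaning log does not need special software. The sketch below shows one possible shape for such a log, using only Python’s standard library. The function name `log_step`, the entry fields, and the example entries are assumptions for illustration.

```python
import json
from datetime import date


def log_step(log, action, reason, rows_before, rows_after):
    """Append one cleaning step to the log and return the new entry."""
    entry = {
        "date": date.today().isoformat(),
        "action": action,
        "reason": reason,
        "rows_before": rows_before,
        "rows_after": rows_after,
        "rows_removed": rows_before - rows_after,
    }
    log.append(entry)
    return entry


cleaning_log = []
log_step(cleaning_log, "drop exact duplicates", "same record entered twice", 9, 8)
log_step(cleaning_log, "keep flagged outlier", "value confirmed against the chart", 8, 8)

# Serialize the log so it can be saved alongside the cleaned data set.
log_text = json.dumps(cleaning_log, indent=2)
```

Recording decisions that changed nothing, such as keeping an outlier, is as useful as recording deletions, because it shows that the value was reviewed.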

If you use informatics tools in your research but have not had training in reproducibility tools and methods, take ITCR’s “Intro to Reproducibility in Cancer Informatics” course. You’ll gain skills in writing durable code, making a project open source with GitHub, analyzing documents, and more.

After you’ve completed the introductory course, take “Advanced Reproducibility in Cancer Informatics,” which will teach you more complex GitHub functions, how to engage in code review, how to modify a Docker image, and more.

Reminders to Keep in Mind

  • Plan your data collection efforts well in advance of starting your study, and be sure to keep careful documentation. Doing this will minimize the time-consuming and tedious task of cleaning data.
  • See the article, “Generating and Collecting Data: The Basics” for more tips.
  • Technology also may be able to help lighten your data-cleaning workload. Traditionally, data cleaning has been an arduous task that relied heavily on human decisions. This may be changing, however, as technology helps make some of these decisions. For example, both commercial and open-source tools are now available that can remove unnecessary columns, filter results, and validate data sets.
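
To give a sense of what such tools automate, here is a minimal sketch of the three operations named above (removing columns, filtering, and validating), written with Python’s standard library rather than any particular product. The sample CSV text, the column names, and the validation rules are hypothetical.

```python
import csv
import io

# Hypothetical CSV content; in practice this would be read from a file.
RAW_CSV = (
    "patient_id,age,stage,notes\n"
    "P01,54,II,follow up\n"
    "P02,-3,III,\n"
    "P03,61,IX,check chart\n"
)

KEEP_COLUMNS = ["patient_id", "age", "stage"]   # drop everything else
VALID_STAGES = {"0", "I", "II", "III", "IV"}    # assumed rule for this example


def load_rows(text, keep):
    """Read CSV text and keep only the listed columns."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: row[name] for name in keep} for row in reader]


def parse_age(text):
    """Return the age as an integer, or None when the text is not a whole number."""
    try:
        return int(text)
    except ValueError:
        return None


def validate(row):
    """Return the names of the fields in this row that break a rule."""
    problems = []
    age = parse_age(row["age"])
    if age is None or not 0 <= age <= 120:
        problems.append("age")
    if row["stage"] not in VALID_STAGES:
        problems.append("stage")
    return problems


rows = load_rows(RAW_CSV, KEEP_COLUMNS)
report = {row["patient_id"]: validate(row) for row in rows}
valid_rows = [row for row in rows if not validate(row)]   # filter step
```

In this sketch the report keeps the failing rows visible by patient ID, so a person can still decide how to resolve each one instead of having it silently discarded.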

NCI Data Cleaning Resources and Initiatives

Now that you have a sense of the basics, use the following resources to discover more about the topic and understand NCI’s investment in this stage of the data science lifecycle.




