Cleaning Data: The Basics
What is Data Cleaning?
At its most basic level, data cleaning is the process of fixing or removing data that’s inaccurate, duplicated, or outside the scope of your research question.
Some errors might be hard to avoid. You may have made a mistake during data entry, or you might have a corrupt file. You may find the format is wrong for combining multiple data sets or different sources. Or you may have metadata that’s mislabeled.
Before beginning to clean your data, it’s a good idea to keep a copy of the raw data set. If you make an error during the cleaning stage, you can always go back to the original, and you won’t lose important information.
In working with data, remember the three Cs:
- Complete—Avoid missing data. You can use default records as stand-ins for incomplete data sets. Or you can recode data using a different format or fill in missing values using a statistical tool. Be sure to use metadata that’s appropriate for the data type and topic.
- Consistent—Ensure that the data collected at the beginning of the study matches data from the end of the study (in both semantics and scope).
- Correct—Look for outliers and duplicates. Duplicate records can lead to incorrect calculations and skew statistical results, so be sure to delete them. You can identify outliers using statistics (e.g., z scores or box plots). Before removing any outliers, consider the significance of these data and how removal could affect downstream analyses. Sometimes outliers are deceptive, but sometimes they offer valuable insight.
Following these three Cs will help you when it comes time to aggregate data and will make filtering, selecting, and calculating more efficient.
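The three Cs can be sketched in a few lines of Python with pandas. This is a minimal illustration, not a complete workflow; the column names (`patient_id`, `age`, `tumor_size_mm`) and the z-score threshold of 2 are invented for the example.

```python
import pandas as pd

# Hypothetical study data with one duplicated record, one missing age,
# and one suspiciously large tumor measurement.
df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4, 5, 6, 7, 8],
    "age": [54, 61, 61, None, 47, 58, 63, 50, 55],
    "tumor_size_mm": [12.0, 15.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 92.0],
})

# Complete: fill the missing value with a statistic (here, the median).
df["age"] = df["age"].fillna(df["age"].median())

# Correct: delete duplicate records.
df = df.drop_duplicates()

# Correct: flag possible outliers with a z score rather than silently
# dropping them, so you can weigh their significance first.
z = (df["tumor_size_mm"] - df["tumor_size_mm"].mean()) / df["tumor_size_mm"].std()
df["possible_outlier"] = z.abs() > 2
```

Flagging rather than deleting outliers preserves the option to keep them if they turn out to be real, informative observations.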
Why Do We Need Clean Data for Cancer Research?
Accurate data supports sound decision making, helping you address your research question and allowing you to avoid misleading findings and costly mistakes.
What Do I Need to Know?
Quality data takes effort. Below are some typical areas that can cause problems:
- Mismatched or incomplete metadata. One of the most common problems occurs when researchers assign the wrong code. You may also find that codes change over time with the release of new versions. NCI Thesaurus can help you assign the correct codes. For more on the importance of semantics in data science, see the blogs “Semantics Primer” and “Semantics Series: A Deep Dive Into Common Data Elements.”
- Inconsistent formatting. Review your formatting and watch carefully for data entry errors, making sure that entries exactly match your source data. Check your columns to make certain you’ve used the same descriptors consistently. You can drop any columns that aren’t immediately relevant to your research question, and you can split columns as needed (depending on the software program that you’re using). Be sure to keep one entry per cell. You can flag any entries that need more attention (such as checking a patient’s medication history or confirming a date) and return to those problem areas when you have more information.
- Data bias. Bias is another area that can result in misleading conclusions. Personal or societal biases can creep into research, even without your knowledge. It’s difficult to de-bias data during data cleaning. It’s better to think about the research questions you’ll ask and look for ways to offset bias before you collect the data. For example, you might want to recruit a range of study subjects by retooling your informed consent forms and broadening your outreach. You also might need to make adjustments to mitigate algorithm and data collection biases.
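The formatting checks in the list above can also be sketched with pandas: dropping an irrelevant column, splitting a cell that holds multiple entries, and flagging (rather than silently fixing) entries that need attention. The column names and the semicolon delimiter are hypothetical examples.

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": ["P01", "P02"],
    "meds": ["tamoxifen; letrozole", "anastrozole"],  # two entries in one cell
    "visit_date": ["2023-01-15", "15/01/2023"],       # inconsistent date formats
    "notes": ["", ""],                                # not relevant to the question
})

# Drop a column that isn't relevant to the research question.
df = df.drop(columns=["notes"])

# Split multi-entry cells so each row holds one entry per cell.
df["meds"] = df["meds"].str.split("; ")
df = df.explode("meds")

# Flag entries that need more attention: dates that don't match the
# expected ISO 8601 format get marked for later follow-up.
parsed = pd.to_datetime(df["visit_date"], format="%Y-%m-%d", errors="coerce")
df["needs_review"] = parsed.isna()
```

Keeping a `needs_review` flag lets you resolve problem entries later, once you can confirm the correct values, instead of guessing during cleaning.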
Repository Matters
You can maximize your data’s discoverability and reuse by uploading your files to a general or specialty data repository. Repositories serve as archives for data, and their requirements differ: some generalist collections accept a variety of formats and data types, whereas specialty collections have very specific guidelines.
After you submit your data to a repository, the repository staff will do the following:
- Check your data for errors, inconsistencies, or missing information. Quality control includes regular checks for data completeness, accuracy, and adherence to coding standards.
- Validate your data. This may include registrars cross-checking data with multiple sources and/or verifying specific details with healthcare providers.
- Ensure your data are correctly linked. Data may be linked with other databases, such as vital records, to gather additional information and ensure comprehensive data capture for each case.
- Remove certain patient information. Personal identifiers, which link data to a specific person, are typically removed to protect patient privacy. This is done before the data are sent to a repository for broader distribution.
- Check that your data fits the repository’s system. Repositories follow standardized coding systems and reporting guidelines to ensure consistency across different regions and over time, allowing for meaningful comparisons and analysis.
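As one simplified illustration of the identifier-removal step above, direct identifiers can be dropped before a data set leaves your hands. The column list here is invented and is not a complete HIPAA Safe Harbor identifier list; follow your repository’s and IRB’s actual de-identification requirements.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jane Doe"],          # direct identifier
    "mrn": ["123-45-678"],         # medical record number (direct identifier)
    "diagnosis": ["C50.9"],        # research variable
    "age_at_dx": [58],             # research variable
})

# Illustrative list only -- real de-identification covers many more fields
# (dates, addresses, device IDs, etc.) per the HIPAA Safe Harbor method.
DIRECT_IDENTIFIERS = ["name", "mrn"]

deidentified = df.drop(columns=DIRECT_IDENTIFIERS)
```

Working on a `deidentified` copy, rather than overwriting the original, mirrors the earlier advice to keep the raw data set intact.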
Setting up your data correctly from the start can help you avoid delays in formatting when it comes time to deposit your data, especially if your research is NIH funded. NIH’s Data Management and Sharing Policy requires making effective data management and sharing practices a routine part of scientific discovery.
Privacy is Vital
If you’re working with genetic data, imaging data, or other data that includes personal information, you must take steps to ensure patient privacy. The Health Insurance Portability and Accountability Act (HIPAA) requires you to remove patients’ personal information.
The Informatics Technology for Cancer Research (ITCR) Program has a course, “Ethical Data Handling for Cancer Research,” that you can take to better understand important ethical principles of data management from a privacy, security, usability, and discoverability perspective.
Documentation is Key
Tracking how you cleaned your data can help save time in the future, reminding you of the types of errors you encountered and the approaches you used to fix those errors. It’s also good to document how you managed outliers.
If you use informatics tools in your research but have not had training in reproducibility tools and methods, take ITCR’s “Intro to Reproducibility in Cancer Informatics” course. You’ll gain skills in writing durable code, making a project open source with GitHub, analyzing documents, and more.
After you’ve completed the introductory course, take “Advanced Reproducibility in Cancer Informatics,” which will teach you more complex GitHub functions, how to engage in code review, how to modify a Docker image, and more.
Reminders to Keep in Mind
- Plan your data collection efforts well in advance of starting your study, and be sure to keep careful documentation. Doing this will minimize the time-consuming and tedious task of cleaning data.
- See the article, “Generating and Collecting Data: The Basics” for more tips.
- Technology also may be able to help lighten your data-cleaning workload. Traditionally, data cleaning has been an arduous task that relied heavily on human decisions. This may be changing, however, as technology helps make some of these decisions. For example, tools, both commercial and open source, are now available that can remove unnecessary columns, filter results, and validate data sets.
NCI Data Cleaning Resources and Initiatives
Now that you have a sense of the basics, use the following resources to discover more about the topic and understand NCI’s investment in this stage of the data science lifecycle.
Blogs
- Semantics Series: A Deep Dive Into Common Data Elements: Learn how using proper descriptors can help you prepare your data for analysis.
Projects
- NCI’s Surveillance, Epidemiology, and End Results (SEER) Program has a training site with modules to help with collecting and recording cancer data. SEER also offers resources with links to reference materials and organizations that can help with coding and registering cancer cases.
- The NCI Cancer Research Data Commons offers a wide range of support to researchers, including tutorials, user guides, and office hours, to help them learn to use this cloud-based collection of data sets. The data are accessible through its data commons and cloud resources, which also make thousands of analytical tools available.
Publications
- Interoperable Slide Microscopy Viewer and Annotation Tool for Imaging Data Science and Computational Pathology. Nature Communications, 2023. | Learn about Slim, an open-source, web-based slide microscopy viewer that helps facilitate interoperability with a range of existing medical imaging systems.
- Effects of Slide Storage on Detection of Molecular Markers by IHC and FISH in Endometrial Cancer Tissues From a Clinical Trial: An NRG Oncology/GOG Pilot Study. Applied Immunohistochemistry & Molecular Morphology, 2022. | See a study that showed that although it’s feasible to use aged-stored slides for identifying biomarkers for cancer, the results may modestly underestimate the true values in endometrial cancer.
- Uniform Genomic Data Analysis in the NCI Genomic Data Commons. Nature Communications, 2021. | Learn about the pipelines and workflows used to process and harmonize data in NCI’s Genomic Data Commons.
- Robustness Study of Noisy Annotation in Deep Learning Based Medical Image Segmentation. Physics in Medicine and Biology, 2020. | See a study showing that a deep network trained with noisy labels is inferior to one trained with reference annotations.
- Screen Technical Noise in Single Cell RNA Sequencing Data. Genomics, 2020. | Learn about a new data cleaning pipeline for single cell RNA-seq data.
- Building Portable and Reproducible Cancer Informatics Workflows: An RNA Sequencing Case Study. Methods in Molecular Biology, 2019. | See a case study using different tools in NCI’s Cancer Genomics Cloud for analyzing RNA sequencing data.
- QuagmiR: A Cloud-based Application for isomiR Big Data Analytics. Bioinformatics, 2019. | Learn about QuagmiR, a cloud-based tool for analyzing microRNA isoforms from next generation sequencing data.
- RNA-seq from Archival FFPE Breast Cancer Samples: Molecular Pathway Fidelity and Novel Discovery. BMC Medical Genomics, 2019. | See information on a formalin-fixed, paraffin-embedded, RNA sequencing pipeline for research on breast cancer.
- Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. Cell Systems, 2018. | Learn about the “Multi-Center Mutation Calling in Multiple Cancers” project. See how this comprehensive encyclopedia of somatic mutations helps enable cross-tumor-type analyses using The Cancer Genome Atlas data sets.
- Ready to start your project? Get an overview of the data science lifecycle and what you should do in each stage.
- Want to learn the basic skills for cancer data science? Check out our basics skills video course.
- Need answers to data science questions? Visit our Training Guide Library.