Exploring and Analyzing Data: The Basics
What is Data Exploration and Analysis?
This two-part stage of the data science lifecycle helps you identify what you want to learn from the data, and then act toward understanding the meaning of that data.
Begin by exploring the data, that is, getting familiar with it. You’ll look for patterns and trends in your data set to form one or more hypotheses that you may want to investigate further. This may include visualizing your data and trying to identify the relationships between variables.
Next, analyze your data. You’ll use statistical models and/or machine learning (a form of artificial intelligence) to test the hypotheses that interested you during exploration. As an example, let’s say you find a relationship between two variables. You decide to build a more sophisticated model to see if the relationship holds up.
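To make the two-variable example concrete, here is a minimal Python sketch of one way you might check whether a relationship holds up. It computes a Pearson correlation and then runs a simple permutation test. The function names, the measurements, and the choice of a permutation test are all assumptions made for illustration; the right method for your project depends on your study design and data.

```python
import random
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation test for the correlation between x and y.

    Shuffling y breaks any real link to x, so the shuffled correlations
    show what chance alone tends to produce.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            extreme += 1
    # Add 1 to both counts so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements, invented for this example only.
biomarker_level = [1.2, 2.3, 2.9, 4.1, 5.0, 6.2, 7.1, 8.4]
tumor_size_mm = [10.5, 13.1, 14.8, 18.2, 19.9, 24.1, 25.7, 30.2]

r = pearson_r(biomarker_level, tumor_size_mm)
p_value = permutation_p_value(biomarker_level, tumor_size_mm)
print(f"r = {r:.3f}, permutation p-value = {p_value:.4f}")
```

A small p-value suggests the relationship is unlikely to be due to chance alone, but it does not rule out bias or confounding, so keep your study design in mind when you interpret it.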
Why is Data Exploration and Analysis Important for Cancer Research?
How we use data greatly impacts our ability to advance cancer research. Data exploration and analysis can lead you to new ideas and new discoveries, and can ensure you’re getting the most out of the data you’ve collected during your research.
What Do I Need to Know?
Data Exploration and Analysis Concepts
As you’re exploring and analyzing your data, there are important concepts to keep in mind. We’ve outlined some key terms so that you can familiarize yourself with them and get an idea of how to apply them to your data.
- Study design: Your study design shapes your exploration and analysis. Some common study designs are prospective cohort, retrospective cohort, case-control, and randomized clinical trial. Each has strengths and weaknesses and will affect the type of analysis that you do.
- Bias: To understand the limits of your data, you’ll need to identify potential biases in the data (e.g., “Are certain data elements not available for certain people?”). Who was included in the study, and which population they represent (or don’t), contributes to potential bias.
- Variable distribution: As you’re comparing variables and the relationships among them, you’ll want to know if there were any unique factors in how you (or another researcher) collected and reported the data. This can help you better understand the source of outliers or bias in your data.
- Familiarize yourself with the data: This includes determining what type of data you have (e.g., categorical, continuous, ordinal). Then, examine the relative number of times each outcome occurred, the range of possible values, and the mean or median. Also, pay attention to whether you have missing data. Missing data can occur if a variable was not collected or available for certain people. If the amount of missing data is large, you may not be able to use that variable in your analysis, or you may need to consult an expert to help you.
- Outliers: Look for outliers, which are data points that differ significantly from your other observations.
- Hypothesis: As you explore your data, you’ll come up with a question you want to answer and create a hypothesis (a guess about what you think you’ll learn about that question). You’ll choose a statistical model or machine learning tools to analyze your data and test whether the evidence supports or contradicts your hypothesis.
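Several of the concepts above (data types, frequencies, range, mean or median, missing data, and outliers) can be checked with a few lines of code. The following is a minimal sketch using only Python’s standard library. The records, the function names, and the 1.5 × IQR rule for flagging outliers are assumptions chosen for illustration, not a recommended standard for your data.

```python
from collections import Counter
from statistics import mean, median, quantiles


def summarize_continuous(values):
    """Summarize a continuous variable, treating None as missing."""
    observed = [v for v in values if v is not None]
    return {
        "n_total": len(values),
        "n_missing": len(values) - len(observed),
        "min": min(observed),
        "max": max(observed),
        "mean": mean(observed),
        "median": median(observed),
    }


def flag_outliers_iqr(values, k=1.5):
    """Return observed values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = quantiles(observed, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in observed if v < low or v > high]


# Hypothetical records, invented for this example only.
ages = [52, 61, 58, None, 47, 66, 59, 63, 55, 112]  # continuous
stages = ["I", "II", "II", "III", None, "I", "II", "IV", "II", "I"]  # ordinal

age_summary = summarize_continuous(ages)
stage_counts = Counter(s for s in stages if s is not None)
age_outliers = flag_outliers_iqr(ages)

print(age_summary)
print(dict(stage_counts))
print("Possible outliers:", age_outliers)
```

In this made-up data set, the age of 112 is flagged for a closer look. A flagged value is not necessarily an error; it is a prompt to check how that value was collected and reported.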
Fundamental Tips for Effective Data Exploration and Analysis
- Talk to lots of different types of researchers who are doing different types of analysis. Your peers can help! You’ll learn how other people are using tools, or you’ll discover new tools. Collaborating may also help you formulate your hypotheses.
- Learn different statistical and machine learning techniques. Basic techniques such as plotting data or simple linear regression can help you start exploring your data.
- Read published papers in your field. This will give you a more well-rounded idea of what questions peers are asking, what tools they’re using, and methodologies you might use when exploring and analyzing your own data.
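As an example of one of the basic techniques mentioned above, here is a minimal sketch of simple linear regression fit by ordinary least squares, written in plain Python. The variable names and values are hypothetical. In practice you would more likely use an established statistics package than write the formulas yourself.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical dose and response values, invented for this example only.
dose = [1.0, 2.0, 3.0, 4.0, 5.0]
response = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(dose, response)
print(f"response ~ {intercept:.2f} + {slope:.2f} * dose")
```

The slope estimates how much the response changes for each one-unit increase in dose. Plotting the points together with the fitted line is a quick way to see whether a straight line is a reasonable description of the data.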
NCI Data Exploration and Analysis Resources and Initiatives
Now that you have a sense of the basics, use the following resources to discover more about the topic and understand NCI’s investment in this stage of the data science lifecycle.
Recurring Events
- The Advanced Biomedical Computational Science group and Frederick National Laboratory for Cancer Research host a “Statistics for Lunch” webinar series. Watch their video on “Introduction to Data Exploration.”
Resources and Tools
- Cancer Proteogenomic Data Analysis Site: Access data from the Clinical Proteomic Tumor Analysis Consortium in this web-based interactive platform for proteogenomic data analysis.
- Seven Bridges Cancer Genomics Cloud: This is a good resource regardless of your cloud computing skills. Explore this user-friendly portal and find common, best-practice analysis methods at your fingertips! You can browse a variety of diverse data sets or use your own along with publicly available data. Some data requires an access process.
- ISB Cloud: Through the ISB Cancer Gateway in the Cloud, you can access, explore, and analyze large-scale cancer data through Google Cloud. The resources on this platform enable you to do complex queries from R or Python scripts, or Dockerized workflows, to run on data available in the Google Cloud Storage.
- Broad Institute FireCloud: This platform provides workspaces for bringing together data and tools. Here you can store your private data and the outputs of your analysis. You can also launch interactive analytical applications from the dashboard’s analysis tab.
- The Informatics Technology for Cancer Research Training Network offers a variety of courses.
- In the course “Choosing Genomics Tools,” you’ll find resources and tools for processing and interpreting your data.
- In the course “Computing for Cancer Informatics,” you’ll learn about shared computing resources, data sizes and computational capacity, computing resources designed for research, and more.
Blogs
- ISB-CGC Cloud Resource: Providing Researchers with Shortcuts to Data Analysis: Learn about an NCI cloud resource that can speed up your data analysis process.
- How the Mitelman Database Could Help You Explore Genomic Abnormalities: Are you working with genomic data? Check out this blog about the Mitelman Database, where you can also learn about templates that you can use for your exploration.
- Cloud Resources: Cancer Genomics Cloud (CGC) Helps Power Discovery and Analysis to Advance Cancer Research: Discover how the CGC, a cloud-based platform, can enable you to conduct cancer data analysis more efficiently.
- FireCloud: A Secure Platform for Data Analysis Powered by Terra: Discover the benefits of using FireCloud—an NCI-funded project for accessing data, running analysis tools, and collaborating securely.
Publications
- DepLink: An R Shiny App to Systemically Link Genetic and Pharmacologic Dependencies of Cancer. Bioinformatics Advances, 2023. | Read about an example of researchers using R Shiny to analyze data sets.
- The Relationship Between Family History of Cancer and Cancer Attitudes & Beliefs Within the Community Initiative Towards Improving Equity and Health Status (CITIES) Cohort. PLoS One, 2023. | See how researchers analyzed multiple types of data to examine whether a family history of cancer impacted cancer attitudes and beliefs.
- IndepthPathway: An Integrated Tool for In-depth Pathway Enrichment Analysis Based on Single-cell Sequencing Data. Bioinformatics, 2023. | Explore the development of a tool specialized for pathway enrichment analysis of single-cell transcriptomics data.
Additional Data Exploration and Analysis Resources
- The National Center for Biotechnology Information has an archive of past webinars and workshops where you can find recordings of how to use tools for data analysis.
- You can email Dr. Anne-Michelle Noone, mathematical statistician with the NCI Division of Cancer Control and Population Sciences and contributor to this training article, if you have questions about data exploration and analysis.
- Ready to start your project? Get an overview of the data science lifecycle and what you should do in each stage.
- Want to learn the basic skills for cancer data science? Check out our basic skills video course.
- Need answers to data science questions? Visit our Training Guide Library.