Exploring and Analyzing Data: The Basics

What is Data Exploration and Analysis?

This two-part stage of the data science lifecycle helps you identify what you want to learn from the data, and then act toward understanding the meaning of that data. 

Begin by exploring the data, that is, getting familiar with it. You’ll look for patterns and trends in your data set to form one or more hypotheses that you may want to investigate further. This may include visualizing your data and trying to identify the relationships between variables.

Next, analyze your data. You’ll use statistical models and/or machine learning (a form of artificial intelligence) to test the hypotheses that interested you during exploration. As an example, let’s say you find a relationship between two variables. You decide to build a more sophisticated model to see if the relationship holds up.

Why is Data Exploration and Analysis Important for Cancer Research?

How we use data greatly impacts our ability to advance cancer research. Data exploration and analysis can lead you to new ideas and new discoveries, and can help ensure you’re getting the most out of the data you’ve collected during your research.

What Do I Need to Know?

Data Exploration and Analysis Concepts

As you’re exploring and analyzing your data, there are important concepts to keep in mind. We’ve outlined some key terms so that you can familiarize yourself with them and get an idea of how to apply them to your data. 

  • Study design: This impacts your exploration and analysis. Some common study designs are prospective cohort, retrospective cohort, case-control, and randomized clinical trial. Each has strengths and weaknesses and will impact the type of analysis that you do.
  • Bias: To understand the limits of your data, you’ll need to identify potential biases in the data (e.g., “Are certain data elements not available for certain people?”). Who was included in the study, and which population they represent (or don’t), contributes to potential bias.
  • Variable distribution: As you’re comparing variables and the relationships among them, you’ll want to know if there were any unique factors in how you (or another researcher) collected and reported the data. This can help you better understand the source of outliers or bias in your data.
  • Familiarize yourself with the data: This includes determining what type of data you have (e.g., categorical, continuous, ordinal). Then, examine the relative number of times each outcome occurred, the range of possible values, and the mean or median. Also, pay attention to whether you have missing data. Missing data can occur if a variable was not collected or was not available for certain people. If the amount of missing data is large, you may not be able to use that variable in your analysis, or you may need to consult an expert to help you.
  • Outliers: Look for outliers, which are data points that differ significantly from your other observations. 
  • Hypothesis: As you explore your data, you’ll come up with a question you want to answer and create a hypothesis (a guess about what you think you’ll learn about that question). You’ll choose a statistical model or machine learning tools to analyze your data and try to support or refute your hypothesis through your research.
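
The checks described above (counting outcomes, finding the range, mean, and median, measuring missing data, and flagging outliers) can be sketched in a few lines of Python. This is only an illustration using the standard library. The records, the field names, and the helper functions summarize_numeric and summarize_categorical are all hypothetical, and the 1.5 × IQR rule is one common convention for flagging outliers, not the only one. In practice you might use a data analysis library instead.

```python
from collections import Counter
from statistics import mean, median, quantiles

# Hypothetical records for illustration only; None marks a missing value.
records = [
    {"age": 54, "stage": "II", "tumor_size_mm": 18.0},
    {"age": 61, "stage": "I", "tumor_size_mm": 12.5},
    {"age": 47, "stage": "II", "tumor_size_mm": None},
    {"age": 70, "stage": "III", "tumor_size_mm": 22.0},
    {"age": 58, "stage": "I", "tumor_size_mm": 15.0},
    {"age": 66, "stage": "II", "tumor_size_mm": 95.0},
    {"age": 52, "stage": None, "tumor_size_mm": 17.5},
    {"age": 63, "stage": "I", "tumor_size_mm": 14.0},
]


def summarize_numeric(rows, field):
    """Summarize a continuous variable: range, center, missingness, outliers."""
    values = [r[field] for r in rows if r[field] is not None]
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the observed values
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 1.5 x IQR fences (a convention)
    return {
        "n": len(values),
        "missing_fraction": 1 - len(values) / len(rows),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "outliers": [v for v in values if v < low or v > high],
    }


def summarize_categorical(rows, field):
    """Count how often each category occurs and how many values are missing."""
    counts = Counter(r[field] for r in rows if r[field] is not None)
    missing = sum(1 for r in rows if r[field] is None)
    return {"counts": dict(counts), "missing": missing}


tumor = summarize_numeric(records, "tumor_size_mm")
stage = summarize_categorical(records, "stage")
print(tumor)
print(stage)
```

In this made-up data, the 95.0 mm value is flagged as an outlier. Whether such a value is a data entry error or a real observation is a question to take back to how the data were collected.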

Fundamental Tips for Effective Data Exploration and Analysis

  • Talk to lots of different types of researchers who are doing different types of analysis. Your peers can help! You’ll learn how other people are using tools, or you’ll discover new tools. Collaborating may also help you formulate your hypotheses.
  • Learn different statistical and machine learning techniques. Basic techniques such as plotting data or simple linear regression can help you start exploring your data.
  • Read published papers in your field. This will give you a more well-rounded idea of what questions peers are asking, what tools they’re using, and which methodologies you might use when exploring and analyzing your own data.
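
As a starting point for the simple linear regression mentioned above, here is a minimal sketch that fits a straight line to two variables with ordinary least squares and reports the correlation coefficient. The data values and the function name fit_line are hypothetical, chosen only to show the arithmetic. A real analysis would also check the model’s assumptions and would typically use a statistics package.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Hypothetical paired measurements of two variables, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r = fit_line(x, y)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}")
```

A strong correlation in a fit like this is a reason to investigate further, not proof of a relationship. Study design, bias, and outliers can all produce or hide a trend.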

NCI Data Exploration and Analysis Resources and Initiatives

Now that you have a sense of the basics, use the following resources to discover more about the topic and understand NCI’s investment in this stage of the data science lifecycle.

Resources and Tools

  • Cancer Proteogenomic Data Analysis Site: Access data from the Clinical Proteomic Tumor Analysis Consortium in this web-based interactive platform for proteogenomic data analysis.
  • Seven Bridges Cancer Genomics Cloud: This is a good resource regardless of your cloud computing skills. Explore this user-friendly portal and find common, best-practice analysis methods at your fingertips! You can browse a variety of data sets or use your own along with publicly available data. Some data sets require you to complete an access process.
  • ISB Cloud: Through the ISB Cancer Gateway in the Cloud, you can access, explore, and analyze large-scale cancer data through Google Cloud. The resources on this platform enable you to run complex queries from R or Python scripts, or to use Dockerized workflows, on data available in Google Cloud Storage.
  • Broad Institute FireCloud: This platform provides workspaces for bringing together data and tools. Here you can store your private data and the outputs of your analysis. You can also launch interactive analytical applications from the dashboard’s analysis tab.
  • The Informatics Technology for Cancer Research Training Network offers a variety of courses.
    • In the course “Choosing Genomics Tools,” you’ll find resources and tools for processing and interpreting your data.
    • In the course “Computing for Cancer Informatics,” you’ll learn about shared computing resources, data sizes and computational capacity, computing resources designed for research, and more.

Additional Data Exploration and Analysis Resources

  • The National Center for Biotechnology Information has an archive of past webinars and workshops where you can find recordings of how to use tools for data analysis.
  • You can email Dr. Anne-Michelle Noone, mathematical statistician with the NCI Division of Cancer Control and Population Sciences and contributor to this training article, if you have questions about data exploration and analysis. 

 
