How to Use the Cancer Data Aggregator

What is the Cancer Data Aggregator?

The Cancer Data Aggregator (CDA) is a resource that lets you search for data across NCI’s Cancer Research Data Commons (CRDC). The CDA includes standardized and indexed terms from the Genomic Data Commons (GDC), Imaging Data Commons, Proteomic Data Commons (PDC), Integrated Canine Data Commons, and the Cancer Data Service.

Screenshot of the CDA interactive search, found under the Search Tool heading in the page navigation. The “Custom Search Builder” section holds a default filter with the following search parameters (from left to right): And, primary_diagnosis_condition, contains, adenocarcinoma. Only four rows and six columns of the results table are shown. The columns and their values are as follows: Column 1 - name: Subject_id, value: CPTAC.11LU022; Column 2 - name: data source, value: GDC IDC PDC; Column 3 - name: ethnicity, value: not reported; Column 4 - name: race, value: asian; Column 5 - name: sex, value: male; Column 6 - name: species, value: Homo sapiens. The four rows contain the same values for the first five columns.

If you are looking for a no-code solution, CDA’s interactive search table allows you to explore data sets (e.g., adenocarcinoma data) across the CRDC.

This accessible and easy-to-use tool allows you not only to collect data but also to explore and analyze it, making it a valuable asset if you’re following the cancer data science lifecycle in your research. You can find information using harmonized, common-language terms. You can then easily work with your search results in Excel, integrate them into a pipeline, or upload them to an NCI Cloud Resource.
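
If you do want to bring search results into a pipeline, the following minimal Python sketch shows one way to reproduce the “contains adenocarcinoma” filter on exported rows. It assumes you have saved your results as CSV text. The column names are modeled on the screenshot described above, the second row and all diagnosis values are made up for illustration, and the case-insensitive matching is an assumption rather than documented CDA behavior.

```python
import csv
import io

# Hypothetical export of CDA search results. Only the first row's subject
# values come from the screenshot description; the diagnosis values and the
# second row are invented for this example.
EXPORTED_CSV = """subject_id,data_source,race,sex,species,primary_diagnosis_condition
CPTAC.11LU022,GDC IDC PDC,asian,male,Homo sapiens,Adenocarcinoma
EXAMPLE.0001,GDC,not reported,female,Homo sapiens,Squamous cell carcinoma
"""


def filter_contains(rows, column, term):
    """Keep rows whose `column` contains `term`, ignoring case."""
    needle = term.lower()
    return [row for row in rows if needle in (row.get(column) or "").lower()]


# Parse the CSV text into one dictionary per row, keyed by column name.
rows = list(csv.DictReader(io.StringIO(EXPORTED_CSV)))
matches = filter_contains(rows, "primary_diagnosis_condition", "adenocarcinoma")

for row in matches:
    print(row["subject_id"], row["data_source"])
```

To use this with a real export, you would replace `io.StringIO(EXPORTED_CSV)` with an opened file and adjust the column names to match your download.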
 

Say you want to find data for your cancer research. What does that process look like with CDA?

Single image showing two screenshots. On the left: Shows the proteomics.datacommons.cancer.gov website with the PDC case ID for CPTAC.11LU022 entered in the search box (the case ID is “f1ed961a-cf1e-11e9-9a07-0a80fada099c”). Under the search box is the resulting pop-up with the case summary for the 11LU022 data, with tabs to display metadata for demographics, diagnosis, exposure, follow-up, and treatment. The active tab shows tables with the file count by experimental strategy and data category. On the right: Shows the gdc.cancer.gov website with the GDC case ID for CPTAC.11LU022 entered in the search box (the case ID for GDC is “8d1b1bb3-2633-4a22-a8fd-19c07931ea46”). Under the search box, three tables display case summary information, file counts by data category, and file counts by experimental strategy.

You can enter the subject ID (in the second-to-last column of the table) directly in the GDC (top) and PDC (bottom) portal search boxes to quickly find the data.

  1. You have an idea and want to see what data is available.
  2. You use CDA to search across all data commons.
  3. You use the IDs to:
    1. navigate to the data commons hosting the data you’d like to learn more about.
    2. visit dbGaP and begin the approval process for accessing that data (if any data are controlled access).
    3. transfer the data to ISB Cancer Genomics Cloud (ISB-CGC), Broad FireCloud powered by Terra, or Seven Bridges Cancer Genomics Cloud (SB-CGC) powered by Velsera to prepare for your analysis.
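
As a rough illustration of step 3, the sketch below takes the “data source” value for a subject and lists which data commons you might visit next. It assumes the value is a space-separated string of abbreviations, as in the “GDC IDC PDC” example from the search table. The function name and the fallback label are hypothetical, not part of any CDA tool.

```python
# Abbreviations as they appear in the "data source" column of the search
# table, mapped to the data commons named at the start of this article.
DATA_COMMONS = {
    "GDC": "Genomic Data Commons",
    "IDC": "Imaging Data Commons",
    "PDC": "Proteomic Data Commons",
}


def commons_for_subject(data_source):
    """Split a space-separated data source value into (abbreviation, name) pairs."""
    return [
        (abbr, DATA_COMMONS.get(abbr, "unknown data commons"))
        for abbr in data_source.split()
    ]


# Example value taken from the screenshot description for CPTAC.11LU022.
for abbr, name in commons_for_subject("GDC IDC PDC"):
    print(f"{abbr}: {name}")
```

Abbreviations that are not in the lookup table are kept and labeled as unknown rather than dropped, so nothing disappears silently from your list.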

What Do I Need to Know?

Fundamental Tips for Using the CDA Effectively

The CDA has many features that will help you with your cancer research.

CDA tools for beginners include:

More advanced CDA tools include:

And remember, the CDA Team is always available to help! Whether you need assistance running complex queries, guidance on writing code, or general advice, you can reach the CDA Team through the helpdesk or by email.


NCI CDA Resources and Initiatives

NCI-Supported Projects

You’ll find CDA contributions in other NCI-supported projects that may be helpful for your cancer research. Explore them to see whether you could benefit from these resources.

  • ISB-CGC is an NCI cloud resource that lets you access, explore, and analyze large-scale cancer data through the Google Cloud Platform. ISB-CGC provides mutation data to CDA, and CDA provides ISB-CGC with aggregated data from across the CRDC. Through ISB-CGC, you can access smaller, easier-to-use files that ISB-CGC has processed from the CRDC data commons. You can also find data through the CDA and then export it to ISB-CGC for analysis. Explore the ISB-CGC BigQuery Table Search to browse tables of metadata and molecular cancer data!
  • FireCloud and SB-CGC are NCI-funded cloud platforms you can use for data analysis. Once you’ve found the data you want in CDA, both platforms give you user-friendly access to a range of analysis pipelines. Just upload the identifiers you found in CDA to start your analysis.

Publications

If you’d like to learn more about the components of CRDC that make CDA possible, read this American Association for Cancer Research journal article about CRDC core standards and services.

Explore our Cancer Data Science Pulse blog on the CDA if you want to read about the background and development of the CDA.


Additional CDA Resources
