Cancer Data Science Pulse

The Cancer Genome Atlas (TCGA)—A Living Legacy for Cancer Research

If you’re studying the genes underlying cancer, you’re likely familiar with The Cancer Genome Atlas (TCGA). This landmark collection maps the genomic profiles of 33 cancer types and subtypes, including 10 rare cancers, from bladder and breast to pancreatic and uterine. These “atlases” reveal the molecular features associated with cancer and help to inform everything from basic research to drug development and precision medicine. Moreover, all TCGA data are in the public domain, allowing any interested researcher to access this information.

We’re commemorating National DNA Day on April 25 by celebrating this remarkable collection of data sets. See how TCGA’s collection of genomic data gives cancer researchers a lasting legacy that’s still very much alive today.

Is TCGA Still Collecting Data?

Although the TCGA program closed in 2018, its data remain an integral part of modern-day cancer genomics analyses. The latest version of the data, processed with continually updated, best-practice bioinformatics pipelines, can be found at NCI’s Genomic Data Commons (GDC). Moreover, researchers are actively characterizing and releasing data for newly sequenced whole genomes. This enables us to continue to use and build on TCGA to inform new research, leading to new data on the genes, proteins, pathways, and drivers underlying cancer.

Where Can I Find TCGA Data Today?

Nearly 2.5 petabytes of TCGA data are available through NCI’s GDC. Visit NCI’s Center for Cancer Genomics to learn more about the cancer types and criteria for TCGA’s data set and see citations for seminal studies.

What Type of Data Are in TCGA?

In TCGA, you’ll find:

  • whole genome sequence (WGS),
  • whole exome sequence,
  • methylation,
  • RNA expression,
  • microRNA,
  • assay for transposase-accessible chromatin with sequencing (ATAC-Seq),
  • reverse phase protein array (RPPA),
  • tissue slide images, and
  • clinical data sets.

How is TCGA Unique?

Team science led to the development of TCGA, with contributions from scientists at NCI and the National Human Genome Research Institute, along with thousands of researchers from institutions around the country. Together, this team developed TCGA’s technology, tools, and resources and carried out characterization of thousands of samples. Importantly, more than 11,000 patients contributed their samples to science. This open-science framework continues today.

Open data sharing means that anyone can access these data sets (from the lab next door to one around the world). This broad data sharing helped expand the usefulness of the data in TCGA, as researchers look for new ways to limit bias and make the data more applicable to more people.

TCGA’s collection also features “normal control” data. This means that patients gave blood or tissue samples taken from near the cancerous tumor, in addition to the tumor itself. Having normal samples offers a control, allowing researchers to examine the differences between normal and cancerous tissues.
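For readers working with sample lists, the tumor and normal samples described above can usually be told apart from the sample barcode. The following is a minimal sketch, assuming the commonly documented TCGA barcode convention in which the fourth hyphen-separated field starts with a two-digit sample type code (01 to 09 for tumor, 10 to 19 for normal). The function names are hypothetical and chosen only for illustration; check the GDC documentation before relying on the code ranges.

```python
from collections import defaultdict


def sample_type_code(barcode: str) -> int:
    """Return the two-digit sample type code from a TCGA sample barcode.

    Assumes the layout TCGA-<site>-<participant>-<sample+vial>-...,
    e.g. TCGA-02-0001-01C-01D-0182-01 (sample type code 01).
    """
    fields = barcode.split("-")
    if len(fields) < 4 or fields[0] != "TCGA":
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    return int(fields[3][:2])


def classify_sample(barcode: str) -> str:
    """Label a barcode as 'tumor', 'normal', or 'other' (assumed code ranges)."""
    code = sample_type_code(barcode)
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    return "other"


def matched_pairs(barcodes):
    """Group barcodes by patient and keep patients with both tumor and normal."""
    by_patient = defaultdict(lambda: {"tumor": [], "normal": []})
    for barcode in barcodes:
        kind = classify_sample(barcode)
        if kind in ("tumor", "normal"):
            # The first three fields identify the patient (participant).
            patient = "-".join(barcode.split("-")[:3])
            by_patient[patient][kind].append(barcode)
    return {p: g for p, g in by_patient.items() if g["tumor"] and g["normal"]}
```

Calling `matched_pairs` on a list of barcodes returns only the patients who have at least one tumor and one normal sample, which is the usual starting point for a tumor-versus-normal comparison.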

How Has TCGA Influenced the Cancer Research Field?

Before TCGA, it was difficult for researchers to assemble all the bits and pieces of data on the numerous biological processes associated with cancer. Once TCGA came to fruition, it revolutionized research on the molecular mechanisms underlying cancer.

If you use scholarly publications as a yardstick, you’ll find that PubMed features more than 29,000 papers with mention of TCGA. Last year alone, there were over 5,000 TCGA citations.

Tools are another good measure of how these data are impacting the field. Since TCGA’s start, researchers have built countless tools to help you navigate and analyze these data, many of them created by original TCGA team members. For example, cBioPortal provides an interface to help you analyze genetic and clinical data to study cancer and how it progresses over time. And of course, the most up-to-date version of the data (along with new visualization and cohort analysis tools) is available via NCI’s GDC.

How Does the GDC Help Me Use TCGA Data?

The GDC’s data portal gives you a full suite of web-based tools for studying TCGA data (along with other large-scale data sets), including building and comparing cohorts, examining mutation frequencies, visualizing gene expression clusters, and more. These analysis tools can help you access and use genomic data, no matter your skill level or experience. Alternatively, you can access these data in the cloud, letting you work with large data sets without needing to download and store those data.
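If you prefer scripting to the web portal, the GDC also offers a public REST API. The sketch below is illustrative only: it builds a query URL for the `files` endpoint that would list a few files from one TCGA project. The field names (`cases.project.project_id`, `data_category`), the example values (`TCGA-BRCA`, `Transcriptome Profiling`), and the helper function names are assumptions based on the GDC API’s documented filter syntax, so confirm them against the current GDC API guide before use. The code only constructs the request; it does not contact the server.

```python
import json
from urllib.parse import urlencode

# Public GDC API endpoint for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_filters(project_id: str, data_category: str) -> dict:
    """Build a GDC-style nested filter: project AND data category."""
    return {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {
                    "field": "cases.project.project_id",
                    "value": [project_id],
                },
            },
            {
                "op": "in",
                "content": {"field": "data_category", "value": [data_category]},
            },
        ],
    }


def build_query_url(project_id: str, data_category: str, size: int = 5) -> str:
    """Return a GET URL listing up to `size` matching files (metadata only)."""
    params = {
        "filters": json.dumps(build_filters(project_id, data_category)),
        "fields": "file_id,file_name,data_type",
        "format": "JSON",
        "size": str(size),
    }
    return f"{GDC_FILES_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URL; open it in a browser or fetch it with any HTTP client.
    print(build_query_url("TCGA-BRCA", "Transcriptome Profiling"))
```

Keeping the filter construction separate from the HTTP call makes it easy to inspect or reuse the query before sending anything over the network.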

And you don’t have to start from scratch. Scientists have used TCGA data to develop numerous pipelines and other methodologies for studying cancer. You can also find tools with shortcuts for processing and analyzing data from start to finish, such as the Multi-omics Pathway Workflow (MOPAW) and BigQuery.

Why is TCGA Data Particularly Good for Data Science?

TCGA’s tumor profiles provide a rich resource for exploring any number of topics of interest: from new drugs or biomarkers to new ways of diagnosing, preventing, and treating cancer. Perhaps one of the biggest impacts of TCGA data is in “pan-cancer” studies, that is, studies that examine multiple cancers at a time to reveal machinery that’s similar among many tumors, no matter the tissue or organ of origin.

I Want to Integrate My Data With TCGA’s Data for Analysis—What Format Do I Use?

The Cancer Genomics Cloud offers a detailed table of available TCGA data formats. Information on TCGA’s metadata is also available, including a separate listing for TCGA’s Genome Reference Consortium Human Build 38 (GRCh38) assembly.

What’s Next for TCGA?

TCGA may be a “closed” program, but the data are continuing to have a major impact on the cancer research field. New tools and technologies, such as artificial intelligence (AI), have the potential to transform cancer treatment and care.

For example, researchers are using data to train and refine AI models for diagnosing, predicting, and tracking cancer. AI’s especially promising for precision oncology—helping to predict how a patient will respond to treatment so clinicians can select the best treatment right from the start.

Importantly, with TCGA, researchers have the most vital commodity for moving the field forward: open access to data. By broadly sharing data, TCGA and other data sets like it give researchers the information they need to develop new and better ways of diagnosing, treating, and preventing cancer.

Jean C. Zenklusen, M.S., Ph.D.
Deputy Director, NCI Center for Cancer Genomics
Peggy Wang, Ph.D.
Former Program Specialist, NCI Center for Cancer Genomics