The Evolution of NCI’s Data Commons
CBIIT Director, Dr. Tony Kerlavage, sat down recently for a podcast examining the evolution of NCI’s Cancer Research Data Commons (CRDC).
The podcast, “Trends from the Trenches,” hosted by Bio-IT World, is the first in a series that offers an inside look at the science, technology, and executive trends driving the life sciences. It aired on February 22, 2022.
From 2006 to 2014, The Cancer Genome Atlas (TCGA) Program collected and stored data derived from a network of genomic studies. The data were accessible through a basic portal but had to be downloaded and stored locally for analysis. This system simply could not handle the onslaught of new raw data flooding into NCI. In fact, Dr. Kerlavage noted that it would take at least a month to download the entire 2.5 petabytes of TCGA data (at 10 Gbps), putting the data out of reach for all but the few large institutions capable of managing such massive downloads.
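A quick back-of-the-envelope calculation (a sketch, assuming decimal units and a sustained 10 gigabit-per-second link) shows why the download figure works out to roughly a month:

```python
# Rough check of the download-time claim: 2.5 PB over a 10 Gbps link.
DATA_PETABYTES = 2.5
LINK_GBPS = 10  # gigabits per second (note: bits, not bytes)

data_bytes = DATA_PETABYTES * 1e15        # 2.5 PB in bytes (decimal units)
bytes_per_sec = LINK_GBPS * 1e9 / 8       # 10 Gbps -> 1.25 GB/s

days = data_bytes / bytes_per_sec / 86_400
print(f"{days:.1f} days")                 # roughly 23 days of continuous transfer
```

In practice, protocol overhead, shared bandwidth, and transfer interruptions would stretch this well past the ideal figure, hence "at least a month."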
“We felt that it was vital to democratize the data,” he said, “and to bring the tools to the data instead of the data to the tools.” This would enable NCI to open the data to new audiences—from bench scientists to computational biologists. The goal was to move the data to the cloud and create an environment that co-located tools and data and that would help move cancer research forward.
This ambitious project called for a wide range of expertise. NCI tapped specialists from the Broad Institute, the Institute for Systems Biology, and Seven Bridges Genomics to develop cloud-based resources. The groups offered different approaches to deploying data in a cloud environment on both Google (GCP) and Amazon (AWS) platforms.
At the same time, different types of data were being generated, including proteomics and imaging data (both radiologic and whole slide pathology images). Being able to combine that information with genomics data offered a new way to look at cancer more holistically. Noted Dr. Kerlavage, “This led to the creation of a ‘commons of commons,’ with genomic, proteomic, imaging, and even canine data repositories.”
“Now, after several years of evolution, we’re continuing to harden the infrastructure. We’re strengthening and standardizing security protocols and methods for indexing data files to conform to NIH-wide and international standards, and we’re working on data harmonization processes to allow us to launch queries across multiple commons,” said Dr. Kerlavage.
What’s on the horizon for the cloud? According to Dr. Kerlavage, one major hurdle lies in the research paradigm itself. Data need to be interoperable, but retrospective data are very hard to integrate. Researchers need to think differently about their data right from the start, ensuring they conform to format and quality standards, and understand those data have value long after the study findings are published.
The advent of new, powerful tools (e.g., machine learning and artificial intelligence) has accelerated research but also exposed data’s shortcomings. “Simply collecting data isn’t enough,” he said, adding, “We need to be sure that data will work in an open architecture.”
Data also need to be well documented. He noted that sample size, statistical significance, missing data, data format, and whether the data are diverse and fully representative of the population all need to be considered for data to have the greatest scientific value.
In closing, Dr. Kerlavage said, “Each plateau in a data commons’ development teaches us new things and continues to pull the research process forward.” Ultimately, he noted, the commons will rise to meet the growing needs of the full cancer research community, giving access to highly valuable scientific data.