Cancer Data Science Pulse
NCI Brings Cancer Data to Life
What if Data Could Talk?
To celebrate International Love Data Week 2023, we’re letting our favorite NCI genomic data point share the story of how NCI is bringing cancer data to life to drive insights for cancer research.
Read Datum's Origin Story
My journey started in a teaching hospital, where my patient first learned she might be a good candidate for a clinical trial examining a new treatment to slow the course of non-small cell lung cancer. She gave a tiny tissue sample, and a whole host of data was born. I’m one of those data points.
You can call me Datum. Alone I’m not very powerful, but grouped with others like me, I can tell a lot about a person.
The scientists first learned about me when they deciphered my patient’s DNA. Using a collection of high-tech methods called next generation sequencing (or NGS), they started by copying my patient’s DNA over and over again, creating short sequences, called “reads.” By piecing the short reads together and comparing them with a DNA library reference, the scientists were able to determine my patient’s genome. I’m a reflection of that genome—I help tell my patient’s genetics story.
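For readers who like to see the idea in code, here is a toy sketch of how short reads can be lined up against a reference and compared with it. It is only an illustration, not the pipeline the scientists used: real NGS aligners and variant callers are far more sophisticated. The function names, the 12-letter reference, and the four reads are all invented for this example.

```python
from collections import Counter


def best_position(reference: str, read: str) -> int:
    """Return the reference offset where the read has the fewest mismatches."""
    def mismatches(offset: int) -> int:
        window = reference[offset:offset + len(read)]
        return sum(a != b for a, b in zip(window, read))

    # Try every offset where the whole read fits on the reference.
    return min(range(len(reference) - len(read) + 1), key=mismatches)


def call_variants(reference: str, reads: list[str]) -> list[tuple[int, str, str]]:
    """Pile reads onto the reference and report where the majority base differs."""
    pileup = [Counter() for _ in reference]
    for read in reads:
        start = best_position(reference, read)
        for i, base in enumerate(read):
            pileup[start + i][base] += 1

    variants = []
    for pos, counts in enumerate(pileup):
        if not counts:
            continue  # no read covers this position
        consensus = counts.most_common(1)[0][0]
        if consensus != reference[pos]:
            variants.append((pos, reference[pos], consensus))
    return variants


# Invented data: the reads come from a sequence that differs from the
# reference at one position (index 7, T in the reference, A in the reads).
reference = "ACGTACGTTAGC"
reads = ["ACGTAC", "GTACGA", "CGATAG", "ATAGC"]

print(call_variants(reference, reads))
```

Each tuple in the result gives a position, the reference base, and the base most reads agree on. A data point like me is, roughly speaking, one such observation.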
To get a truly complete picture of her genetics, however, the scientists also needed to look closely at some of my other gene-data relatives (RNA) and the products that result from gene expression (proteins). All this genetic information works together, so teasing it apart helps scientists figure out what might have made my patient more vulnerable to disease and offers clues to a possible treatment.
For example, alterations in DNA might be tracked to mutations, or changes, in genes associated with a higher risk for cancer development or progression. Scientists use RNA sequencing to look more closely at these mutations and other changes that take place as the gene is transcribed into RNA and then translated into the proteins that act throughout the body. Protein data have their own story to tell. A malfunction in proteins can be a major contributor to cancer. So knowing how all these data work together is vital to see just where my patient’s body system went awry. This information also helps determine if there is a place in the gene-to-protein pipeline that could be targeted for treatment. In some cases, turning off (or silencing) a gene can disrupt how a protein is made, scaling back production and reducing the risk for disease.
This is the heart of precision medicine. Not too long ago, doctors thought there was only one type of non-small cell lung cancer. So every patient was given the exact same treatment. Now, with the advent of genetic testing, scientists have identified thousands of different mutations.
Data from those mutations showed that there are at least nine different types of non-small cell lung cancer, which are very different from one another, and each type can now be treated based on the specific type of mutation.
Who knows, maybe I can help unlock another new treatment. Or perhaps I can help identify patients who are susceptible to disease and help mitigate that risk. Already that one tiny sample from my patient has culminated in millions of data points. Some are just like me with their roots in genetics. Others are information on her family history of disease, her age, gender, even things about the environment in which she lives that might make her more (or less) susceptible to disease.
The scientists who generated me knew I had a lot to offer to other studies and other researchers. They also knew they needed more computational power to make the best use of data like me. That’s why I was submitted to NCI.
My first stop was a quick upload to a supercomputer called Biowulf. There, a group of scientists from the Computational Genomics and Bioinformatics Team put me through my paces, grooming me and getting me ready for analysis. They checked me for bias and errors, ensured I fit within an approved reference genome and workflow, and basically made certain I am who they say I am—a validated representative of my patient’s type of cancer.
Next, with some advice from NCI's Office of Data Sharing, I was uploaded into a data infrastructure spearheaded by the Data Ecosystems Team. This particular collection of repositories is called the Cancer Research Data Commons (CRDC). I was targeted to CRDC’s Genomic Data Commons.
In passing through security, I learned quickly that NCI's Office of the Chief Information Officer has strict rules about data like me. I had to adhere to policies that not only keep me safe and secure but which also allow me to be freely shared with many different labs. For example, I’m no longer directly affiliated with my patient now. Instead, I have a unique identifier assigned just to me. This protects my patient’s identity as her data go far beyond the original clinical study for use by scientists around the world.
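To give a feel for what swapping a patient link for a unique identifier can look like, here is a minimal sketch. It is an assumption-laden illustration, not NCI's actual procedure: the field names (`patient_name`, `medical_record_number`, `case_id`), the `deidentify` helper, and the sample records are all hypothetical.

```python
import uuid

# Hypothetical field names that would point directly at a person.
DIRECT_IDENTIFIERS = {"patient_name", "medical_record_number"}


def deidentify(record: dict, id_map: dict) -> dict:
    """Return a copy of the record with direct identifiers replaced by a random ID.

    id_map remembers which random ID belongs to which patient so that all of
    one patient's records share the same ID. In this sketch the map would stay
    with the original study team and would never be shared.
    """
    patient_key = record["medical_record_number"]
    if patient_key not in id_map:
        id_map[patient_key] = str(uuid.uuid4())

    shared = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    shared["case_id"] = id_map[patient_key]
    return shared


# Made-up records for illustration only.
id_map = {}
first = deidentify(
    {"patient_name": "Jane Doe", "medical_record_number": "MRN-001", "gene": "EGFR"},
    id_map,
)
second = deidentify(
    {"patient_name": "Jane Doe", "medical_record_number": "MRN-001", "gene": "KRAS"},
    id_map,
)
print(first["case_id"] == second["case_id"])
```

The random identifier carries no information about the patient, yet records from the same person can still be analyzed together.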
When I first arrived at NCI, I was amazed at the numbers of other data I saw—terabytes of them. Many were just like me. Others were gleaned from histology slides and other types of imaging and lab data. Another group described patients’ clinical characteristics and family histories of disease.
As you can imagine, making sense of all these data presents a challenge. Fortunately, researchers have a lot of tools to help manage this task, like the ones from the Informatics and Data Science Program. Through this group, researchers can find applications to help in their studies, and training is available from the Computational Genomics and Bioinformatics Team too.
NCI also has found ways to make it easier for researchers to integrate me with data from other studies. Data sets can vary quite a bit. We often don’t speak the same language. We all have valuable information, but if we can’t be combined, we don’t help anyone. So I was assigned specific standardized terms and codes based on semantics developed by NCI's Enterprise Vocabulary Services and the Cancer Data Standards Registry and Repository. There’s also a website of Common Data Elements to allow diverse data sets to be grouped and easily compared.
All these measures help ensure I’m kept FAIR (findable, accessible, interoperable, and reusable), in line with widely adopted guidelines for managing scientific data.
Now, I’m living in the cloud. NCI manages two cloud-based communities that allow for storage and high-power computing. Working in this cloud environment gives researchers greater capacity for analysis, and they don’t have to worry about downloading large data sets. It’s a safe, organized, and highly functional community. Being here helps keep my versions straight so researchers don’t make any mistakes as I’m being shared with one lab and then another. I’ve also had the chance to work with many different tools, which are being updated all the time. This ensures I’ll be accessible and sharable today and in the future. This is particularly important as new technology is developed, especially with the adoption of new and better forms of artificial intelligence.
Since coming to NCI, I’ve been really busy. My data set must have been pulled into a dozen different analyses in just the past week. One study used me to help train and evaluate a new model that would allow scientists to identify people at risk for my patient’s type of cancer. Another study used me, along with clinical data, to develop a way to predict which patients were most likely to survive lung cancer. I’m really excited about what they found because it means we can improve the clinical outcomes of patients I’ve never even met. I hear they’re going to include me in publications too. So now the world will know about me!
I like the idea of being published. I’ve taken part in histograms, scatter plots, bar plots, box plots, and a bunch of heatmaps. But really, this just scratches the surface. Today, I heard I’m joining researchers who are conducting a series of clinical trials on precision medicine as part of the Clinical and Translational Research Informatics Team. I’ll be working alongside clinical data to help identify treatments that can be specifically tailored to individual cancer patients’ needs.
When I began this journey, I thought I might make a difference to my patient. But now, it’s clear that I have the potential to help hundreds, perhaps thousands, of patients. I may be small, but I’m going to have a big impact on cancer research!
What's Next For Datum?
Get a sneak peek at what Datum and NCI are working on in the field of machine learning.
Iris Smith on September 20, 2022 at 10:55 p.m.