Cancer Data Science Pulse

The Center for Biomedical Informatics and Information Technology Brings Data to Life

What if data could talk? In a sense, they do. Data generate massive amounts of information. For NCI, these mountains of information offer rock-solid facts that are helping researchers better understand how cancer develops, why some people are at greater risk for disease, and which treatments will have the most success. Such information hinges on reliable data; that is, data need to be error-free, easily shared, and readily available. As an example of what it takes to make data reliable, we’re introducing “Datum.” This single speck of data was conceptualized to show how NCI’s Center for Biomedical Informatics and Information Technology (CBIIT) supports cancer research by bringing data to life.
Cartoon depiction of "Datum", a genomic data point from NCI's Genomic Data Commons. He's blue, with big eyes and a friendly smile. Surrounding him are zeros and ones, representing his place among the massive amounts of data collected for cancer research.

My journey started in a teaching hospital, where my patient first learned she might be a good candidate for a clinical trial examining a new treatment to slow the course of non-small cell lung cancer. She gave a tiny tissue sample, and a whole host of data was born. I’m one of those data points.

You can call me Datum. Alone I’m not very powerful, but grouped with others like me, I can tell a lot about a person.

The scientists first learned about me when they deciphered my patient’s DNA. Using a collection of high-tech methods called next-generation sequencing (or NGS), they started by copying my patient’s DNA over and over again, creating short sequences called “reads.” By piecing the short reads together and comparing them with a reference DNA library, the scientists were able to determine my patient’s genome. I’m a reflection of that genome, and I help tell my patient’s genetic story.

To get a truly complete picture of her genetics, however, the scientists also needed to look closely at some of my other gene-data relatives (RNA) and the products that result from gene expression (proteins). All this genetic information works together, so teasing it apart helps scientists figure out what might have made my patient more vulnerable to disease and offers clues to a possible treatment.

For example, alterations in DNA might be traced to mutations, or changes, in genes associated with a higher risk for cancer development or progression. Scientists use RNA sequencing to look more closely at these mutations and other changes that take place as the gene is transcribed into RNA and then translated into proteins throughout the body. Protein data have their own story to tell. A malfunction in proteins can be a major contributor to cancer. So knowing how all these data work together is vital to seeing just where my patient’s body systems went awry. This information also helps determine if there is a place in the gene-to-protein pipeline that could be targeted for treatment. In some cases, turning off (or silencing) a gene can disrupt how a protein is made, scaling back production and reducing the risk for disease.

This is the heart of precision medicine. Not too long ago, doctors thought there was only one type of non-small cell lung cancer. So every patient was given the exact same treatment. Now, with the advent of genetic testing, scientists have identified thousands of different mutations.

Data from those mutations showed that there are at least nine different types of non-small cell lung cancer, which are very different from one another, and each type can now be treated based on its specific type of mutation.

Who knows, maybe I can help unlock another new treatment. Or perhaps I can help identify patients who are susceptible to disease and help mitigate that risk. Already that one tiny sample from my patient has culminated in millions of data points. Some are just like me with their roots in genetics. Others are information on her family history of disease, her age, gender, even things about the environment in which she lives that might make her more (or less) susceptible to disease.

The scientists who generated me knew I had a lot to offer to other studies and other researchers. They also knew they needed more computational power to make the best use of data like me. That’s why I was submitted to NCI’s CBIIT.

CBIIT Bound

My first stop was a quick upload to a supercomputer called Biowulf. There, a group of scientists from CBIIT’s Computational Genomics and Bioinformatics Team put me through my paces, grooming me and getting me ready for analysis. They checked me for bias and errors, ensured I fit within an approved reference genome and workflow, and basically made certain I am who they say I am—a validated representative of my patient’s type of cancer.

Next, with some advice from CBIIT’s Office of Data Sharing, I was uploaded into a data infrastructure spearheaded by CBIIT’s Data Ecosystems Team. This particular collection of repositories is called the Cancer Research Data Commons (CRDC). I was targeted to CRDC’s Genomic Data Commons.

In passing through security, I learned quickly that CBIIT’s Infrastructure and Information Technology Operations Branch has strict rules about data like me. I had to adhere to policies that not only keep me safe and secure but also allow me to be freely shared with many different labs. For example, I’m no longer directly affiliated with my patient. Instead, I have a unique identifier assigned just to me. This protects my patient’s identity as her data go far beyond the original clinical study for use by scientists around the world.

When I first arrived at CBIIT, I was amazed at the sheer volume of other data I saw: terabytes of them. Many were just like me. Others were gleaned from histology slides and other types of imaging and lab data. Another group described patients’ clinical characteristics and family histories of disease.

As you can imagine, making sense of all these data presents a challenge. Fortunately, researchers have a lot of tools to help manage this task, like the ones from CBIIT’s Informatics and Data Science Program. Through this group, researchers can find applications to help in their studies, and training is available from CBIIT’s Computational Genomics and Bioinformatics Team too.

CBIIT also has found ways to make it easier for researchers to integrate me with data from other studies. Data sets can vary quite a bit. We often don’t speak the same language. We all have valuable information, but if we can’t be combined, we don’t help anyone. So I was assigned specific standardized terms and codes based on semantics developed by CBIIT’s NCI Enterprise Vocabulary Services and the Cancer Data Standards Registry and Repository. There’s also a website of Common Data Elements to allow diverse data sets to be grouped and easily compared.

All these measures help ensure I’m kept FAIR (findable, accessible, interoperable, and reusable), in line with widely adopted guidelines for data.

Now, I’m living in the cloud. CBIIT manages two cloud-based communities that allow for storage and high-power computing. Working in this cloud environment gives researchers greater capacity for analysis, and they don’t have to worry about downloading large data sets. It’s a safe, organized, and highly functional community. Being here helps keep my versions straight so researchers don’t make any mistakes as I’m being shared with one lab and then another. I’ve also had the chance to work with many different tools, which are being updated all the time. This ensures I’ll be accessible and sharable today and in the future. This is particularly important as new technology is developed, especially with the adoption of new and better forms of artificial intelligence.

CBIIT and Beyond

Since coming to CBIIT, I’ve been really busy. My data set must have been pulled into a dozen different analyses in just the past week. One study used me to help train and evaluate a new model that would allow scientists to identify people at risk for my patient’s type of cancer. Another study used me, along with clinical data, to develop a way to predict which patients were most likely to survive lung cancer. I’m really excited about what they found because it means we can improve the clinical outcomes of patients I’ve never even met. I hear they’re going to include me in publications too. So now the world will know about me!

I like the idea of being published. I’ve taken part in histograms, scatter plots, bar plots, box plots, and a bunch of heatmaps. But really, this just scratches the surface. Today, I heard I’m joining researchers who are conducting a series of clinical trials on precision medicine as part of CBIIT’s Clinical and Translational Research Informatics Team. I’ll be working alongside clinical data to help identify treatments that can be specifically tailored to individual cancer patients’ needs.

When I began this journey, I thought I might make a difference to my patient. But now, it’s clear that I have the potential to help hundreds, perhaps thousands, of patients. I may be small, but I’m going to have a big impact on cancer research!

Datum
Genomic data point, NCI's Genomic Data Commons, Cancer Research Data Commons, CBIIT