Cancer Data Science Pulse
For the Love of . . . Data! Dr. Jerry Li Describes the Next Data Revolution
We’re continuing our recognition of NCI’s 50th anniversary by asking NCI and CBIIT staff what data means to them and to the field of cancer research. In this post, we feature Jerry Li, M.D., Ph.D., program director, Division of Cancer Biology, NCI.
What made you fall in love with data?
Looking back on my career, it’s pretty easy to pinpoint when I first recognized the importance of data. I did my post-doctoral training at NCI, where we were using mouse models to better understand cancer development. The project I worked on sought to identify oncogenes related to leukemogenesis, that is, the genes underlying leukemia. That was in the mid-90s. At the time, there was no such thing as “big data” or “big science.” We didn’t know the genomic sequences for mice or for humans. And identifying oncogenes was a truly arduous process.
Only a handful of oncogenes associated with cancer development had been discovered. Moreover, it took a long time to identify a single oncogene, usually a decade or even longer. When one of these discoveries was made, it was a very big deal, with publication in a top-tier journal and much recognition. It involved a lot of time, a lot of people, and a lot of work.
Using a mouse model, we searched for genes with a common viral insertion site (that is, the virus frequently inserted into the same genomic region of the mouse). If the virus insertion site is close to an oncogene, it could cause increased expression of the gene and cancer progression. It sounds simple, but once we identified the virus insertion site, it took a long time to identify all the genes that were close to the region and confirm the ones affected by the insertion. And, of course, we didn’t have whole-genome sequences to use for reference.
The genes in the vicinity of the viral insertion site are candidates for the discovery of oncogenes associated with leukemogenesis.
But imagine if we had already known the genomic sequences. Then we wouldn’t have had to spend 80% of our time trying to find the genomic sequences flanking the virus integration site. At a glance, we’d have known what the potential oncogenes were. But at that time, the thought of having the whole genomic DNA sequence was like science fiction.
Today, with the sequencing of the genome, we have that power. We have that data. It’s routine practice now. I can’t imagine going back to that time. This is what I think about when I think about the importance of data.
What do you think has been the single greatest accomplishment in data science over the past 50 years of cancer research?
Without a doubt, it’s the sequencing of the genome through the Human Genome Project (HGP). That led to high-quality data and the strong reference data we needed for research. It’s so important to have reliable data. I was fortunate to have a part in this. Following my work at NCI, I joined Celera Genomics and worked on the HGP from 1999 to 2002. We completed the first draft of the human and mouse genomes in 2000. To me, doing research without these data is unthinkable. It was a quantum leap that led to new discoveries in labs across the world.
It’s ironic too, given the impact of this sequencing, that early on there was a lot of debate about whether mapping the genome was even worth it. But this advance in big data has since led to big science, giving researchers the resources they need to make tremendous progress in biomedical research. Even beyond science, in daily life, it’s led to changes that have touched many people’s lives. You can even use these data to trace your ancestry.
As NCI embarks on the next 50 years, can you offer any practical tips or advice that should be considered?
The HGP was unique in that it was truly a public-private partnership. By pooling those collective resources, we generated a wide range of innovations—in hardware, equipment, sequencing technologies, sample preparations, and more. All of these innovations helped to map the genome. Advances in computational methods were equally important, enabling us not only to generate the data but also to piece it all together and make sense of it.
Where do we go next? I think that, as big as the HGP’s impact was on science, artificial intelligence (AI) has the potential to revolutionize scientific discovery even more over the next 50 years. It’s not brand new; in fact, it’s been around for decades. But AI’s progress has gained momentum quickly in just the past 10 years, for three key reasons. First, there have been advances in producing fast computer chips such as GPUs. Second, there are new algorithm designs in the form of artificial neural networks, giving us new ways of processing data and making predictions. For example, even now we can use AI to interpret radiographic images and to predict genomic mutations. And third, and probably most important in biomedical research, we have a lot of data. We can use that data to train AI applications and to further refine these processes.
Today, we have access to faster computer chips and new software. We can generate tremendous amounts of data. The next innovation needs to make that data relevant and available to the biomedical field.
We need to train AI systems and make the data useful. We need to get on the “AI-revolution train” and take advantage of this fast-moving area of research. Then, we’ll be able to take the next quantum leap in cancer research and care. That’s the next challenge and the next opportunity facing NCI.