Cancer Data Science Pulse

For the Love of … Data! Drs. Kibbe and Almeida Discuss How Data Help Reveal Our Natural World

We’re continuing the celebration marking the 50th anniversary of the National Cancer Act and recognizing the Power of Data by asking NCI and CBIIT staff what data means to them and to the field of cancer research. In this post, we feature former NCI Deputy Director and CBIIT Director, Warren Kibbe, Ph.D. Dr. Kibbe is currently Vice Chair, Department of Biostatistics and Bioinformatics, and chief data officer at the Duke Cancer Institute. He also continues to support NCI as a researcher, by serving on numerous advisory panels, and through the Childhood Cancer Data Initiative. Also featured is Jonas Almeida, Ph.D., chief data scientist and senior investigator within NCI’s Division of Cancer Epidemiology and Genetics. Look for more perspectives on the power of data in the next blog!


What made you fall in love with data?

Warren: I’m not sure I can point to any one thing. For me it is similar to my love of science. There is a power in the scientific method to uncover the nature of our world, our universe, and ourselves. And for me, the scientific questions I am most interested in all revolve around the nature of living things and, of course, understanding the nature of cancer and how to reduce our risk, detect it earlier, identify the proper therapy, develop new therapies, and optimize our quality of life during survivorship. Answers to all of these questions need data!

Jonas: I share a similar view of using science to know more about our natural world. For me, this started when I was very young. I grew up in Angola in a place where animals appeared to want to “tell you stories.” In fact, I was told as a child that the only reason the animals didn’t actually talk was that “I didn’t listen.” I was particularly intrigued by an ant colony where all sorts of dramatic things occurred throughout the day. Yet sometimes, the ants were nowhere to be found. I had recently learned to write my numbers (this was a long time ago) and decided that every hour I would count how many ants walked across a given path. To my huge surprise, their wanderings kept a schedule, and I could trust my notes to tell me when and where to find them. I will never forget how surprised I was at that discovery. And many years later, when told in school that biology was too complex to be reproducible, I knew that wasn’t true. It can be solved through science using data!


Figure description: A circular plot with six wedges, one per mutation type. The dots in each wedge show the cancer types of the TCGA tumor samples that exhibit that mutation, and the number of mutations (mutations/Mb) increases as the dots move toward the outer edge of the wedge.
Wedge 1, Mutation C->T (from outer edge to center): Melanoma, Lung Adenocarcinoma, Acute Myeloid Leukemia, Breast Cancer, Prostate Cancer, Thyroid Cancer
Wedge 2, Mutation C->A: Lung Adenocarcinoma, Lung Squamous Cell Carcinoma, Head and Neck Cancer, Ewing Sarcoma, Chronic Lymphocytic Leukemia, Ovarian Cancer, Thyroid Cancer, Diffuse Large B-cell Lymphoma
Wedge 3, Mutation Tp*A->T: Acute Myeloid Leukemia, Ovarian Cancer, Chronic Lymphocytic Leukemia, Diffuse Large B-cell Lymphoma
Wedge 4, Mutation misc: Acute Myeloid Leukemia, Esophageal Adenocarcinoma, Bladder Cancer, Breast Cancer, Lung Adenocarcinoma, Lung Squamous Cell Carcinoma, Glioblastoma Multiforme, Ewing Sarcoma, Ovarian, Low-grade Glioma, Chronic Lymphocytic Leukemia
Wedge 5, Mutation *CpG->T: Colorectal, Esophageal Adenocarcinoma, Low-grade Glioma, Stomach, Lung Adenocarcinoma, Head and Neck, Ewing Sarcoma, Kidney Clear Cell, Glioblastoma Multiforme, Pancreas, Prostate, Rhabdoid Tumor, Kidney Papillary Cell
Wedge 6, Mutation Tp*C->mut: Bladder, Cervical, Head and Neck, Lung Squamous Cell Carcinoma, Breast Cancer, Prostate, Head and Neck, Ovarian, Kidney Clear Cell, Kidney Papillary Cell, Melanoma, Pancreas

This figure shows the mutational spectrum of a sample of tumors from TCGA on a circular plot. By visualizing data in this way, researchers can see the natural groupings of tumors according to where they fall along a mutational spectrum. (Lawrence MS, Stojanov P, Polak P, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499(7457):214-218. doi:10.1038/nature12213.)

What do you think has been the single greatest accomplishment in data science over the past 50 years of cancer research?

Warren: You’re right that data science’s roots extend back quite some time. Although Ada Lovelace is sometimes called the first computer scientist, she was also a data scientist and she lived in the early 1800s. Some 50 years later, Florence Nightingale used powerful data visualization techniques and infographics to explain the causes of mortality resulting from poor medical practices during a military campaign. That was more than 150 years ago! Moving into the last 50 years, there are so many advances to point to, it is hard to pick out just one thing. What I particularly like are combinations of elegant experiments and elegant data, with elegant visualizations that make really large datasets easily understandable. Like Manhattan plots for understanding GWAS data and the associations of specific point mutations with aggressive disease. Or the plot of mutation frequency across many cancers like The Cancer Genome Atlas (TCGA) analysis published in 2013. (See figure.)

Jonas: I also think that TCGA was a very significant advance. Before TCGA we had to scramble to assemble bits and pieces of data on the many biological processes associated with the development of tumors. With TCGA, we have a collection of data with all those pieces, for the same individual, for thousands of volunteers. Thanks to this data we are able to help so many people with so many different types of tumors.


As NCI embarks on the next 50 years, can you offer any practical tips or advice that should be considered?

Warren: Good data, good models, good experiments—all go hand in hand. That is true regardless of the area of research. Well-designed experiments that generate great data can be incredibly explanatory. And new techniques, new technologies, new analytic methods can change the types of questions we can ask, the quality of the data we can generate, and the biology we can explore. We need to always reassess and re-evaluate. The future is now! I hope that within another 50 years we’ll have better ways of preventing cancer, detecting it earlier, and treating it much more effectively. Ultimately, cancer will no longer be listed as a “leading cause of death” by the Centers for Disease Control and Prevention.

Jonas: We often apply quantitative methods, mathematical and computational alike, that were initially developed for use in other disciplines to a particular research endeavor. Fortunately, with the ubiquity of the Cloud and web computing, we’re increasingly able to reconfigure computational labs on the fly. I think we need to remove as many obstacles as possible and promote more computational reusability by following guidelines such as the FAIR principles (that is, data are findable, accessible, interoperable, and reusable). I think we should take them seriously and, in all cases, avoid the short-sighted expediency of data silos and closed-source applications.

