Cancer Data Science Pulse

Five Data Science Technologies Driving Cancer Research

To commemorate the National Cancer Act’s 50th anniversary, we’ve pulled together five data science technologies poised to make a difference in how cancer is diagnosed, treated, and prevented.

Scroll through the infographic below and see if you agree. Some tools are still in their infancy and not quite ready for prime time, whereas others already are revolutionizing research—helping to reveal the underpinnings of cancer and guiding highly tailored precision medicine for more effective treatments.

We’d love to hear your ideas too! Leave your feedback about these technologies or others you're using in the comment form below.

Combining a single cancer patient's diagnostic and clinical data can produce roughly 1 terabyte of biomedical data. According to GLOBOCAN 2018 estimates, there are about 18.1 million new cancer cases worldwide each year, so at roughly 1 terabyte per patient, cancer-related research could generate on the order of 18 exabytes of data annually.

Researchers currently use this data to analyze the disease on three levels:

Cellular - Researchers typically look for particular patterns in the data to reveal genetic biomarkers that can help predict tumor mutations and guide drug therapy.

Patient - Researchers can use knowledge of tumor and gene type to determine the best treatments for individual patients based on their medical history and DNA data.

Population - Treatment options for cancer patients vary based on lifestyle, region, and type of cancer.

Genome sequencing is one of the most common ways to study cancer, during which we analyze the DNA sequence of a single, homogeneous, or heterogeneous tumor sample.
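As a toy illustration of what this kind of sequence comparison might look like (not any specific pipeline), here is a minimal Python sketch that compares a tumor-derived sequence against a reference and reports candidate point mutations. The sequences and positions are made up for the example; real variant calling involves alignment, quality scores, and statistical models.

# Toy sketch: flag positions where a tumor-derived sequence differs
# from a reference sequence, base by base.

def candidate_point_mutations(reference: str, tumor: str):
    """Return (position, reference_base, tumor_base) for each mismatch."""
    variants = []
    for pos, (ref_base, tum_base) in enumerate(zip(reference, tumor)):
        if ref_base != tum_base:
            variants.append((pos, ref_base, tum_base))
    return variants

# Hypothetical example sequences, purely illustrative.
reference = "ACGTACGTACGT"
tumor = "ACGTACCTACGA"

for pos, ref_base, tum_base in candidate_point_mutations(reference, tumor):
    print(f"position {pos}: {ref_base} -> {tum_base}")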

The real challenge with data at this scale is finding the right tools to store it over the long term and to process, analyze, and visualize it.

Although exact retention and deletion policies for these data sets are not always specified, their existence demands a way to archive and keep them indefinitely.

Data collection/transport: SFTP, Hadoop DistCp, Apache NiFi

Processing/analysis: Apache Hadoop/MapReduce, Apache Hive, Apache Spark with its ML library (MLlib), R packages on top of Spark, Vertica, and ETL tools (see the PySpark sketch after this list)

Visualization: Tableau, MicroStrategy, D3.js.
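To make the processing step concrete, here is a minimal PySpark sketch. It assumes a hypothetical CSV of variant calls with patient_id, gene, and mutation columns; the file path and column names are illustrative, not from any specific data set.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session; in practice this would run on a cluster
# sized for terabyte-to-petabyte workloads rather than locally.
spark = SparkSession.builder.appName("variant-summary").getOrCreate()

# Hypothetical input: one row per observed mutation, with columns
# patient_id, gene, and mutation. Path and schema are illustrative only.
variants = spark.read.csv("variant_calls.csv", header=True, inferSchema=True)

# Count how many distinct patients carry a mutation in each gene,
# a simple example of the cellular-level pattern finding described above.
gene_counts = (
    variants
    .groupBy("gene")
    .agg(F.countDistinct("patient_id").alias("patients_with_mutation"))
    .orderBy(F.desc("patients_with_mutation"))
)

gene_counts.show(10)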
Thanks
Thank you for taking the time to share your thoughts. We enjoyed reading your comment! The vast and growing amount of biomedical information we generate can be a boon and a burden without the infrastructure to maintain it. This infrastructure not only needs to manage, store, and process petabytes of information but also needs to make that data accessible to researchers regardless of the computational capacity of their own machine. That is one of the reasons we are excited about the possibilities of the NCI Cancer Research Data Commons. The Genomic Data Commons alone sees on average 70,000 users accessing and analyzing 2 petabytes of data monthly. Through its cloud platform, the CRDC has an opportunity to make many types of data and analytical tools available to the research community. With all the data harmonized on the same platform, those different kinds of research data you mentioned could be combined for integrative analysis. The challenges with big biomedical data are a moving target, but we are making strides in developing a scalable infrastructure that can grow with our researchers’ needs.
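For readers who want to explore the Genomic Data Commons mentioned above, the GDC exposes a public REST API. The sketch below assumes the api.gdc.cancer.gov projects endpoint and its documented query parameters; treat it as a starting point rather than a definitive client, since available fields and response details can change.

import requests

# Query the public GDC API for a handful of projects. The "size",
# "fields", and "format" parameters follow the GDC REST API docs;
# adjust them as needed for your own exploration.
response = requests.get(
    "https://api.gdc.cancer.gov/projects",
    params={"size": 5, "fields": "project_id,name,primary_site", "format": "json"},
    timeout=30,
)
response.raise_for_status()

# The GDC wraps results in a "data" object containing "hits".
for hit in response.json()["data"]["hits"]:
    print(hit.get("project_id"), "-", hit.get("name"))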