Cancer Data Science Pulse

The Cancer Data Science Pulse blog provides insights on trends, policies, initiatives, and innovation in the data science and cancer research communities from professionals dedicated to building a national cancer data ecosystem that enables new discoveries and reduces the burden of cancer.

I recently joined NCI to help support strategic data sharing and informatics projects within the Center for Biomedical Informatics and Information Technology (CBIIT). Having worked on information management at another Institute for five years and on the trans-NIH Big Data to Knowledge (BD2K) initiative since its inception, I see this as an exciting opportunity to continue to contribute to enhancing data science across the biomedical community.

Biomedical research is evolving with an increasing emphasis on data science (e.g., data integration and storage, data privacy and security, data analytics, and data representation), driven by the transformative technologies that have become the currency of genomics in precision medicine. In spite of numerous "beachhead" successes, however, the gap between data and clinical utility continues to grow.

In recent years, genomics has been described as a big data science on par with the likes of Twitter, YouTube, and the scientific pursuit of understanding the universe.

Precision medicine has quickly moved to the forefront of clinical research and practice, and is particularly pertinent to cancer since cancer is a disease of the genome. The need to accelerate discovery in cancer research has been further propelled by the Beau Biden Cancer Moonshot, challenging the community to make a decade's worth of progress in five years.

Recent weeks have been momentous as the high-performance computing (HPC) community embraced the challenge of precision medicine. The theme of this year's leading international supercomputing conference, SC16, was "HPC Matters," and it was evident that HPC matters to precision medicine and that precision medicine matters to the HPC community.

These days there seems to be a lot of talk about atlases for cancer. Most of us are familiar with The Cancer Genome Atlas (TCGA), the long-running effort which, over the past decade, sequenced genomes from thousands of tumor samples covering dozens of cancer types. TCGA catalogued the complex patterns of gene mutations underlying tumors, implicated numerous new cancer genes, and is generally viewed as a resounding success.

Scientific discovery involves collecting and analyzing data, and communicating new knowledge arising therefrom. What happens, though, when someone wants to repeat an experiment, or build on an existing approach? For this to happen, there needs to be sufficient information in the public domain and data that is accessible and understandable to the scientist.

In recent years, Challenges have become a popular way to engage and motivate the research and innovation communities to solve difficult problems. Challenges are open competitions where communities are presented with specific and often difficult problems to solve. Participants are given guidelines and test data, and are challenged to compete to find the best solution. Open competition encourages innovative thinking, provides for broad participation, allows funders to set ambitious goals, and is a cost-effective way to encourage collaboration and generate novel solutions.

The cost of DNA sequencing has dropped more than one million-fold over the last decade, making it increasingly possible to discover the genetic basis of cancer and response to treatment.

NCI has launched the Genomic Data Commons (GDC), a system that will promote sharing of genomic and clinical data between researchers and facilitate precision medicine in oncology. The GDC was created to centralize, standardize, and broaden access to data from NCI programs such as The Cancer Genome Atlas (TCGA) and its pediatric equivalent, Therapeutically Applicable Research to Generate Effective Treatments (TARGET).