Population Level Pilot: Population Information Integration, Analysis, and Modeling for Precision Surveillance

The population level pilot of the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program applies natural language processing (NLP) and deep learning algorithms to population-based cancer statistics collected by the National Cancer Institute (NCI) Surveillance, Epidemiology, and End Results (SEER) program.

The goal of this pilot is to advance cancer care by using computer-assisted technologies, such as artificial intelligence and advanced analytics, to transform our knowledge of population-based cancer data. This project will help us better understand the impact of new diagnostics, treatments, and other factors affecting patient trajectories and outcomes.

Pilot Leads

With shared expertise across the JDACS4C collaboration, this pilot is jointly led by:

  • Dr. Lynne Penberthy, National Cancer Institute, Division of Cancer Control and Population Sciences
  • Dr. Georgia “Gina” Tourassi, Department of Energy, Oak Ridge National Laboratory

Aims of the Pilot

  • Automate the capture of reportable cancer surveillance data items from unstructured clinical text to improve the capacity of the NCI SEER program
  • Advance the integration and analysis of extreme-scale graphics, visuals, and data management systems to better understand drivers in patterns of cancer outcomes and predict clinical endpoints
  • Promote data-driven modeling of patient-specific and population-level health trajectories to guide precision cancer care

Progress to Date

Since the launch of the pilot in 2016, the team has:

  • Developed, deployed, and refined pathology report annotations to advance critical training data and validate computational models
  • Established data use agreements to access SEER cancer registry data to serve as a platform for deep learning applications
  • Applied NLP tools to automatically extract and code key features in pathology reports
  • Created a preliminary algorithm to automate a process for identifying reportable and non-reportable cases from millions of pathology reports; because these reports are received in real time, this algorithm will support the development of near-real-time cancer trends reporting
  • Developed an application programming interface (API) to automate coding for primary tumor site, histology, laterality, behavior, and grade from pathology reports with uncertainty quantification (UQ); UQ enables fully automated coding on a subset of all pathology reports by identifying reports with low levels of accuracy and sending them to manual review
  • Gathered current data sets to develop multiple algorithms that capture cancer recurrence; preliminary accuracy of recurrence coding is 96% for pathology reports
  • Constructed a knowledge graph to store and analyze cancer registry data sets to enable complex queries and iterative analyses
  • Developed a privacy-preserving model for the tumor site, histology, laterality, behavior, and grade API to protect any personal identifying information that is embedded in the model and expand the ability to share the trained model broadly
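
The uncertainty quantification step described above, where some reports are coded automatically and the rest go to manual review, can be illustrated with a minimal sketch. This is not the pilot's implementation: the source does not describe the UQ method, so the sketch assumes the simplest stand-in, a threshold on the model's highest predicted probability. The `Prediction` class, the `route_reports` function, the 0.95 threshold, the report IDs, and the probabilities are all hypothetical; the site codes are used only as example labels.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Prediction:
    """Hypothetical container for one report's model output."""
    report_id: str
    probabilities: Dict[str, float]  # candidate code -> predicted probability


def route_reports(
    predictions: List[Prediction], threshold: float = 0.95
) -> Tuple[Dict[str, str], List[str]]:
    """Split reports into auto-coded and manual-review groups.

    Assumption: confidence is the highest predicted probability.
    The pilot's actual UQ method may differ.
    """
    auto_coded: Dict[str, str] = {}
    manual_review: List[str] = []
    for pred in predictions:
        # Pick the most probable code and its probability.
        label, confidence = max(pred.probabilities.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            auto_coded[pred.report_id] = label   # confident: code automatically
        else:
            manual_review.append(pred.report_id)  # uncertain: send to a registrar
    return auto_coded, manual_review


# Illustrative inputs only; not real registry data.
predictions = [
    Prediction("r1", {"C50.9": 0.98, "C34.1": 0.02}),
    Prediction("r2", {"C50.9": 0.55, "C34.1": 0.45}),
]
auto_coded, manual_review = route_reports(predictions)
```

Under these assumptions, raising the threshold sends more reports to manual review and lowers the risk of an incorrect automatic code, while lowering it does the reverse.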

Future Development

In the future, together with the knowledge gained from the JDACS4C molecular and cellular level pilots, this population level pilot will offer:

  • Improved, real-time development of new, integrated sources of health and cancer surveillance information
  • Enhanced data through linkage activities to gain access to new and different sources, such as radiology reports, pharmacy, patient treatments, biomarkers, clinical trials, and social determinants of health data
  • Additional insight into the effects of real-world factors on patient health trajectories, eligibility for clinical trials, and prediction of cancer patient outcomes
  • More refined usability testing through a laboratory study to analyze and inform tool implementation and integration within the registry workflow

