Cancer Data Science Pulse
Missing pieces for a cancer-focused learning healthcare system
We are in a time of transformation in cancer informatics. Many years of work to make The Cancer Genome Atlas (TCGA) data more widely available through Cloud Pilot implementations will come to fruition in the next few months. At the same time, new programs, such as the National Cancer Moonshot and the Precision Medicine Initiative, are being inaugurated. And it's becoming possible to envision a time when our ability to deeply investigate the cancer genome will drive our diagnostic decisions and refine therapeutic options for the majority of our patients, moving us toward a cancer-focused "learning healthcare system."
Innovations in Bioinformatics and Big Data technologies will play a critical role in this future, advancing our ability to process and analyze unprecedented volumes of high dimensional omics data, potentially across time (progression) and space (tumor heterogeneity). But let's not forget that informatics innovation is needed across the entire spectrum of our field, including Clinical Informatics and Population Informatics. In fact, our ability to deeply investigate the cancer genome is outpacing our ability to relate these changes to the phenotypes that they produce. This correlation between tumor behavior and outcome is a necessary intermediate step towards our future goals. Armed with sufficient data across very large populations, it seems plausible that a learning healthcare system can emerge. But what do we have to do to get there?
I'd argue that two of the most immediate and critical missing pieces include (1) more efficient and scalable methods for sharing cancer data and biospecimens, and (2) more sophisticated methods for extracting and representing cancer clinical phenotypes.
Cancer Data and Biospecimen Sharing
One fundamental problem is the need to work with data and biospecimens beyond the borders of our own institutions. Often the number of available cases at any one institution is insufficient to adequately power comparative studies. Particularly challenging are cohorts with specific temporal features (for example, cancers preceded by high-risk lesions), rare cancers, common cancers with rare phenotypes, and tumors associated with clinical molecular profiling.
Our current, cumbersome methods for collaboration (point-to-point data use agreements and materials transfer agreements) do not scale to support the level of sharing that is needed for these use cases. An important step forward would be the formation of a national cancer data, image, and biospecimen sharing network. Such a network could have a significant impact on translational research, personalized medicine, disease surveillance, and quality improvement. And we may be closer to being able to construct such a network than we have ever been before.
Work on the Patient-Centered Outcomes Research Institute (PCORI) and National Center for Advancing Translational Sciences' (NCATS) Accrual of patients to Clinical Trials (ACT) networks points to widespread acceptance of the concept. In our recently published work [1] on the Text Information Extraction System (TIES) Cancer Research Network,* my colleagues and I described the legal, policy, regulatory, and technical foundations for a network that is already sharing de-identified clinical data, images, and biospecimens across multiple cancer centers. We argue that federation is a key success determinant, because it supports the needs of participating organizations to leverage and protect their own data while gradually converging toward standards and a more open and collaborative environment.
A second important area of research involves the extraction and representation of cancer phenotypes from electronic health records. Stage, tumor extent, recurrence, response to therapy, and outcome are important phenotypic variables that we need to reliably extract from an increasingly complex electronic health record for every cancer patient. But less common phenotypes could also produce valuable insights. Why do some high-risk lesions advance to invasive cancers while others regress? Why are some cancers multifocal? Are there differences between heritable and sporadic cancers that tell us more about the molecular drivers? What genomic profiles are associated with metastasis to specific organ systems? Big Data and data mining approaches to these kinds of questions will rely on detailed phenotypic data that are not systematically collected at the present time.
As a first step, it will be critical to adequately represent our current understanding of cancer phenotypes. Fortunately, many advances in phenotype representation have been made over the past decade. The Human Phenotype Ontology and the Monarch Initiative are two examples that provide a wealth of methodologic guidance. Similar efforts are needed to develop computable cancer phenotypes. At the same time, phenotype extraction methods must be developed that leverage natural language processing to identify and represent key variables from the electronic health record. The Cancer Deep Phenotype Extraction (DeepPhe) project,* a collaboration of the University of Pittsburgh and Boston Children's Hospital, is one example of a project that blends ontology-based phenotype representation with state-of-the-art phenotype extraction methods. Significant further efforts and multiple research groups are needed to help us build a research community that shares these important goals.
As we further develop our methods for cancer phenotyping, we should not miss the opportunity to partner with and support our nation's cancer surveillance network. At the front line of phenotyping, cancer registries and registrars are natural allies for informatics innovators who seek to integrate molecular profile and phenotype. Data gathered on reportable cases of cancer can be used as reference standards for developing better methods for information extraction. At the same time, methods we develop for federated data sharing networks, cohort discovery, and automated abstraction could be leveraged to support the cancer surveillance mission. By working together, we can effect greater and faster change to help reduce the burden of this disease.
* The NCI Informatics Technology for Cancer Research (ITCR) provides funding for the DeepPhe (U24CA184407) and TIES (U24CA180921) projects.
1. Jacobson RS, Becich MJ, Bollag RJ, Chavan G, Dhir R, Corrigan J, Feldman M, Gaudioso C, Legowski E, Maihle N, Mitchell K, Murphy M, Sakthivel M, Tseytlin E, Weaver J. A Federated Network for Translational Cancer Research Using Clinical Data and Biospecimens. Cancer Research 2015; 75(24):5194-201. PMID:26670560 PMCID:PMC4683415