Cancer Data Science Pulse
Breaking Down Data Silos: The Urgency of Now
Now is the time for researchers across domains to ideate together, share data, and maximize the utility of those data. This is "the urgency of now," according to former Vice President Joe Biden, who delivered the keynote address at the September 2017 Human Proteome Organization (HUPO) Annual World Congress. The Congress opened with the 2017 Global Leadership Gala Dinner, held at the Royal College of Physicians of Ireland in Dublin. Dinner attendees included world leaders in proteomics and proteogenomics, the large-scale study of proteins and the genes that encode them.
In his dinner address, Biden spoke of the importance of cross-disciplinary cancer research and the urgent need to "break down data silos." He praised HUPO researchers, including those in the International Cancer Proteogenome Consortium (ICPC), noting that their work would lead to innovations and would ultimately save tens of millions of lives.
Prior to Biden's remarks, Dr. Jerry Lee, Deputy Director for the Center for Strategic Scientific Initiatives (CSSI), spoke on behalf of Dr. Doug Lowy, then Acting Director of the National Cancer Institute. Dr. Lee set the stage by describing the behind-the-scenes work that contributed to The Cancer Genome Atlas (TCGA), the early stages of the Clinical Proteomic Tumor Analysis Consortium (CPTAC), and his experience working with the Office of the Vice President's Cancer Moonshot Taskforce. Dr. Lee gave his personal thanks to Biden, "for lending his voice to extend our commitment to data sharing launching the NCI Genomic Data Commons... For lending his voice to converge several agencies to think about proteogenomics... For lending his decades of foreign policy experience to reach out and solidify proteogenomics [Memoranda of Understanding] with several countries..." Each anecdote Dr. Lee shared built up to two major announcements: 1) NCI has launched a new video about the value of proteogenomics research in cancer medicine; and 2) NCI has begun the development of a Proteomic Data Commons. The latter would continue Biden's charge to enhance data sharing and complement the NCI Genomic Data Commons, which launched in 2016.
Proteogenomics has evolved out of the recognition that studying the expression and modification of both genes and proteins together can provide a more complete picture than studying either of them alone. Not too long ago, genomics and proteomics labs were somewhat isolated, and sharing data across fields was cumbersome. Other domains such as epidemiology and radiological imaging were also siloed. Researchers could download data from public data repositories or request data directly from other labs, attempt to format and parse the data, and make the data compatible with their in-house scripts and software. This work was limited by each institution's local storage and compute capacity, as well as by budget, time, and a slew of other obstacles. Integrating across data types remains difficult, but over the past couple of decades, the need to integrate multiple data types has been increasingly appreciated, and the infrastructure to support this type of cross-disciplinary science has been in increasingly high demand.
NCI Proteomic Data Commons: CSSI-CBIIT Collaborations
NCI's existing Proteomics Data Coordinating Center has been tasked with managing and distributing data from select NCI proteomics programs. However, with the rise of proteogenomics, the inception of ICPC, and the launch of programs such as the Applied Proteogenomic Organizational Learning and Outcomes (APOLLO) Network, a new model to support this proteogenomics data management effort is crucial.
A joint NCI CSSI-Center for Biomedical Informatics and Information Technology (CBIIT) team led by Dr. Chris Kinsinger from CSSI and Dr. Izumi Hinkson from CBIIT has awarded a contract to Enterprise Science and Computing to prototype the Proteomic Data Commons (PDC) as a component of the recently announced NCI Cancer Research Data Commons (CRDC). The PDC will be a cloud-based system that interoperates with the existing Genomic Data Commons and Cloud Resources within the NCI CRDC, enabling researchers to perform biomedical data analyses at scale. The NCI CRDC aims to support cancer research initiatives by sustaining multidisciplinary bioinformatics resources. These include data knowledge bases, or nodes, centered on different research and clinical data including proteomics. As a node in the NCI CRDC, the PDC will be more than a repository for data storage: it will provide seamless integration of proteomics data with other data types hosted in the CRDC. As experts in proteomics, Dr. Hinkson and Dr. Kinsinger will ensure that the PDC's overarching goal is to democratize access to cancer-related proteomic datasets and to provide sustainable computational support to the proteomics and cancer research communities.
This new endeavor by the NCI is designed to address the need for increased collaboration across biomedical domains and to provide the infrastructure necessary to facilitate it. The PDC will host large datasets such as those generated by CPTAC, ICPC, and APOLLO. Ultimately, through the PDC and, more generally, the CRDC, proteomic data will be shared more broadly and reach audiences in fields that have not traditionally re-used proteomic data. Dr. Hinkson describes one of the PDC's main goals as "providing a means for researchers to perform data analyses regardless of their level of proteomic expertise." This will allow users from a range of clinical and research fields to perform cross-disciplinary analyses of NCI-generated data in combination with their own data.
Aligned with the urgent challenge described by Biden, NCI is working with public and private sector partners towards creating and optimizing infrastructures such as the PDC and other nodes of the NCI CRDC to support this mission.