Cancer Data Science Pulse
Towards FAIR&R Cancer Data Analysis
In an era of unprecedented growth in the size and variety of datasets and the number of software tools, there is an ever-increasing need for frameworks that connect and integrate data and tools within a secure and easy-to-use research ecosystem. In the absence of such frameworks, data and tools often remain isolated, requiring researchers to become familiar with multiple computation environments, to perform costly and time-consuming data transfers between environments, and to repeatedly reformat data in order to pass it from one tool to the next. Fortunately, with growing emphasis on the 'Findable, Accessible, Interoperable, and Reusable' (FAIR) principles for tools and data alike, and major initiatives such as the National Institutes of Health (NIH) Data Commons Pilot, the National Cancer Institute (NCI) Cancer Research Data Commons (CRDC), and the National Human Genome Research Institute's (NHGRI) Analysis, Visualization, and Informatics Lab-Space (AnVIL), integrated data and computation ecosystems are becoming a reality.
The Knowledge Engine for Genomics (KnowEnG) is a collection of machine learning tools for genomic datasets. It uses an integrated knowledge network composed of public datasets, gene/protein interactions, relationships, and annotations. For example, a user looking for important pathways involved in novel cancer subtypes can query gene expression or mutation data from a cancer study in The Cancer Genome Atlas (TCGA) and run the KnowEnG Sample Clustering tool to group the patient profiles into subtypes. They can then find out which genes are relevant to each subtype using the KnowEnG Gene Prioritization tool, and submit the discovered genes to the KnowEnG Gene Set Characterization tool to identify the crucial pathways that are disrupted in each cancer subtype. The KnowEnG tools are being used for discovery as part of a collaboration with the Mayo Clinic on cancer drug response and the prediction of drug efficacy in patients (PMID: 28800781).
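The three-step analysis described above (cluster samples, prioritize genes, characterize gene sets) can be illustrated with a toy sketch. This is not the KnowEnG implementation: the expression values, gene and pathway names, and all function names below are invented for illustration, and plain k-means, mean-difference ranking, and overlap counting stand in for the knowledge-network-based methods of the actual tools.

```python
from statistics import mean

# Toy expression matrix (sample -> gene -> expression level); all values are invented.
EXPRESSION = {
    "S1": {"G1": 9.0, "G2": 8.5, "G3": 1.0, "G4": 1.2},
    "S2": {"G1": 8.7, "G2": 9.1, "G3": 0.8, "G4": 1.0},
    "S3": {"G1": 1.1, "G2": 0.9, "G3": 9.2, "G4": 8.8},
    "S4": {"G1": 0.7, "G2": 1.3, "G3": 8.9, "G4": 9.4},
}

# Hypothetical pathway annotations standing in for a real knowledge network.
PATHWAYS = {"pathway_A": {"G1", "G2"}, "pathway_B": {"G3", "G4"}}

GENES = sorted(next(iter(EXPRESSION.values())))


def distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a[g] - b[g]) ** 2 for g in GENES)


def cluster_samples(expr, seeds, rounds=10):
    """Step 1: group samples with plain k-means, seeded by the named samples."""
    centroids = [dict(expr[s]) for s in seeds]
    labels = {}
    for _ in range(rounds):
        labels = {
            s: min(range(len(centroids)), key=lambda k: distance(profile, centroids[k]))
            for s, profile in expr.items()
        }
        for k in range(len(centroids)):
            members = [expr[s] for s in expr if labels[s] == k]
            if members:  # keep the old centroid if a cluster empties out
                centroids[k] = {g: mean(m[g] for m in members) for g in GENES}
    return labels


def prioritize_genes(expr, labels, cluster, top_n=2):
    """Step 2: rank genes by mean expression inside the cluster minus outside.

    Assumes at least one sample inside and one outside the cluster.
    """
    inside = [expr[s] for s in expr if labels[s] == cluster]
    outside = [expr[s] for s in expr if labels[s] != cluster]
    score = {g: mean(p[g] for p in inside) - mean(p[g] for p in outside) for g in GENES}
    return sorted(GENES, key=lambda g: score[g], reverse=True)[:top_n]


def characterize_gene_set(genes, pathways):
    """Step 3: pick the pathway sharing the most genes with the prioritized set."""
    overlap = {name: len(set(genes) & members) for name, members in pathways.items()}
    return max(overlap, key=overlap.get)


def find_subtype_pathways(expr, pathways, seeds):
    """Chain the three steps and report (top genes, pathway) per subtype."""
    labels = cluster_samples(expr, seeds)
    report = {}
    for cluster in sorted(set(labels.values())):
        genes = prioritize_genes(expr, labels, cluster)
        report[cluster] = (genes, characterize_gene_set(genes, pathways))
    return report


for cluster, (genes, pathway) in find_subtype_pathways(EXPRESSION, PATHWAYS, ["S1", "S3"]).items():
    print(f"subtype {cluster}: top genes {genes}, best-matching pathway {pathway}")
```

The point of the sketch is the hand-off: each step consumes the previous step's output directly, which is the chaining that becomes cumbersome when the steps live on separate platforms.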
Previously, to analyze TCGA data with the KnowEnG tools, a researcher would first have had to retrieve the data and perform primary analysis on the Seven Bridges Cancer Genomics Cloud (CGC) (or elsewhere), download the results, reformat the data, and then upload it to the KnowEnG platform for 'downstream' analysis. This is not only time-consuming but also requires users to have the skills to reformat data and manage their work across disparate systems.
In a collaboration between the KnowEnG and Seven Bridges teams, the KnowEnG tools were published as native workflows on the CGC, colocating them with petabyte-scale TCGA data, making them interoperable with the suite of other analytical tools available in the CGC and more accessible to the entire CGC user community. As a result, the complex workflows can now be executed directly on the CGC by invoking the KnowEnG tools without having to move either data or code. Moreover, under the hood, the KnowEnG big-data Knowledge Network required for the analyses is integrated and made available for use regardless of its physical location. One could envision bringing other custom collections of knowledgebases into this analysis framework.
Two community-developed solutions enabled publication of the KnowEnG tools on the CGC. The first was Docker: the KnowEnG tools were made available as Docker containers. The second was the CGC's ability to execute external tools and workflows described in the Common Workflow Language (CWL). Such packaging also makes the KnowEnG tools portable and executable on any platform that supports CWL. With the KnowEnG tools already in Docker containers, they simply needed to be described in CWL and published as public tools in the CGC.
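To give a sense of what describing a containerized tool in CWL involves, here is a minimal sketch that builds a CWL `CommandLineTool` description as a Python dictionary and prints it as JSON (CWL documents may be written in YAML or JSON). The image name, command, input name, and output pattern are hypothetical placeholders, not the actual KnowEnG tool descriptions published on the CGC.

```python
import json


def describe_tool(image, base_command, input_name):
    """Build a minimal CWL v1.0 CommandLineTool description.

    `image`, `base_command`, and `input_name` are illustrative; a real
    description would list every input, parameter, and output of the tool.
    """
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        # Command executed inside the container.
        "baseCommand": base_command,
        # Tells the CWL runner which Docker image to pull and run the tool in.
        "requirements": {"DockerRequirement": {"dockerPull": image}},
        # One input file, passed as the first positional argument.
        "inputs": {
            input_name: {"type": "File", "inputBinding": {"position": 1}},
        },
        # Collect every TSV file the tool writes to its working directory.
        "outputs": {
            "results": {"type": "File[]", "outputBinding": {"glob": "*.tsv"}},
        },
    }


# Hypothetical image and command, for illustration only.
tool = describe_tool(
    image="example/sample-clustering:latest",
    base_command=["python3", "run_clustering.py"],
    input_name="spreadsheet",
)
print(json.dumps(tool, indent=2))
```

Because the description names the container image rather than a local installation, any CWL-capable platform can fetch the same image and run the tool in the same way, which is what makes this packaging portable.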
This mechanism to "publish" tools redefines what it means to make tools available as open source. Rather than releasing non-portable code in a GitHub repository, such a release of fully annotated, optimized, and automated tools is expected to democratize access for the entire cancer research community already using the CGC. Toolkits such as Seven Bridges' open-source Rabix Composer, which help researchers containerize tools and workflows using simple visual and text-based editors, as well as international efforts such as the Global Alliance for Genomics and Health (GA4GH), which is working to develop standards and frameworks for tool portability, are expected to play an important role in the push towards greater interoperability and sustainability of tools.
Publishing the KnowEnG tools on the CGC has thus made them FAIR&R:
- Findable through the Docker Hub repository,
- Accessible for immediate use by researchers to analyze their own data in conjunction with data sets available on the CGC through the Public Apps collection,
- Interoperable with the hundreds of other tools and workflows available on the CGC and other platforms, and
- Reusable in a Reproducible manner through documentation of the inputs and parameters used for each analysis run on the CGC.
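The last point, reproducibility through documented inputs and parameters, can be sketched as a small run record. This is only an illustration of the idea and not how the CGC stores its task records; the function names, record fields, and example values are all assumptions.

```python
import hashlib
import json


def record_run(tool, parameters, inputs):
    """Describe one analysis run: tool name, parameters, and input fingerprints.

    `inputs` maps an input name to its raw bytes; each is stored as a
    SHA-256 digest so the record identifies the data without copying it.
    """
    return {
        "tool": tool,
        "parameters": dict(sorted(parameters.items())),
        "inputs": {
            name: hashlib.sha256(data).hexdigest()
            for name, data in sorted(inputs.items())
        },
    }


def run_id(record):
    """Derive a short identifier from the canonical JSON form of a record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical tool name, parameters, and data, for illustration only.
first = record_run("sample-clustering", {"clusters": 3}, {"expression": b"G1\t9.0\n"})
repeat = record_run("sample-clustering", {"clusters": 3}, {"expression": b"G1\t9.0\n"})
changed = record_run("sample-clustering", {"clusters": 4}, {"expression": b"G1\t9.0\n"})

print(run_id(first) == run_id(repeat))   # same tool, parameters, and data
print(run_id(first) == run_id(changed))  # one parameter differs
```

Two runs with the same tool, parameters, and input data yield the same identifier, while changing any one of them yields a different one, so a record of this kind is enough to tell whether an analysis was repeated exactly.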
The success of such efforts to make different research systems interoperable relies on both technology development and cultural shifts. Continued community development and adoption of toolkits such as those used in this project, along with the availability of interoperable data storage and analysis platforms and an increased understanding of the benefits of the FAIR principles, is expected to catalyze the development of the next generation of interoperable research ecosystems required to accelerate biomedical discovery.
Dr. Saurabh Sinha is a Professor and Willett Faculty Scholar in the Dept. of Computer Science and Faculty at the Carl Woese Institute of Genomic Biology at the University of Illinois, Urbana-Champaign. Liz Williams was Scientific Program Manager at Seven Bridges Genomics for this project. Ishwar Chandramouliswaran is a Bioinformatics Program Officer at NIH.