Cancer Data Science Pulse
An Introduction to Cloud Computing for Cancer Research
The advantages of working with cancer research data in a cloud environment are undeniable. Through cloud computing, researchers can access petabyte-scale data (and beyond), representing the full range of biomedical information available today—genomics, proteomics, metabolomics, transcriptomics, as well as clinical and epidemiological data.
Over the last decade, we’ve made significant progress in bringing researchers to these data, all within an ecosystem equipped with the necessary power and resources to drive in-depth, complex analysis using established workflows and tools.
As a result, cloud computing is proving especially useful in accelerating cancer research for precision medicine. For example, our team at NCI, along with researchers from Massachusetts General Hospital, recently built a biomedical informatics pipeline leveraging two cloud-based research data ecosystems—NCI’s Cancer Research Data Commons (CRDC) and NIH’s All of Us Research Program. Our pipeline integrated publicly available data sets, interoperability standards, and data science approaches, which were then applied to several scenarios in precision medicine and precision oncology.
Indeed, cloud computing enabled us to conduct analyses on hundreds of thousands of patients, and those calculations took only a few months, not years, to both perform and publish. Over the past few months, we have presented this work at several medical, technical, and public health conferences: the Third RAS Initiative Symposium, the Observational Health Data Sciences and Informatics Global Symposium, the Global Alliance for Genomics and Health 9th Plenary meeting, and the Transdisciplinary Conference for Future Leaders in Precision Public Health. The findings also have led to published studies on pharmacogenomics and genomic testing in the Journal of Medical Genetics and JCO Clinical Cancer Informatics, respectively.
Costs, Access, and Training
Despite these significant advances, we’ve yet to realize the full potential of cloud computing in technology-driven approaches to cancer research and care. Costs, access, training, and other issues all can serve as hurdles to researchers.
If you’re considering incorporating cloud computing into your research, a key first step is to understand how your project falls into specific cloud cost categories. We recommend starting with the following categories: cloud storage, cloud compute, cloud data transfer, and cloud services.
Cloud storage involves the way data are accessed for research. As expected, the more data you store, the higher the costs. But storage retrieval times (i.e., latency) can also affect costs. Researchers who require immediate access to data sets can expect to pay more than those with more flexible data set access needs.
Cloud compute costs involve the specific hardware and software resources needed to perform your research analyses. Users often have the option of customizing cloud environments to fit their research needs, with more highly configured and feature-rich virtual machines costing more. You typically also pay for any time on the cloud server, with the billing “meter” running while actively (or even passively) conducting any analysis. Finally, your desired level of cloud availability can strongly influence costs. For example, costs to conduct analyses “on-demand” (i.e., whenever you want) are higher than preemptible or spot instances, which offer steep discounts if you are flexible with the timing and duration of your analysis.
Cloud data transfer costs are linked to performing data uploads (“ingress”) or downloads (“egress”), as well as time spent moving data between different cloud billing regions. Finally, the costs of vendor-specific cloud services can vary greatly according to the type and complexity of technologies being leveraged, including natural language processing, data analytics/visualization, and artificial intelligence tools (e.g., machine learning, deep learning).
In addition to the cloud-cost categories described above, staff expertise and training resources will also be critical. Cloud technologies change and evolve rapidly, with more platforms, workflows, and services being added every day. There is a growing need for research teams capable of handling the increasingly multidisciplinary aspects of biomedical research data.
Investigators should plan for cloud-related costs, training, and staffing requirements in advance. By proactively categorizing and prioritizing cloud costs, assessing technical resources, and making informed project decisions, researchers will be able to perform successful, scalable cancer research in the cloud. (For a use case-driven example, please see the paper, “Practical Aspects of Implementing and Applying Health Care Cloud Computing Services and Informatics to Cancer Clinical Trial Data.”)
Resources from NCI’s CRDC
NCI recognized that it can be a daunting task to learn a new system while concurrently minimizing costs and performing cutting edge research. For this reason, NCI offers cloud credits for compute and storage on the CRDC to allow users to “kick the tires” and become familiar with the platform.
NCI’s CRDC also offers resources that lower entry barriers and help researchers tap into cloud computing power without requiring deep technical skills. These resources include tutorials, quick-start guides, and practical examples of tools being used in cancer research today. For more information, please check out NCI’s Cloud Resources:
- Cancer Genomics Cloud, powered by Seven Bridges;
- Broad Institute’s Firecloud, powered by Terra; and
- ISB’s Cancer Gateway in the Cloud.
(For additional information, see these recent blogs, “NCI’s Cloud Resources Help Tame Today’s Data Windfall,” “Cloud Resources: Cancer Genomics Cloud Helps Power Data Discovery and Analysis to Advance Cancer Research,” and “ISB-CGC Cloud Resource: Providing Researchers with Shortcuts to Data Analysis.”)
In summary, the future looks bright for cloud computing, with many researchers beginning to use cloud resources routinely for large-scale cancer research. However, responsible engagement and broader adoption by the cancer research community will be necessary to help us unlock the full potential of precision medicine and precision oncology and improve care for all patients.