A pipeline workflow illustrating the extraction, transformation, and loading of data through the ISB-Cancer Gateway in the Cloud (CGC). ISB-CGC is a component of the NCI Cancer Research Data Commons. Title text reads, "Overview of ISB-CGC Data Preparation Process." A table on the left reads as follows: "25+ Data Sources/Programs. BEATAMIL. CCLE. CGCI. CMI. CPTAC. CTSP. FM. GENIE. HCMI. MMRF. NCICCR. OHSU. ORGANOID. TARGET. TCGA. VAREPOP. WCDT. CBTN. CPTAC. ICPC. Targetome. Reactome. Pan-Cancer Atlas. Georgetown Proteomics Research Program. Quantitative Digital Maps of Tissue Biopsies. 500k+ Files of Heterogeneous Data. WGS. DNA Seq WXS. RNA Seq (gene, isoform, exon, junction). SNP Array (CEL). DNA Seq (MAF, VCF). DNA Methylation. Protein (RPPA). Clinical & Biospecimen. miRNA Seq. SNP Array." This table and all of its entries point to a section of the workflow titled, "Download Data via Multiple Protocols. APIs. HTTPS. SFTP." This part of the workflow is preceded by "Deploy Custom VMs. Memory. Disk. Network." and "Write Code for Data Source-Specific Pipelines." After these three opening parts, the workflow continues to "Cloud Storage/Local VM Disk." These sections of the workflow are labeled "Extract." The next sections are labeled "Transform." They start with "Convert & Standardize File Formats. CSV. TSV. XML. JSON." Then, "QC & Normalize Data. Missing Values. Inconsistent Value Formats. Deduplication." Then, "Create Files for BigQuery Import." The final sections of the workflow are labeled "Load." There is a split between "Intermediate BigQuery Tables" and "Final BigQuery Tables." The former progresses to "BigQuery SQL Joins & Transformations. Normalize. Filter. De-Normalize. Pivot" before concluding with "Final BigQuery Tables."
Overview of ISB-CGC Data Preparation Process. Graphic credit: John Phan, Ph.D., ISB-CGC
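The "Transform" stage in the graphic (handling missing values, inconsistent value formats, and deduplication, then creating files for BigQuery import) can be illustrated with a minimal Python sketch. This is not ISB-CGC's actual pipeline code: the function names, the list of missing-value tokens, and the `case_id` and `stage` fields are all hypothetical, chosen only for illustration. The output is newline-delimited JSON, one of the formats BigQuery can load.

```python
import json

# Assumption: tokens that a source might use to mean "no value".
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "--"}


def normalize_record(record):
    """Lowercase field names, trim strings, and map missing tokens to None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if value.lower() in MISSING_TOKENS:
                value = None
        cleaned[key.strip().lower()] = value
    return cleaned


def deduplicate(records, key_fields):
    """Keep the first record seen for each combination of key fields."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(rec, sort_keys=True) for rec in records)


def transform(raw_records, key_fields):
    """Normalize, deduplicate, and serialize records for import."""
    normalized = [normalize_record(rec) for rec in raw_records]
    return to_ndjson(deduplicate(normalized, key_fields))


if __name__ == "__main__":
    # Hypothetical clinical rows with inconsistent headers and missing values.
    raw = [
        {"Case_ID": " C1 ", "Stage": "N/A"},
        {"case_id": "C1", "stage": ""},
        {"case_id": "C2", "stage": "II"},
    ]
    print(transform(raw, ["case_id"]))
```

Normalizing before deduplicating matters here: the first two rows only become recognizable as duplicates once their field names and whitespace have been standardized.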