Preparing Genomic Data for NCI Repositories


NCI datasets strive to meet the Findable, Accessible, Interoperable, and Reusable (FAIR) data standards. Conforming data to NCI expectations helps ensure that data can be reused, supports the quality and usefulness of the submitted datasets, and creates a more efficient process.
Note: Depending on how the dataset is funded (extramural, non-NIH funded, intramural), submission instructions will vary.

Genomic datasets should be prepared as follows:

  • Clean the data before submitting it (e.g., finalize the analytical dataset).
  • Submit data pertinent to the interpretation and reproducibility of the genomic data, including:
    • associated phenotype data (e.g., clinical information)
    • exposure data
    • descriptive information (e.g., protocol or methodologies used)
    • metadata about the experiment or study, and the annotations necessary to reproduce any published table or analysis (these must be included with genomic data submissions)
    • terms for disease, cell type, tissue type, and other annotations, linked to the NCI Thesaurus (NCIt)
    • identifiers from the Unified Medical Language System (UMLS), or a term from another existing ontology, if an NCIt identifier is not available
    • common data elements (CDEs) wherever possible; for clinical specimens, the same data elements reported to clinicaltrials.gov are required
  • Detail specimen acquisition, experimental procedures, and data processing and analysis methods (e.g., alignment algorithms, software versions) with your data submission.
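The identifier preference described in the list above (NCIt first, then UMLS or another ontology) can be sketched as follows. This is a minimal illustration, not an NCI tool: the `Annotation` class, its field names, and the placeholder identifiers are hypothetical, and the choice to try UMLS before other ontologies is an assumption, since the guidance does not rank the fallbacks.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Annotation:
    """One annotation (e.g., a disease or tissue type) with candidate identifiers.

    All field names are hypothetical and chosen for illustration only.
    """
    label: str
    ncit_id: Optional[str] = None            # NCI Thesaurus identifier, if one exists
    umls_id: Optional[str] = None            # UMLS identifier, used as a fallback
    other_ontology_id: Optional[str] = None  # term from another existing ontology


def preferred_identifier(a: Annotation) -> Optional[Tuple[str, str]]:
    """Return (source, identifier), preferring NCIt, or None if nothing is set.

    Assumption: UMLS is tried before other ontologies; the guidance lists
    both as fallbacks without stating an order.
    """
    if a.ncit_id:
        return ("NCIt", a.ncit_id)
    if a.umls_id:
        return ("UMLS", a.umls_id)
    if a.other_ontology_id:
        return ("other", a.other_ontology_id)
    return None


# Placeholder identifiers, not real NCIt or UMLS codes.
tissue = Annotation(label="example tissue type", umls_id="UMLS-PLACEHOLDER")
print(preferred_identifier(tissue))
```

A record for which the function returns `None` would need an ontology term assigned before submission.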

For detailed information on the timeline and process for data submission and release by level of data, review the Supplemental Information for the NIH Genomic Data Sharing Policy.

File Formats for Submitting Genomic Data 

Data types undergo different levels of data processing, and the data type and level determine the expectations for data submission and release. Work with the Program Officer to determine specific data submission requirements, as they may differ by program and data type.

The Genomic Data Sharing (GDS) Policy provides guidance by level of genomic data:

  • Level 0: Raw data generated directly from the instrument platform (this data would not be accepted by NCI repositories)
  • Level 1: Initial sequence reads, the most fundamental form of the data after the basic translation of raw input
  • Level 2: Data after an initial round of analysis or computation to clean the data and assess basic quality measures
  • Level 3: Analysis to identify genetic variants, gene expression patterns, or other features of the dataset
  • Level 4: Final analysis that relates the genomic data to phenotype or other biological states
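For teams that track files by level in their own tooling, the levels above can be modeled as a small enumeration. This is only a sketch: the member names and the helper function are hypothetical, and the helper encodes just the one rule stated in the list (Level 0 data would not be accepted by NCI repositories). Actual requirements should be confirmed with the Program Officer.

```python
from enum import IntEnum


class DataLevel(IntEnum):
    """Genomic data levels as described in the list above (member names are illustrative)."""
    RAW_INSTRUMENT = 0   # raw data generated directly from the instrument platform
    INITIAL_READS = 1    # initial sequence reads
    CLEANED_QC = 2       # data after initial cleaning and basic quality assessment
    FEATURES = 3         # variants, expression patterns, or other features
    FINAL_ANALYSIS = 4   # analysis relating genomic data to phenotype


def accepted_by_nci_repositories(level: DataLevel) -> bool:
    """Encode only the rule stated above: Level 0 data would not be accepted."""
    return level != DataLevel.RAW_INSTRUMENT


for level in DataLevel:
    print(level.value, level.name, accepted_by_nci_repositories(level))
```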

Metadata should be submitted for sharing on an unrestricted basis, concurrent with the relevant Level 1, 2, 3, or 4 genomic data. Metadata can include information about the experiment or study, or information necessary to interpret controlled-access genomic data, such as study protocols, data instruments, and survey tools.

Data Types by Level

The table below gives example file formats for each data type at each level. NIH will review these expectations at regular intervals, publish updates on the GDS website, and notify the research community through appropriate communication methods (e.g., NIH Guide for Grants and Contracts).

| Data Type | Level 1 | Level 2 | Level 3 | Level 4 |
| --- | --- | --- | --- | --- |
| SNP array data from > 500K single nucleotide polymorphisms (SNPs) (e.g., GWAS data) | .CEL, .TXT, .IDAT (Note: submission of .IDAT files for human sample data will be decided on a case-by-case basis) | N/A | .TXT | .TXT |
| DNA sequence data from < 100 genes or regions of interest (e.g., targeted sequencing) | N/A | .BAM | Arrays: .TXT; NGS: .MAF, .VCF, .PED | .TXT |
| DNA sequence data from ≥ 100 genes, regions of interest (e.g., targeted sequencing, whole exome sequencing, whole genome sequencing) | N/A | .BAM | Arrays: .TXT; NGS: .MAF, .VCF, .PED | .TXT |
| RNA sequencing (RNA-seq) data (e.g., transcriptomic and targeting RNAseq data) | .FASTQ, .SFF, .HDFS, Complete genomics native (Note: required for human sample data only) | N/A | Arrays: .TXT; NGS: .WIG, .TXT | .TXT |
| Genome-wide DNA methylation data (e.g., bisulfite sequencing data) | N/A | .BAM | Arrays: .TXT; NGS: .MAF, .VCF, .TXT, .BED | |
| Genome-wide chromatin immunoprecipitation sequencing (ChIP-seq) data (e.g., transcription factor ChIP-seq, histone modification ChIP-seq) | N/A | .BAM | Arrays: .TXT; NGS: .WIG, .TXT, .BED | .TXT |
| Metagenome (or microbiome) sequencing data (e.g., 16S rRNA sequencing, shotgun metagenomics, whole-genome microbial sequencing) | N/A | .BAM | NGS: .WIG, .TXT | .TXT |
| Metatranscriptome sequencing data (e.g., microbial/microbiome transcriptomics) | N/A | .BAM | NGS: .WIG, .TXT | .TXT |
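As a pre-submission convenience, the example formats in the table above can be encoded as a lookup and used to flag files whose extension is not among the examples for a given data type and level. This is a hedged sketch, not an official validator: the dictionary keys (`snp_array`, `dna_seq`, and so on) and the function name are hypothetical, the two DNA sequence rows are merged because they list the same formats, and "Complete genomics native" is omitted because it is not a file extension. A file that fails this check is not necessarily unacceptable, since the table gives examples only.

```python
from pathlib import Path

# Example formats per data type and level, transcribed from the table above.
# Keys are illustrative names; an empty set means N/A or no format listed.
EXAMPLE_FORMATS = {
    "snp_array":         {1: {".cel", ".txt", ".idat"}, 2: set(), 3: {".txt"}, 4: {".txt"}},
    "dna_seq":           {1: set(), 2: {".bam"}, 3: {".txt", ".maf", ".vcf", ".ped"}, 4: {".txt"}},
    "rna_seq":           {1: {".fastq", ".sff", ".hdfs"}, 2: set(), 3: {".txt", ".wig"}, 4: {".txt"}},
    "methylation":       {1: set(), 2: {".bam"}, 3: {".txt", ".maf", ".vcf", ".bed"}, 4: set()},
    "chip_seq":          {1: set(), 2: {".bam"}, 3: {".txt", ".wig", ".bed"}, 4: {".txt"}},
    "metagenome":        {1: set(), 2: {".bam"}, 3: {".wig", ".txt"}, 4: {".txt"}},
    "metatranscriptome": {1: set(), 2: {".bam"}, 3: {".wig", ".txt"}, 4: {".txt"}},
}


def is_example_format(data_type: str, level: int, filename: str) -> bool:
    """Return True if the file's final extension is an example format in the table.

    Only the last suffix is checked, so compressed files such as
    'reads.fastq.gz' are not matched by this simple sketch.
    """
    suffix = Path(filename).suffix.lower()
    return suffix in EXAMPLE_FORMATS[data_type][level]


print(is_example_format("rna_seq", 1, "sample1.FASTQ"))
print(is_example_format("snp_array", 2, "calls.txt"))
```

An unknown data type or level raises `KeyError`, which is deliberate here: it surfaces typos in the caller's labels rather than silently returning `False`.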

Note: Metadata or other data pertinent to the interpretation of genomic data, such as associated phenotype data (e.g., clinical information), exposure data, and descriptive information (e.g., protocol or methodologies used), should be shared. Metadata about the experiment or study and annotations that are necessary to reproduce any published table or analysis must be included with genomic data submissions.

 

For additional examples of data that falls under the scope of the GDS policy, review Supplemental Information to the National Institutes of Health Genomic Data Sharing Policy.
