Cancer Data Science Pulse
The Protein Data Bank and the Importance of Sustaining Primary Data Archives
Scientific discovery involves collecting and analyzing data and communicating the new knowledge that results. What happens, though, when someone wants to repeat an experiment or build on an existing approach? For this to happen, sufficient information must be in the public domain, and the underlying data must be accessible and understandable to the scientist.
Some journals allow publication of supplementary data, but such provisions are far from ideal. For example, there are no easy ways to search these data. Today, best practices call for depositing primary data and metadata into a domain-specific archival repository prior to publication. These repositories must be trusted, stable, secure, and open access, so that others can build on previous work, reproduce findings, and advance the scientific enterprise.
Structural biologists understood the importance of archiving primary data when they established the Protein Data Bank (PDB) in 1971 with just seven protein structures. Since 2003, the Worldwide PDB (wwPDB) partnership has managed the PDB archive, ensuring that experimental data and metadata are expertly curated, validated, and made freely available. The wwPDB is currently made up of:
- Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB)
- Protein Data Bank in Europe (PDBe)
- Protein Data Bank Japan (PDBj)
- BioMagResBank (BMRB)
Each wwPDB partner organization is supported by local funders, thereby spreading the financial burden of maintaining the PDB archive across different geographies. Looking ahead, additional wwPDB members could be recruited from Asia and Latin America as research and development spending rises in various emerging economies.
PDB statistics and usage are tracked to monitor the archive's impact on science.
| Metric | Value |
| --- | --- |
| PDB archive size | 122,000 structures; over one billion non-hydrogen atom coordinates |
| PDB archive growth rate | Approximately 11,000 new structures each year |
| Data downloaded | More than 500 million downloads of structure data per year |
| Literature citations | Over 20,000 since the year 2000 |
| Unique users | Over 1 million per year |
| Areas of biology and medicine featured in the PDB archive | All of them! |
| New science enabled by the archive | Structure-based drug discovery; protein design; protein structure prediction; etc. |
What about funders' statements that there is not enough money to sustain the many scientific data resources available today? This is undoubtedly true, but we should be careful to distinguish "archival repositories" for primary data from "derived databases" that aggregate information from other sources. Without primary data archives, valuable derived data resources (e.g., model organism databases) could not function, and entire areas of scientific inquiry could disappear.
The NIH and the academic community can take concrete steps to ensure this does not happen. Primary data archives must be sustained for the long term. Funding needs to be provided at levels that ensure quality service to depositor and user communities alike. Ongoing infrastructure investments must support 24/7/365 operations with global reach, full data integrity and security, quality control, and provisions for periodic hardware and software upgrades. Skilled personnel with the necessary domain expertise must be recruited, trained, and adequately compensated so that they are not recruited away to big data companies.
New review mechanisms for data resources need to be established that focus on these requirements, which are necessarily distinct from the criteria used to review research grants. In addition, new business models are urgently required for funding primary data archival repositories. Federal agencies should reserve a percentage of their annual budgets to support archival data repositories. From this set-aside, funds could be allocated to individual resources at levels reflective of usage and impact metrics, with ongoing, rigorous review of the management team.
A strengthened commitment to sustaining existing archival data repositories and establishing new ones will help to support the global scientific enterprise by ensuring that primary data are safeguarded, experiments can be reproduced, and new discoveries are enabled.
* Rutgers, The State University of New Jersey
+ University of California San Diego