Cancer Data Science Pulse

The Protein Data Bank and the Importance of Sustaining Primary Data Archives

Scientific discovery involves collecting and analyzing data, and communicating the new knowledge that arises from it. What happens, though, when someone wants to repeat an experiment, or build on an existing approach? For this to happen, sufficient information must be in the public domain, and the data must be accessible and understandable to the scientist.

Some journals allow publication of supplementary data, but such provisions are far from ideal. For example, there are no easy ways to search these data. Today, best practices call for depositing primary data and metadata into a domain-specific archival repository prior to publication. These repositories must be trusted, stable, secure, and open access, so that others can build on previous work, reproduce findings, and advance the scientific enterprise.

Structural biologists understood the importance of archiving primary data when they established the Protein Data Bank (PDB) in 1971 with just seven protein structures. Since 2003, the Worldwide PDB (wwPDB) partnership has managed the PDB archive, ensuring that experimental data and metadata are expertly curated, validated, and made freely available. The wwPDB is currently made up of:

Each wwPDB partner organization is supported by local funders, thereby spreading the financial burden of maintaining the PDB archive across different geographies. Looking ahead, additional wwPDB members could be recruited from Asia and Latin America as research and development spending rises in various emerging economies.

PDB statistics and usage are tracked to gauge the archive's impact on science.

PDB Archive size: 122,000 structures; over one billion non-hydrogen atom coordinates
PDB Archive growth rate: Approximately 11,000 new structures each year
Data downloaded: More than 500 million downloads of structure data per year
Literature citations: Over 20,000 since the year 2000
Unique users: Over 1 million per year
Areas of biology and medicine featured in the PDB archive: All of them!
New science enabled by the archive: Structure-based drug discovery; protein design; protein structure prediction; etc.

What about the statements made by funders that there are insufficient monies available to sustain the many scientific data resources available today? This is undoubtedly true, but we should be careful to distinguish "archival repositories" for primary data from "derived databases" that aggregate information from other sources. Without primary data archives, valuable derived data resources (e.g., model organism databases) could not function, and entire areas of scientific inquiry could disappear.

The NIH and the academic community can take concrete steps to ensure this does not happen. Primary data archives must be sustained long-term. Funding needs to be provided at levels that ensure quality service to Depositor and User communities alike. Ongoing infrastructure investments must support 24/7/365 operations with global reach, full data integrity and security, quality control, and provisions for periodic hardware/software upgrades. Skilled personnel with necessary domain expertise must be recruited, trained, and adequately compensated to ensure they are not recruited away to big data companies.

New review mechanisms for data resources need to be established that focus on these requirements, which are necessarily distinct from the criteria used to review research grants. In addition, new business models are urgently required for funding primary data archival repositories. Federal agencies should reserve a percentage of their annual budgets to support archival data repositories. From this set-aside, funds could be allocated to individual resources at levels reflective of usage and impact metrics, with ongoing, rigorous review of the management team.

A strengthened commitment to sustaining existing archival data repositories and establishing new ones will help to support the global scientific enterprise by ensuring that primary data are safeguarded, experiments can be reproduced, and new discoveries are enabled.

Structure of the Abl protein kinase catalytic domain bound to a first-generation therapeutic, imatinib (PDB ID 1IEP; orange), which inspired medicinal chemists at Novartis to design an even more effective second-generation agent, nilotinib (PDB ID 3CS9; green). Both are U.S. FDA-approved drugs used for treatment of chronic myeloid leukemia.

* Rutgers, The State University of New Jersey

+ University of California San Diego

Helen M. Berman, Ph.D.
Director Emerita, RCSB Protein Data Bank,
Stephen K. Burley, M.D., D.Phil.
Director, RCSB Protein Data Bank