Metadata Services for Cancer Research

Here at NCI, we use shared data standards based on “metadata” to ensure mutual understanding and consistent data formatting, which increases the value of all cancer data. These standards promote data sharing and aggregation across research data repositories, which helps make data “FAIR” (findable, accessible, interoperable, and reusable).

To help the oncology research community create, manage, and use metadata, we established the cancer Data Standards Registry and Repository (caDSR) infrastructure in 2000. A newer infrastructure—caDSR II—took its place in May 2022.

Use this webpage to learn more about the elements that make up cancer metadata, the valuable caDSR II infrastructure, and the resources for supporting the community that uses it.

Metadata Content

Metadata content consists of Common Data Elements, Case Report Forms, Data Collection Templates, and Data Models hosted in caDSR.

Common Data Elements (CDEs)

What are CDEs? A CDE binds a research question with its allowed responses, defining the precise meaning consistently and commonly among different groups and clinical trials. CDEs contain information (metadata) about the actual piece of data collected, making the data consistent and both human and machine readable.
Why are they important? They code the data collected (making harmonization easier), and they facilitate aggregation and analysis of data from different groups and trials. CDEs also speed a study’s start time, decreasing cost and effort.
How does NCI use them? We primarily derive our CDEs from data collection forms, templates, and user data dictionaries produced to support NCI clinical trials. Healthcare and scientific research communities host CDEs derived from Data Models.
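
The idea of a CDE binding a question to its allowed responses can be illustrated with a minimal sketch. The class, field names, identifier, and permitted values below are all hypothetical and greatly simplified; they do not reflect the actual caDSR II schema or any registered CDE.

```python
from dataclasses import dataclass


@dataclass
class CommonDataElement:
    """Hypothetical, simplified CDE: a question bound to its allowed responses."""

    public_id: str                    # illustrative ID, not a real caDSR identifier
    question: str                     # the research question being asked
    permitted_values: dict[str, str]  # stored code -> human-readable meaning

    def is_valid(self, response: str) -> bool:
        """Return True if the response is one of the permitted codes."""
        return response in self.permitted_values

    def meaning(self, response: str) -> str:
        """Translate a stored code into its human-readable meaning."""
        return self.permitted_values[response]


# Example CDE with made-up content.
laterality = CommonDataElement(
    public_id="EXAMPLE-0001",
    question="On which side of the body is the primary tumor?",
    permitted_values={"L": "Left", "R": "Right", "B": "Bilateral"},
)

# Two studies that share the CDE can pool their coded answers directly.
study_a = ["L", "R", "L"]
study_b = ["B", "L"]
pooled = [
    laterality.meaning(code)
    for code in study_a + study_b
    if laterality.is_valid(code)
]
print(pooled)
```

Because both studies record the same codes for the same question, their data can be combined without a separate mapping step, which is the harmonization benefit described above.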

Case Report Forms (CRFs) and Data Collection Templates (DCTs)

How are CRFs built, exported, and accessed? Researchers use the caDSR II web application to build CRFs and DCTs using CDEs. Export formats include Excel spreadsheets, XML documents, Medidata Rave Architect Loader Specifications (ALS), and REDCap® Data Dictionaries. CRFs are also accessible via REST application programming interfaces (APIs).
What can CRFs be used for? Researchers use CRFs or other DCTs to collect or acquire data from cancer research studies, aggregated cancer research data, data warehouses, and commons. In data management systems, these CRFs and DCTs collect data in a standardized way across studies. Researchers can import a REDCap-format file to directly configure a REDCap system for data collection. In Medidata Rave, researchers can import the exported forms to create a library for developing a new study for data collection.
What do caDSR II CRFs include? They include standards for all NCI-sponsored clinical trials, Patient-Reported Outcome measures, Eligibility Criteria, and other research instruments. They also include data elements in many categories such as demographics, treatments, labs, disease responses, and pathology.

NCI Standard CDEs and Template Forms

NCI data collection standard CDEs and template forms are embedded in all NCI clinical trials. The standard CDEs make it easier to review the safety, efficacy, and administrative data from ongoing NCI-funded clinical trials. This NCI standard core library allows faster initiation of new trials by reducing the time spent developing a data collection strategy per trial, which improves the safety and delivery speed of new and improved oncology treatments to cancer patients.

Background

The recommendations of the 2004/2005 Clinical Trial Working Group influenced the creation of NCI standard CDEs and template forms. In 2017, the U.S. Food and Drug Administration (FDA) announced national guidance that required researchers/organizations to submit all Investigational New Drug (IND) trials using the Clinical Data Interchange Standards Consortium (CDISC) reporting standard: the Study Data Tabulation Model (SDTM). NCI engaged in a complex, 5-year harmonization effort that included adopting the CDISC Clinical Data Acquisition Standards Harmonization (CDASH) model for all trials. This harmonization effort eased the institutional burden of transformation to CDISC SDTM when submitting trial data to the FDA. As part of that effort, NCI aligned the NCI data collection standards and template forms with the CDISC CDASH and SDTM models. The primary focus of this activity was not to change the existing NCI data collection standard CDEs and template forms. Rather, it was to create a second version aligned with the CDISC standards so that study builders can easily map the NCI standard CDEs to CDISC variables for FDA submission of IND trial data sets in SDTM format.

Models

Information Model
- What is it? A software engineering representation of the concepts about specific real-world entities in a subject domain (e.g., cancer research and clinical care). It describes the concepts and their relationships, constraints, rules, and the operations that can be performed. For example, it can represent the fact that a patient has a medical history. Information Models do not address how data should be stored or validated.
- Who uses it? Anyone wanting to understand the relationships between entities in the domain covered by the Information Model. Also, developers can use these models to inform the design of Data Models and software systems.
- What’s a good example of one? The Biomedical Research Integrated Domain Group (BRIDG) Model is a good example of an Information Model. It provides a shared view of the dynamic and static semantics for basic, pre-clinical, clinical, and translational research and its associated regulatory artifacts.
Data Model
- What is it? A design for data storage organizing data elements that may be found in one or more Information Models. It describes the entities, such as tables and columns, for storing data that conform to the Data Model. Your use cases influence the data model’s design, including how to access, query, and analyze the data.
- Who uses it? Data engineers use Data Models to create databases and software systems that will collect and store data. Implementation of a data model can be in one or more representation formats (such as SQL, RDF, or XML).
- Why is it important? In addition to the use of CDEs, the use of a data model (regardless of its physical representation) helps ensure that data using the same data model is interoperable.
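
As a rough illustration of a Data Model implemented in one representation format (SQL), the sketch below defines a single table whose column accepts only a fixed set of codes. The table name, column names, and codes are invented for this example and are not taken from any NCI or caDSR II model.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Hypothetical table: one entity with two columns. The CHECK constraint
# restricts the laterality column to an agreed set of codes.
conn.execute(
    """
    CREATE TABLE tumor_assessment (
        patient_id TEXT NOT NULL,
        laterality TEXT NOT NULL CHECK (laterality IN ('L', 'R', 'B'))
    )
    """
)

# A record that conforms to the model is stored.
conn.execute("INSERT INTO tumor_assessment VALUES (?, ?)", ("P001", "L"))

# A record that does not conform is rejected by the database.
try:
    conn.execute("INSERT INTO tumor_assessment VALUES (?, ?)", ("P002", "Left"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute(
    "SELECT patient_id, laterality FROM tumor_assessment"
).fetchall()
print(rows, rejected)
```

Any system that stores data against the same table definition produces rows with the same shape and the same codes, which is what makes data sets built on a shared Data Model straightforward to combine.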

Cancer Data Standards Registry and Repository (caDSR II) Infrastructure

What is it? caDSR II consists of a database, APIs, and web-based tools for creating and using data standards for cancer research.
Why is it important? The infrastructure supports the official registration and management, such as versioning, of NCI standards for CDEs, CRFs, and Data Models used in NCI clinical trials. It also serves as a repository for oncology-related CDEs, CRFs, and models used in other cancer and healthcare research studies (e.g., the Observational Medical Outcomes Partnership [OMOP] Data Model).

caDSR II Technical Details

The data model for caDSR II is based upon the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) 11179 Metadata Registry standard (ISO/IEC 11179).
The Metadata Standards working group within the ISO and IEC developed the conceptual model that defines fields and relationships for 11179 metadata registries. This conceptual model allows registries worldwide to talk to each other and facilitates the reuse of content across similar infrastructures.
The underlying software framework for caDSR II is based on a commercial off-the-shelf product from Software AG. NCI uses webMethods OneData components and integration servers. OneData includes an Oracle database design based on the ISO/IEC 11179 Metadata Registry standard.
Per ISO/IEC 11179, the caDSR II infrastructure uses Concepts to annotate both the semantic part of each CDE and the semantics of each permitted data value. The Concepts used in caDSR II come primarily from the NCI Thesaurus (NCIt) provided by NCI Enterprise Vocabulary Services (EVS). Optionally, when loaded into caDSR II, subsets of NCIt or external terminologies (such as MedDRA or UCUM) can help annotate permitted values. The concepts help to organize individual CDEs into semantically similar collections. This makes the CDEs easier to find and use in logical ways and can support interoperability and data transformation. The collections can be specific (e.g., a group of CDEs related to pediatric cancer genomics) or general (e.g., those relating to diagnoses or therapies). Registering and reusing CDEs in NCI trials enables consistent data collection so that data coming out of trials is commonly structured for analysis. This makes it possible to aggregate data across data sets.
From the caDSR II homepage, you can browse, search, and export CDEs in several ways. You can search for content using various filters and then view and export CDEs, CRFs, and models in various formats using the Download Collections feature, which is also hyperlinked on the caDSR II homepage.
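
To show how concept annotations can organize CDEs into semantically similar collections, here is a minimal sketch. The class, CDE names, and concept codes are placeholders invented for illustration; they are not real NCIt codes, and this is not how caDSR II itself is implemented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AnnotatedCDE:
    """Hypothetical CDE carrying the concept codes that annotate its meaning."""

    name: str
    concept_codes: tuple[str, ...]  # placeholder codes, not real NCIt codes


def group_by_concept(cdes: list[AnnotatedCDE]) -> dict[str, list[str]]:
    """Build collections of CDE names keyed by each annotating concept."""
    collections: dict[str, list[str]] = defaultdict(list)
    for cde in cdes:
        for code in cde.concept_codes:
            collections[code].append(cde.name)
    return dict(collections)


cdes = [
    AnnotatedCDE("Primary Diagnosis Name", ("CONCEPT-DIAGNOSIS",)),
    AnnotatedCDE("Diagnosis Date", ("CONCEPT-DIAGNOSIS", "CONCEPT-DATE")),
    AnnotatedCDE("Therapy Start Date", ("CONCEPT-THERAPY", "CONCEPT-DATE")),
]

collections = group_by_concept(cdes)
print(collections["CONCEPT-DIAGNOSIS"])
print(collections["CONCEPT-DATE"])
```

A CDE annotated with more than one concept appears in more than one collection, so the same element can be found from a general collection (dates) or a more specific one (diagnoses).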

caDSR APIs

APIs provide another way to access caDSR II content.

The caDSR II APIs are REST APIs that retrieve content from the caDSR II database. Whether you’re a user or an application developer, you can also use an HTTP interface to query content from a web browser. After selecting search parameters and submitting the query, you can copy the link from the browser’s address bar into your application to repeat the same query. The interface returns content in HTML, XML, and JSON formats. Learn more about caDSR II APIs and see examples. You may also find links to the API Portal at caDSR II API Swagger.
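
The following sketch shows the general pattern of encoding search parameters into a link that can be reused to repeat a query. The base URL, endpoint path, and parameter names (`name`, `format`) are placeholders and are NOT the real caDSR II API; consult the caDSR II API documentation for the actual endpoints and parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint for illustration only; not a real caDSR II address.
BASE_URL = "https://example.org/hypothetical-cadsr-api/cdes"


def build_query_url(search_params: dict[str, str], response_format: str = "json") -> str:
    """Encode search parameters and a response format into a reusable link."""
    params = dict(search_params)        # copy so the caller's dict is untouched
    params["format"] = response_format  # hypothetical parameter name
    return f"{BASE_URL}?{urlencode(params)}"


url = build_query_url({"name": "tumor laterality"})
print(url)

# The link carries the whole query, so parsing it recovers the parameters.
recovered = parse_qs(urlsplit(url).query)
print(recovered)
```

Because the full query lives in the URL, saving the link is enough to rerun the same search later from a browser or from application code.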

Centralized Curation Services and Support

caDSR II is a publicly accessible, community-based resource with embedded best practices and governance processes. Are you a member of the caDSR user community? If so, remember that these important resources are at your disposal:

To take the burden off you, the NCI CBIIT Centralized Curation Team helps you register the details of CDEs in caDSR II and promotes standardization and reuse of registered data elements across the community.
How can I get help? You can find information on the NCI caDSR II Metadata Management Knowledge Base Portal. This includes training, documentation, and user guides that assist with centralized curation CDE requests, caDSR II navigation, caDSR II Helpdesk navigation, and form creation/management, as well as answers to frequently asked questions and the information you need to obtain additional support.
Who can help me? You may contact the team that provides metadata development services. The team is available to support organizations with the curation process and tool usage.
Are there others who can help? caDSR II community members are stakeholders who make up the caDSR Content User Group. The group holds quarterly meetings to provide updates on practices that impact the caDSR II metadata content, share best practices, vet business rules, participate in governance processes, answer user questions, and resolve issues. Sign up to be added to the caDSR listserv (caDSR.RA) and receive emails about urgent issues and notifications on common projects.

Updated: Aug 24, 2023