Cancer Data Science Pulse

Semantics Series—The Role of Common Data Elements and Artificial Intelligence

This blog continues our series on semantics by identifying ways that Common Data Elements (CDEs) play a role in the context of artificial intelligence (AI). We’ll discuss ways to reduce the burden of creating training data and to make data sets more accessible to AI applications.

AI has the power to make cancer research better. We train AI models on different types of data (more on that later). But to create a good model, you need good data. You also need a lot of data, because a good model needs to study patterns to make predictions.

To prepare good data, you need to devote time and labor to identifying, cleaning, and labeling (“feature engineering”) the data. This is where Common Data Elements (CDEs) come in. If you use well-curated CDE metadata to validate, annotate, and enhance your biomedical data, you can reduce the time and labor burden and create a quality model for AI applications.

For almost 25 years, NCI has used CDEs for clinical trials data collection across all sorts of trial types and diseases. As of 2023, over 2,200 trial sites in the U.S. and abroad have used CDEs, thus standardizing data collection for over 675,000 participants in over 3,500 studies. But what exactly makes CDE metadata so great, and why are they especially important for making data ready for AI model training?

What Makes a Well-Curated CDE, and Why is it Important for Making Data Ready for AI Training?

  1. A CDE is machine readable.
    1. That’s important for training a quality AI model because the model must be able to read the data easily before it can process them.
  2. It’s rich in context and includes a semantic layer (provided by the NCI Thesaurus, or NCIt) for establishing data relationships.
    1. The NCIt concepts are particularly important because they incorporate ontological relationships. Without these, AI systems can only develop relationships that are not scientifically verified—they will lack the underpinning that would allow them to be smart enough to make an impact1, 2.
  3. It adheres to strong data management principles.
    1. It’s important to train an AI model on clean, high-quality data for the sake of better accuracy. Quality comes from governance, versioning, provenance, and integrity (for safeguarding copyrighted information).
  4. Last but not least, a CDE standardizes data, including permissible values used in data collection.
    1. That’s important for training an AI model because data that are consistent and compatible across studies minimize prep time and effort, and they help the model generate more reliable output.
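To make the “machine readable” and “permissible values” points concrete, here is a minimal sketch of how a CDE might be represented and loaded in Python. The field names, the identifier, the concept code, and the example element are all hypothetical placeholders chosen for illustration; they are not taken from an actual NCI CDE record or API.

```python
import json
from dataclasses import dataclass

# Hypothetical CDE record. Every ID, code, and field name is a placeholder.
CDE_JSON = """
{
  "public_id": "0000001",
  "version": "1.0",
  "label": "Tumor Laterality",
  "definition": "The side of the body on which the tumor is located.",
  "concept_code": "C-PLACEHOLDER",
  "permissible_values": ["Left", "Right", "Bilateral", "Unknown"]
}
"""


@dataclass(frozen=True)
class CommonDataElement:
    public_id: str            # identifier plus version support governance and provenance
    version: str
    label: str                # human-readable question text
    definition: str
    concept_code: str         # stands in for a terminology concept annotation
    permissible_values: tuple


def load_cde(raw: str) -> CommonDataElement:
    """Parse a JSON string into an immutable CDE object."""
    data = json.loads(raw)
    data["permissible_values"] = tuple(data["permissible_values"])
    return CommonDataElement(**data)


cde = load_cde(CDE_JSON)
print(cde.label, cde.version, cde.permissible_values)
```

Because the record is structured, a program can read the label, definition, and allowed values directly instead of inferring them from free text.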

What Are Three Types of Data with Which You Can Train AI Models?

Biomedical Imaging (e.g., CT scans, X-rays, pathology images, and MRIs)

AI has helped advance image analysis in cancer research and patient treatment by improving the detection of early cancer lesions. Training such AI requires a huge amount of high-quality, annotated images and associated metadata. While humans train AI by identifying areas of interest and adding annotations, Digital Imaging and Communications in Medicine (DICOM) standards (an internationally recognized standard for acquisition and electronic communication of medical images) have accelerated AI’s growth in medical imaging as evidenced by over 950 FDA-approved, AI/ML-enabled devices3.

Unstructured (e.g., imaging reports, clinician notes, or patient-entered text)

This type of data makes up about 80% of Electronic Health Records4, 5. Large Language Models (LLMs) trained on vast text data can assist in tasks like diagnoses and trial matching by recognizing patterns in unstructured text. Although LLMs like OpenAI’s GPT, Google’s Gemini, and Anthropic’s Claude may sometimes generate errors, they can still effectively disambiguate and contextualize free text. LLMs can understand scientific concepts like “proteins, molecules, DNA, and RNA”6, especially when researchers optimize the models using an approach like RAG (retrieval-augmented generation) to leverage NCI cancer-specific knowledge graphs, helping to accelerate research.

CDEs standardize question text for unstructured data, improving interoperability and learning accuracy across data from different studies. CDEs also contain terminology and ontology concepts, notably the NCIt, capturing scientific knowledge through semantic relationships like "TP53 Gene plays a role in Tumor Suppression."
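As a rough illustration of how such semantic relationships can be stored and queried, the sketch below holds them as subject-predicate-object triples. The single triple is the example quoted above; the function names and the in-memory index are assumptions made for illustration, not how NCIt itself is implemented or accessed.

```python
from collections import defaultdict

# The one triple below is the example quoted in the text.
TRIPLES = [
    ("TP53 Gene", "plays a role in", "Tumor Suppression"),
]


def build_index(triples):
    """Group (predicate, object) pairs by subject for quick lookup."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def describe(subject, index):
    """Render every known relationship for a subject as a plain sentence."""
    return [f"{subject} {predicate} {obj}." for predicate, obj in index.get(subject, [])]


index = build_index(TRIPLES)
print(describe("TP53 Gene", index))
```

Sentences produced this way could, for example, be supplied as context to a retrieval-augmented LLM.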

Structured (e.g., local terms, codes, and standard codes collected on forms using dropdowns or radio buttons and stored in tables)

This type of data accounts for the other 20% of healthcare data, but the percentage of clinical trial data may be significantly higher due to use of case report forms and CDEs optimized for data analysis7. NCI CDEs use terms and codes from NCIt, CTCAE, ICD, and other terminologies to capture structured data that can support AI models for rule-based systems such as clinical decision-making. CDEs help preprocess structured data through their use in validation, helping address issues like missing values.
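The validation step described above might look something like the following sketch, which checks one field of each record against a set of permissible values and flags missing entries. The field name, the permissible values, and the sample records are hypothetical; a real pipeline would draw them from the CDE metadata itself.

```python
def validate_records(records, field_name, permissible_values):
    """Sort record indexes into valid, missing, and invalid for one field."""
    report = {"valid": [], "missing": [], "invalid": []}
    for i, record in enumerate(records):
        value = record.get(field_name)
        if value is None or not str(value).strip():
            report["missing"].append(i)      # absent or blank entry
        elif value in permissible_values:
            report["valid"].append(i)
        else:
            report["invalid"].append(i)      # present but not an allowed value
    return report


# Hypothetical permissible values and sample rows, for illustration only.
PERMISSIBLE = {"Left", "Right", "Bilateral", "Unknown"}
records = [
    {"patient": "P1", "laterality": "Left"},
    {"patient": "P2", "laterality": ""},
    {"patient": "P3", "laterality": "left side"},
    {"patient": "P4"},
]

report = validate_records(records, "laterality", PERMISSIBLE)
print(report)
```

Separating missing from invalid values lets a curator decide whether to impute, remap, or exclude each case before training.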

However, structured data often lacks context, making manual feature engineering necessary to add textual context to help LLMs better process it. CDEs provide rich text metadata, such as labels, definitions, and concept annotations, which also enhance the meaning of structured data. CDE concept annotations provide access to synonyms and concept relationships, adding a useful semantic layer. Semi-automated processes can streamline converting structured data into unstructured text formats LLMs can easily use, reducing preparation time.
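One way to picture the semi-automated conversion described above is a small function that pairs each structured value with its CDE label and definition to produce text an LLM can read. This is a sketch under stated assumptions: the metadata dictionary, field names, and sentence template are hypothetical and would need to be adapted to real CDE metadata.

```python
def row_to_text(row, metadata):
    """Turn one structured row into text using CDE labels and definitions."""
    sentences = []
    for field_name, value in row.items():
        meta = metadata.get(field_name)
        if meta is None or value in (None, ""):
            continue  # skip fields without CDE metadata or without a value
        sentences.append(f"{meta['label']}: {value}. Definition: {meta['definition']}")
    return " ".join(sentences)


# Hypothetical CDE metadata keyed by column name.
metadata = {
    "laterality": {
        "label": "Tumor Laterality",
        "definition": "The side of the body on which the tumor is located.",
    },
    "grade": {
        "label": "Tumor Grade",
        "definition": "How abnormal the tumor cells look under a microscope.",
    },
}

row = {"laterality": "Left", "grade": None, "site_code": "X1"}
text = row_to_text(row, metadata)
print(text)
```

Fields with no matching metadata are skipped here rather than guessed at, which keeps the generated text limited to what the CDEs actually describe.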

In Summary

CDEs play a pivotal role in enhancing AI effectiveness by providing rich metadata, improving data semantics, consistency, and quality. A rich semantic layer helps AI models form deeper connections and make inferences beyond those expressed explicitly in the data.


Footnotes

1 Seth Earley and Josh Bernoff, “Is Your Data Infrastructure Ready for AI?” Harvard Business Review, 2020.

2 While utilizing ontology concepts as CDE annotations aids data comprehension and AI model training, ontologies are inherently narrow due to their domain-specific nature. For example, over 1,100 biomedical ontologies in BioPortal encode knowledge in specific, sometimes overlapping, domains. Although top-level ontologies like Basic Formal Ontology (BFO) aim to coordinate ontology development, numerous overlapping ontologies persist. In contrast, independent concepts from standard terminologies like NCIt offer broader insights and linkages across domains, avoiding predesigned restrictions of concept meaning.

3 U.S. Food and Drug Administration, “Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices,” Updated August 7, 2024. (Accessed September 13, 2024.)

4 Kong, Hyoun-Joong, “Managing Unstructured Big Data in Healthcare System,” Healthcare Informatics Research, 2019.

5 EMERSE, “Data in Free Text.” (Accessed September 2024.)

6 Elastic Search, “What are large language models (LLMs)?” (Accessed September 2024.)

7 IgniteData, “How structured data is used in clinical trials,” September 2023. (Accessed September 2024.)

Science Program Analyst, NCI Center for Biomedical Informatics and Information Technology