Data Quality for LLMs: Building a Reliable Data Foundation

Seminar Series

April 24, 2024 11:00 a.m. - 12:00 p.m. ET

Watch the Recording

If you use large language models (LLMs) in your cancer research, register for this seminar to hear Elucidata’s Dr. Abhishek Jha discuss how data quality impacts LLM performance.

A reliable data foundation, one that is well annotated and accessible to an LLM, plays a major role in the value of the model’s results.

You’ll see examples of how LLM-powered artificial intelligence (AI) agents return differing results when querying three versions of the same gene expression corpus:

unstructured data from the public repository Gene Expression Omnibus.
structured data from the Crowd Extracted Expression of Differential Signatures project (a tool developed by the Ma’ayan Lab at the Icahn School of Medicine at Mount Sinai).
clean, linked, and harmonized data.

Dr. Jha will use these examples to discuss how differences in quality across these data sources affect LLM performance.

Abhishek Jha, Ph.D.

Dr. Jha is the co-founder and CEO at Elucidata. Previously, he was a senior scientist at Agios Pharmaceuticals, which has successfully brought three first-in-class drugs for acute myeloid leukemia patients to the market. His experience at Agios inspired him to build Elucidata. Elucidata’s mission is to help scientists save time on routine data and machine learning operations tasks so they can shift their focus to high-value research. This eventually helps patients receive drugs sooner. Elucidata is building technology and solutions for R&D teams to better leverage data and reduce their drug development timelines. Dr. Jha is an alumnus of Massachusetts Institute of Technology, the University of Chicago, and the Indian Institute of Technology Bombay.