Cancer Data Science Pulse
Spotlight on CBIIT Staff: Sherri de Coronado
In this new blog installment, the NCI Center for Biomedical Informatics and Information Technology (CBIIT) shines a spotlight on the staff who are working to turn data and IT resources into solutions for addressing data-driven cancer research. Look for new staff spotlights in the coming months as we explore the varied backgrounds and expertise that go into building, maintaining, and improving the infrastructure and use of Big Data at NCI.
Our newest “Spotlight” features Sherri de Coronado, program manager in the Cancer Informatics Branch at CBIIT. Much of her work centers on making data well described and compatible (or harmonized), allowing researchers and scientists to compare the results of different studies and to integrate different types of findings (on genomics, proteomics, comparative oncology, and more).
You wear several hats here at CBIIT. You oversee the Semantic Infrastructure Group and you also are involved in the Center for Cancer Data Harmonization. Can you expand on these initiatives?
Yes, as noted, we use the term “semantic” in our title. Semantics is the study of meanings. So semantics can help people clarify and set specific standards for terms used to describe data. Our Semantic Infrastructure (or SI) group provides guidance and resources to help users describe their data in ways that make the information more understandable and compatible.
Our team has worked on a variety of semantic resources, including the Cancer Data Standards Registry and Repository (caDSR), which standardizes data elements and case report forms from users in the research community. We worked with elements and forms to align them with the Clinical Data Interchange Standards Consortium’s (CDISC’s) standards. We also worked to standardize terminologies, such as those in the NCI Thesaurus, Gene Ontology resource, and lab codes (e.g., the Logical Observation Identifiers Names and Codes or LOINC).
Our SI group also is working closely with a newly launched NCI initiative, the Center for Cancer Data Harmonization (CCDH). This 3½-year effort will support the semantic needs of the Cancer Research Data Commons (CRDC). The CCDH will work with the CRDC repositories and data coordination centers to develop a harmonized model that supports analysis of data from across the CRDC, providing services to help them adopt or adapt semantics where appropriate.
What do you consider significant challenges for people working in the field of Big Data/bioinformatics/information technology?
Physics and other fields have figured out approaches to collaborating, collecting, and sharing data more quickly than biology and medicine, but we have the added complexities of many, many data types (genomic, metabolomic, proteomic, clinical, epidemiological, and more). Most importantly, we use human data, which creates very difficult privacy-related issues for data management and use.
The Human Tumor Atlas Network (HTAN) is a microcosm of these issues. Its researchers collect many types of data across different scales (single cell, bulk tissue, images at various scales and resolutions) and from different organisms. Those data sets may also include population data, individual patient characteristics, and other clinical and nonclinical variables. The HTAN user community needs to be able to understand and make use of (and compute on) the data they collect, and then they need to be able to share those findings. CRDC has similar needs.
It all comes back to standards. There are not enough and, at the same time, there are too many. Many kinds of data-related standards are being developed and adopted. Not just semantic content standards, but standards for documenting and executing workflows, standards for health data transmission and interoperability (like the Fast Healthcare Interoperability Resources or FHIR), standards for documenting images . . . . The list seems to go on forever. Standards are incredibly important for making data FAIR (that is, findable, accessible, interoperable, and reusable), but it’s also incredibly time-consuming to develop and adopt standards, and to keep pace as the science changes.
When you reflect over your career, is there a particular area where you feel things have changed significantly, or where you’ve had a chance to really make a difference?
I started at NCI around 1993 in the Office of Science Policy and Assessment (OSPA) as a program and policy analyst. At that time, I worked for Cherie Nichols, who hired me to be a part of the team. She was one of my earliest and most influential mentors, and she continues to have an active part in the NCI Cancer Moonshot℠ today.
Much of my early work at NCI centered on planning and information gathering, e.g., for reporting, planning initiatives, and developing the annual Bypass Budget. It became apparent to us that a lot of information gathered across NCI was used one time and lost once it was reported because different programs were tagging or coding their research portfolios in different ways, depending on what was most useful for them. We also found that the NCI Division of Extramural Activities carefully coded incoming grants for required reporting purposes, but, as with other programs, its system worked best for its own needs and wasn’t easily usable across the Institute.
The idea for the Enterprise Vocabulary Services (EVS), a project I helped originate and have led for several years now, grew out of the need to track the NCI research portfolio consistently over time. Then, gradually, scientific needs emerged, and our focus shifted to biology (basic, translational, and clinical research).
John Silva, another of my mentors, had a great vision of how semantics might revolutionize how we collect data from clinical trials (i.e., enabling the use of care data from electronic health records), and that is starting to come to pass. Originally, John and NCI supported work on a Stanford tool called Protégé as a way of standardizing protocol development. It is now used widely for editing terminologies and ontologies, including by NCI. So, something we helped support has had a major impact on the health research community and beyond. Today NCI is also one of the leaders at NIH in standing up and using metadata repositories. Semantics is an important part of making data FAIR and is now well known across NIH and beyond. Semantics is sure to continue to be an integral part of NCI’s work going forward.
Ultimately, we hope to make it so easy for users to prepare and submit data that they won’t even realize all that goes into making data FAIR. That is, semantics will be done smartly, under wraps, and without a lot of added work. Someday there will be semi-automated mapping of incoming data sets to the appropriate target data dictionary. And search tools will understand the semantics of the data and return relevant information about those data or perhaps point to the data themselves. We also should have the ability to ask and answer increasingly sophisticated scientific questions using semantics, AI, and the computational and algorithmic power of cloud-based data resources. I think CBIIT and CCDH will be important in leading the way on this by providing services, tools, guidance, and best practices to the community that will make it easier to accomplish those goals over time.
You’re planning to retire in just a few months. Are there activities you’re most looking forward to after retirement?
I greatly enjoy and am awed by the science that NCI supports and believe absolutely in NCI’s mission. I’ll miss the important work taking place at CBIIT. Still, I have a long list of activities queued up! I’m excited to plan a month-long vacation to Portugal to walk the Camino de Santiago. And I want to spend more time with my already retired friends. I also want to get more involved in the Sonoma Land Trust or similar local organizations to ensure the areas I enjoy so much are preserved for future generations. And I’d like to improve my pottery-making skills.
I feel that the data science field is on the verge of some wonderful advances, and I’m happy I was able to contribute to its progress. The next 10 years should yield amazing results as we get even better at reusing and aggregating both large and small data sets to further inform research and improve public health.