Your Source Code is Your Data: Data Engineering for Medical Imaging Research in the Era of AI
The success of an AI system depends on the amount and quality of data used to train it. The database that was key to the latest AI revolution (ImageNet) contains millions of real-life images labeled into thousands of categories. No data collections of comparable extent and quality exist for radiology data. Many consider this the biggest challenge for AI in radiology. Training AI models requires medical images accompanied by metadata and expert annotations (e.g., spatial location of the finding, its clinical characteristics), ideally linked with the non-imaging part of the patient record (e.g., biopsy results, genomic and blood serum tests). Large volumes of clinical images are routinely collected, interpreted visually and analyzed quantitatively, both in clinical and research studies.
Nevertheless, the result is often optimized for reuse by a human, not an algorithm. Tremendous effort is often needed to prepare datasets for AI training, to combine datasets across sites or collections, or to aggregate the diverse datasets that are often required to develop robust models. With the recent advances in automated imaging-based tissue phenotyping (radiomics) and other relevant AI technologies, there is a growing recognition of the value of large, structured, AI-ready datasets.
There are many obstacles and few incentives for engineering datasets to optimize machine-level reusability. Non-technical issues aside, the major challenges include choosing a data format, defining a data model, deciding which attributes of the data may be valuable for unforeseen future use cases and how those can be captured in a structured and self-documenting manner, and identifying practical tools to help with those tasks. Over the past five years, we have directed our efforts to incrementally and collaboratively advance data engineering practices as applied to medical imaging research. We are extending the existing, broadly adopted DICOM standard to support the needs of medical imaging research applications and subsequent implementation in clinical systems. We develop open source tools that enable standardization of common outputs of image analysis. We have established collaborations with a number of academic and industry groups to encourage, support and evaluate adoption of the standard. We have been leading efforts in training and outreach, aiming to educate the community about the capabilities of the standard and the supporting tools. In parallel with developing support for the generic data types commonly encountered in imaging research, we are also working on targeted solutions for the specific research workflows of interest in several cancer types.
In this talk, I will discuss our progress to date in developing the ecosystem of standards, tools, use cases, datasets, publications, and outreach activities that have the overarching goal of improving data engineering practices. I will also present some of our ongoing work developing integrated technology solutions that are used to support clinical research at our site, and the role of data as the backbone of downstream innovation.
Andrey Fedorov is an Assistant Professor in Radiology at the Surgical Planning Laboratory (SPL), Department of Radiology, Brigham and Women's Hospital and Harvard Medical School. Andrey joined SPL in 2009 after obtaining his Ph.D. in Computer Science from The College of William and Mary in Virginia. His research is in translation and validation of medical image computing technology in clinical research applications, with a focus on quantitative imaging, imaging informatics and image-guided interventional procedures. Andrey is committed to advancing the role of reproducible science, data sharing and open source software in academic research. He has contributed to a number of open source projects, most notably 3D Slicer (http://slicer.org). Together with Ron Kikinis, he is a co-PI of the Quantitative Image Informatics for Cancer Research (QIICR) project (http://qiicr.org), which focuses on developing open source informatics technology in support of quantitative imaging biomarker development and on interoperable sharing of imaging biomarker data using the Digital Imaging and Communications in Medicine (DICOM) standard.