Cancer Data Science Pulse

Did the Machine Get it Right? Learning to Trust Neural Networks in Medical Applications

The December 18 webinar has passed, but a recording is now available on the event page.

On December 15, Ms. Aya Abdelsalam Ismail will present the next Data Science Seminar, “Interpretable and Explainable Deep Learning.” In this blog, Ms. Ismail examines deep learning and the importance of developing models we can trust to make critical patient-care decisions.

You’ll be discussing the topic, “Interpretable and Explainable Deep Learning,” in the upcoming webinar. Can you give us a brief summary of this topic? 

Deep learning (DL) has been deployed in many applications. It’s being used in our day-to-day lives, ranging from virtual assistants and personalized shopping to facial recognition apps. DL also holds promise for the medical field. But before we deploy this technology to help patients, we first must be sure it’s truly reliable and that we can trust it to make accurate decisions. How do we make sure the models we are building are actually making the decisions that we want them to make?

DL models rely on a “black box,” the area of the system where inputs and operations aren’t visible to the user but instead are self-directed by the model. That is, data are interpreted without our full understanding of how those “decisions” are made. I’ll describe how we can better understand these models and the steps they follow in arriving at these decisions. Knowing more about these models will enable us to see when they are not performing well. I’ll also describe some of the tools we can use to improve the models we are developing.

How did you become interested in this topic?

I have always been drawn to the use of DL in medical applications, especially as a means for following a disease or disorder over time. I started by investigating how the human brain works while performing a simple task. To do so, I developed a model to classify tasks based on brain images over time. After achieving good classification accuracy, it was time to understand which areas of the brain were responsible for doing different tasks. To do this, we used saliency maps, which are produced by the DL model to highlight the regions in the brain the model used to make the prediction.
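The idea behind a gradient-based saliency map is simple: take the gradient of the model’s output with respect to its input, and the magnitude of each component tells you how strongly that input feature influenced the prediction. As a minimal sketch (not Ms. Ismail’s actual model), here is the calculation for a toy logistic model over a handful of hypothetical “brain regions”:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def saliency(w, x):
    """Gradient-magnitude saliency for a logistic model p = sigmoid(w . x).

    The gradient with respect to the input x is p * (1 - p) * w; its
    absolute value indicates which input features (e.g., brain regions)
    most influenced the prediction.
    """
    p = sigmoid(w @ x)
    return np.abs(p * (1 - p) * w)

# Toy "scan" with 5 regions; the model's weights attend only to region 2.
w = np.array([0.0, 0.0, 3.0, 0.0, 0.0])
x = np.array([0.2, 0.1, 0.9, 0.4, 0.3])
s = saliency(w, x)
print(s.argmax())  # index of the most salient region
```

In a real deep network the gradient is computed by backpropagation rather than this closed form, but the interpretation of the resulting map is the same.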

Unfortunately, this didn’t work. More importantly, in trying to understand why it didn’t work, we realized the methods used to explain how the model was performing weren’t accurate. That is, the explanations for why the model produced a certain prediction didn't make sense.

At first, we thought it was something we did wrong. Perhaps we made errors in recording the data or in programming the model. But after double and triple checking our work, we realized that it wasn’t our error. It was the model’s interpretation. The model assigned saliency incorrectly to different areas. Although the model accuracy was high, the explanations were incorrect. And despite the fact that it’s a common model and used routinely across the informatics field, no one had previously questioned it.

This was the basis of our first paper, “Input-Cell Attention Reduces Vanishing Saliency of Recurrent Neural Networks,” which underscored the fact that these models are not interpretable, and we need to find better ways to understand them. This led to a series of papers looking into these models. We wanted to find out when the models work, when they don’t, and what we can do to fix them. In a sense, this very wrong turn led to where I am today.
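The “vanishing saliency” effect the paper describes can be seen even in a toy recurrent model: because the gradient must pass backward through every timestep, saliency assigned to early inputs shrinks geometrically. The following sketch (an illustrative scalar RNN, not the paper’s experimental setup) makes the decay visible:

```python
import numpy as np

# Toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t)
rng = np.random.default_rng(0)
T, w_h, w_x = 50, 0.5, 1.0
x = rng.normal(size=T)

# Forward pass, storing every hidden state
h = np.zeros(T + 1)
for t in range(T):
    h[t + 1] = np.tanh(w_h * h[t] + w_x * x[t])

# Saliency of the final state with respect to each input x_t:
# dh_T/dx_t = w_x * tanh'(.) * prod over later steps of w_h * tanh'(.)
sal = np.zeros(T)
grad = 1.0  # dh_T/dh_T
for t in reversed(range(T)):
    local = 1 - h[t + 1] ** 2         # derivative of tanh at step t
    sal[t] = abs(grad * local * w_x)  # saliency assigned to input x_t
    grad *= w_h * local               # push the gradient one step back

# Early inputs receive vanishingly small saliency compared to recent ones
print(sal[0], sal[-1])
```

Whatever actually happened at the start of the sequence, the map can barely register it, which is exactly the failure mode that makes such explanations untrustworthy for time series data.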

Your work is rooted in neuroscience. Does it also have relevance for other fields, such as cancer?

I began by looking at ways to forecast whether a patient would develop Alzheimer’s disease. I hoped to be able to determine a person’s risk for disease years before the clinical signs appeared. Although this work began as a neuroscience application—looking at the brain’s electrical activity as it relates to function and assigning neurometrics—this DL model is relevant to any area where data have a sequential element (that is, where they proceed in a step-by-step fashion over time).

The challenge of using DL can be traced to the black box, which is an area that we can’t readily see and explain. Especially when designing medical applications, which may require life-saving decisions, we simply can’t rely on these models without thoroughly understanding them. We need to turn the black box into a white box.

The problem is that many of the methods in use today for evaluating these models aren’t well understood, primarily because we aren’t able to interpret models that deal with sequential data. As our models are designed now, we skip to “Step B” before we fully understand how “Step A” works. I wanted to be able to apply DL to cases where a patient is followed over time and where time series data are very important to making a prognosis. To do that, we need to know why the model is making the decision at every step in the process. 

This is especially relevant for the cancer field. Images are obtained from patients, often at different points in time. Those images are scanned, and the data are fed into models for making a diagnosis. It’s important to know how the model came to a particular decision, at a particular time, if we are to arrive at the right prognosis. 

As an example, in looking at one model that returned inaccurate results, we realized it was reflecting an artifact from the scanner the investigators used to acquire the images. When a different scanner was used, the model returned very different results. This illustrates how important it is to determine if a model is performing well because it’s actually identifying, for example, people with cancer, or if the findings are the result of something completely unrelated, such as an artifact in the scanner or an outlier in the data.

Sometimes changing just one seemingly insignificant parameter can lead to inaccuracies. We need to ensure that the model is making a decision because it’s correct, and we need to trace that decision to a medical reason so a physician will be able to clearly recognize, understand, and agree with the evaluation.

I’ve been particularly fascinated by DL’s potential use in cancer care because my father is an oncologist. Some potential applications in this area include building a model to detect cancer, predicting who might develop cancer, or identifying a drug that’s most likely to be effective in a particular patient.

If we can’t explain how a model arrives at a decision, we shouldn’t be using it. It’s imperative that we are able to trust the model, especially when interpreting data that helps to inform patient care.

Where do you see this topic headed in the next 5–10 years? What is your hope for this technology?

This area is very theoretical right now. I don’t mean this in the sense that DL doesn’t have medical applications; it does. But the models deployed today can’t be used to replace decisions made by physicians. The models simply are not reliable enough. Once we can explain the models and trust them, we’ll be able to use them for widespread applications.

You might compare this to early airplane travel. In the beginning, airplanes were used mostly to move cargo around the country. Then, as air travel got safer, the airlines shifted to passengers. Today, we go everywhere, all the time, by airplane. We are in the “cargo” stage with DL in medical applications. We need to ensure these models are safe before they can be deployed in clinical practice without the fear of adverse consequences for patients.

Who should attend the webinar? 

People who are using or implementing DL models in their research and those who have at least some familiarity with DL will likely find it most useful. They will be more likely to have encountered some of the issues I’m going to address. I hope the seminar will help them understand the limitations of the models they are building and offer tools to help them overcome those limits. 

Were there any surprises that you encountered in your work on this topic?

Many people have been using these models. We know now that the methods being used to explain the models are not always correct. So although these models have been in widespread use, no one has been able to understand if the models are making the right decisions for the right reasons. When they are not accurate, we need to be able to say why. 

Were there particular challenges that you had to overcome?

I think there were two key challenges. First, in looking into medical applications, I needed to sift through a lot of medical papers. I’m a computer scientist, so learning to interpret some of the nuances in these papers was challenging at first. And second, I’ve found that medical data are often corrupted or filled with meaningless information. For example, fMRI images can produce data that are very “noisy” (filled with artifacts based on how the patient was positioned or the times at which the scan was made) and processing the data is not a trivial task. I thought I was going to be able to take the data and push the information into a neurometric and that would be it. This turned out to be much more complicated. I had to learn how to remove the subject variance from each scan. Removing those individual data variations was vital so that the data could be integrated and analyzed in the same way. 
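One common way to remove per-subject variance is to standardize each subject’s scan independently, so that differences in baseline signal or scanner gain don’t masquerade as differences between subjects. This is only a hedged illustration of the general idea (real fMRI preprocessing involves far more steps), with hypothetical data:

```python
import numpy as np

def zscore_per_subject(scans):
    """Z-score each subject's scan independently.

    `scans` has shape (subjects, voxels). Centering and scaling within
    each subject removes that subject's baseline offset and gain, so
    scans can be pooled and analyzed on a common footing.
    """
    mean = scans.mean(axis=1, keepdims=True)
    std = scans.std(axis=1, keepdims=True)
    return (scans - mean) / std

# Two subjects whose scanners report very different baselines,
# but whose relative activation patterns are identical
scans = np.array([[100.0, 102.0, 98.0],
                  [  1.0,   3.0, -1.0]])
z = zscore_per_subject(scans)
```

After normalization the two rows are identical, since only the subject-specific offset and scale differed; a model trained on `z` can no longer key on which scanner produced the data.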

There also were a lot of noisy labels. One patient might have two physicians who might arrive at different prognoses. How do you reconcile this? This was another area that I needed to learn more about to ensure the data were as accurate as possible. 

Is there a website where people can go for more information?

Yes, my website lists several papers on this topic. 

Aya Abdelsalam Ismail, M.S.
Department of Computer Science, University of Maryland