Cancer Data Science Pulse

Blinded by the Light—Seeking the Truth Behind Data Outliers

On November 17, David Kepplinger, Ph.D., will present the next Data Science Seminar, “Robust Prediction of Stenosis from Protein Expression Data.” In this blog, Dr. Kepplinger describes why it’s important to delve into unexpected data values when conducting biostatistical analysis.

You’ll be discussing the “Prediction of Stenosis from Protein Expression Data” in the upcoming Data Science Seminar Series webinar. Can you tell us what first interested you in this topic? 

When I was working on my Ph.D. at the University of British Columbia, I had the opportunity to assist with the final analysis of a study. The study had just concluded, and the goal was to identify proteomic biomarkers of cardiac allograft vasculopathy, a major complication suffered by up to 50% of cardiac transplant recipients after the first year of transplantation. Ultimately, the group leading the study, including my Ph.D. advisor Dr. Gabriele Cohen Freue, hoped the findings would help to develop a noninvasive screening tool to predict heart vessel stenosis. Quantifying stenosis typically requires an invasive procedure or imaging that must be done in specialized centers and carries a risk of radiation exposure. Having a screening tool that could be deployed quickly and more often could help us identify cardiac transplantation complications early on.

Like most studies, the stenosis investigation uncovered some issues with the data. There were unusual stenosis values and unexpected proteomic profiles found within the cohort of subjects. To accommodate these issues, the group went back to basic, firmly established statistical analyses. This helped identify general associations and draw meaningful findings. Time just didn’t allow them to develop the methods required for a more robust examination of outliers and their effects. 

Still, Dr. Cohen Freue and I had a nagging feeling that using these basic analyses meant that we weren’t truly interpreting the data in the most meaningful way possible. She had already done some preliminary work on how to tackle this problem, so we joined forces to investigate possible solutions that would apply not only to this study but also to similar studies looking to identify associations between large numbers of genes/proteins and the severity of disease. I was very happy to give more attention to this issue, which then found its way into my dissertation and launched this new line of research.

What is it about your background that made you want to look more closely at these data anomalies?

I guess I’ve always been rather exacting when it comes to my work. My advisor at the Vienna University of Technology, where I studied as an undergraduate and for my master’s, was Professor Peter Filzmoser. He was very keen on robust statistics, so I learned right from the start that it was important to account for outliers and examine them more closely.

Ultimately, I believe findings are more generalizable when you take the time to identify unusual values and chase down the specific reasons for their presence. It not only gives us a better understanding of what’s driving the entire process, it also gives us insight into the subprocesses that might be present.

Oftentimes, these artifacts arise because the study populations aren’t as homogeneous as we thought. Especially now, as we collect such vast amounts of data, we see more heterogeneity than ever before. There are vast differences in genomic profiles, even in populations that we think will be very similar. However, omitting individuals from the analysis because they don’t appear to fit the norm may cause us to miss underlying trends. We may miss what those subjects have in common with others.

The issue, then, is how we can use highly complex data with all their rich and vast information without being blinded by the very bright light cast by these unusual values.

Your solution was a new algorithm. Can you tell us more about PENSE and PENSEM?

Yes, we developed a penalized adaptive elastic net S-Estimator for linear regression, which we termed PENSE, and it proved to be highly robust to these unusual values. Even when many outliers were present, the method still gave accurate predictions of stenosis and identified the relevant proteins.

Then, as an additional step, we performed an M-step after the elastic net (EN) S-Estimator. This solution, called PENSEM, allowed us to further refine the original solution. That is, once we were able to determine which blood samples were the most reliable, we used PENSEM to give us an even better selection of a panel of biomarkers.
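To make the idea of down-weighting unreliable samples concrete, here is a minimal Python sketch. It is not PENSE or PENSEM: it has no elastic net penalty and no S-Estimator, and it is not taken from Dr. Kepplinger's software. It only illustrates, on synthetic data, a generic M-type refinement using iteratively reweighted least squares with Tukey bisquare weights. The function names, the tuning constant, the simulated data, and the choice to start from ordinary least squares are all assumptions made for illustration.

```python
import numpy as np


def tukey_bisquare_weights(scaled_resid, c=4.685):
    """Weights that shrink smoothly to zero for large standardized residuals."""
    u = scaled_resid / c
    w = (1.0 - u**2) ** 2
    w[np.abs(u) >= 1.0] = 0.0
    return w


def mad_scale(resid):
    """Robust residual scale: median absolute deviation, normal-consistent."""
    med = np.median(resid)
    return 1.4826 * np.median(np.abs(resid - med))


def robust_fit(X, y, n_iter=50, tol=1e-8):
    """Illustrative M-type fit via iteratively reweighted least squares.

    Assumption: we start from ordinary least squares for simplicity.
    An S-Estimator (as in PENSE) would give a more reliable starting point.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    weights = np.ones(len(y))
    for _ in range(n_iter):
        resid = y - X @ beta
        weights = tukey_bisquare_weights(resid / mad_scale(resid))
        sw = np.sqrt(weights)
        # Weighted least squares: scale rows of X and y by sqrt(weight).
        new_beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta, weights


# Synthetic data (hypothetical): 3 predictors, 10% contaminated responses.
rng = np.random.default_rng(0)
n = 200
true_beta = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(n, 3))
y = X @ true_beta + 0.5 * rng.normal(size=n)
outliers = np.argsort(X[:, 0])[-20:]  # samples with the largest first predictor
y[outliers] += 30.0                   # shift their responses far from the trend

ols_beta = np.linalg.lstsq(X, y, rcond=None)[0]
robust_beta, weights = robust_fit(X, y)

ols_err = np.linalg.norm(ols_beta - true_beta)
robust_err = np.linalg.norm(robust_beta - true_beta)
print("OLS coefficient error:   ", ols_err)
print("Robust coefficient error:", robust_err)
print("Mean weight, contaminated samples:", weights[outliers].mean())
```

In this sketch the contaminated samples stay in the data set; they simply receive smaller weights than the rest, which mirrors the point of the interview that outliers can be accommodated rather than discarded.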

Your work so far has been related to stenosis. Is there a potential application for PENSE and PENSEM in cancer research?

I do see many potential applications to cancer research, although these studies generally are more complex than the stenosis study. Cancer studies often include both cross-sectional and longitudinal data, and the outcomes are more difficult to quantify compared with our stenosis study. 

I’m currently working on a study examining prostate cancer severity, which typically is measured using the Gleason Score. We’re using a holistic approach to see what’s driving these severities by tying these outcomes to the proteomic profile of the tumor itself while accounting for somatic mutations. 

Unfortunately, timelines for many of the studies today just don’t allow for substantial statistical development. Ideally, biostatisticians would have the time to research and then select the perfect method. But often that method doesn’t exist yet, and resources to develop these methods are scarce. Even if the ideal method does exist in the literature, it often falls on the biostatisticians to find and implement it before it can be applied in a particular study. I feel strongly that easily accessible software tools should be a standard in academic statistical research. This would ensure that methods are readily available to our colleagues across different disciplines.

Who should attend the webinar? What can they expect to learn from this hour with you? 

I think anyone conducting quantitative analyses should attend. I think it’s useful to see the profound effects unusual values can have if we don’t account for them, and how poorly some of the most common methods perform when these outliers are present.

This is something that isn’t talked about as much as it should be. The go-to method is to throw out the outliers that are strikingly unusual. I hope to show the audience that there are better ways to deal with these artifacts. I’d like them to see the importance of embracing the outliers. There are so many ways—both good and bad—that these values can impact your analysis. Ultimately, we learn more about the data if we include them than if we omit them. 

I hope the main takeaway for the audience is that there are alternative statistical methods. With the right tools, it’s possible to extract more information from your data and learn from unusual values rather than simply ignoring them. 

Where do you see this topic headed in the next 5–10 years? What is your hope for this technology/methodology?

I hope to raise awareness among users that there are alternatives to our typical statistical methods, and that unusual values can impact analyses. Today, we tend to focus on unusual values when using labeled data to train models to look for associations. We don’t give this same attention to new, unlabeled samples.

For example, we now have a biomarker assay to predict the stenosis. But if a new patient comes into the clinic with a proteomic profile very different from what we’ve seen before, it could be challenging to quantify the stenosis. Our existing models would give us a way to generalize and estimate associations between protein levels and stenosis. Still, if this new sample is very different, it might lead to less-than-reliable predictions. 

Thus, the next step in this line of research is to ensure that we not only start with robust training for our statistical models to account for unusual values, but also build in ways to refine predictions when new data become available. I’m working with colleagues now to identify ways of making these predictions more robust and incorporating levels of uncertainty into our predictions. We have more work to do to ensure we gain reliability over time.

Where can people go for more information? Is there a recent paper? A website?

Yes, you can find more information on my website, including links to tools and recent publications. 

David Kepplinger, Ph.D.
Assistant Professor of Statistics, School of Computing, George Mason University