Cancer Data Science Pulse
Computer Savvy Scientist Blends Technology with Biology to Create Attention-Based Deep Learning Methods for Genomics Data
You’ll be discussing the technology “AttentiveChrome” in the upcoming webinar. Can you tell us a little about this technology?
Yes, AttentiveChrome is part of a series of deep learning projects, collectively called DeepChrome. We’re using this tool to model and interpret DNA sequence patterns and other chromatin signals (like those from histones that help organize and compress the DNA structure). AttentiveChrome allows us to pinpoint how histone modifications, known as “marks,” on the chromatin influence gene regulation. It not only charts where important marks occur but also captures the complex dependencies among various input signals. By training two levels of attention simultaneously, AttentiveChrome can model all the relevant marks as well as identify important positions per individual mark. The result is a diagnostic tool that can help interpret massive amounts of genomic data to find epigenomic marks and gene products correlated with a particular disease, such as cancer or cardiovascular disease. Our hope is that it will someday prove to be an effective and highly efficient tool for developing treatments.
Who should attend the webinar? And why? What can they expect to learn from this hour with you?
People interested in deep learning applications and genomic data should consider attending. My presentation will be more of a case study on how to use deep learning and, most importantly, how to improve this technology for genomic data analysis. The typical “out of the box” deep learning applications are designed more for computer vision (i.e., imaging data). Using deep learning to interpret genomic data properties requires a different set of methods. It’s not two-dimensional vision data.
How did you first become interested in this topic area?
I graduated from Carnegie Mellon with a degree in Computer Science. From there, I started work in the industry and specialized in machine learning. I was fortunate that my lab did a lot of pioneering work in deep learning, so I was exposed to a lot of innovative ideas. But my interests weren’t purely computational. I also have a background working with biology data, and, as a Ph.D. student, I worked with protein data, specifically on membrane receptor proteins. My doctoral dissertation was on protein interaction networks, and I had the feeling that I hadn’t fully finished that work. I wanted to understand the genetic aspects of those networks. I was still curious, and I found the freedom to explore those topics in academia.
Thanks to my time working in the industry, I had some unique skills that I hoped I could apply to genomic and epigenomic data. I was fortunate to be able to collaborate with Dr. Mazhar Adli, a researcher on the frontlines of epigenomic data generation and technology. He opened the door to so much data! I started to think that maybe we could use deep learning as a way of analyzing those massive amounts of data. It was the right time, the right skills, the right topic, and the right collaborators. As a result, our group at the University of Virginia was the first to look at these questions using machine learning, first with DeepChrome and then with the current solution, AttentiveChrome. More recently, we developed another approach, GCNChrome, for predicting epigenetic state using both sequence and 3D genome data.
You mention that even after leaving school you remained curious about genomic data. What is it about this topic that you find so interesting?
I like that data can have a direct impact on human health. That is, you can connect biological data directly to cancer patients. I wanted to work on questions like this—questions that can lead to answers that impact society and public health.
That’s important to me. AttentiveChrome is a potentially useful tool. But it’s more than that, as it has the potential to impact our understanding of many different types of diseases and help find their treatment.
Because I had two advisors, one in computer science and one in biology, I had a unique opportunity to look at research questions from two very different perspectives with two totally different teams.
How has your interdisciplinary background helped you approach this research?
These truly are two different communities. They have different cultures, different norms, and different ways of organizing science. I’m fortunate that, because of my training in both of these areas, I’m able to approach my work from both perspectives. So rather than a clash of cultures, I’m able to naturally see both sides and ask the most relevant questions.
When you work with a computer scientist, the thinking is one way, and when you work with the biologist, the thinking is another way. For example, the computer scientist tends to look at a problem from a flat or linear perspective—that is, you go from point A to point B. And it’s super fast, with really quick publication cycles, so advances in technology tend to happen very rapidly.
On the other hand, the biologist tends to make discoveries that are a small part of a much larger “biological machine.” One line of questioning might fill in part of the solution but also lead to more questions that still must be addressed. Findings often are published in many papers by many different labs along the pathway of discovery. It’s much more difficult to convert findings from those studies into products or solutions.
Ultimately, my hope is the technology we develop will help connect what we know from basic science to translational research. I’ve been so impressed by the advances made during this recent pandemic in mRNA technology, and how that’s been translated to vaccines that are making a difference in people’s lives. The progression from basic science to effective vaccines was a huge research effort—by individual scientists, pharmaceutical companies, labs in different countries. It shows you how far we can go with such an open, collaborative effort. I hope to be a part of a similar effort.
Was there a particular challenge or obstacle that you encountered in developing AttentiveChrome?
As with all classic deep learning models, we encountered the issue of the “Black Box.” That is the figurative “box” that falls between data input and analysis output. This area reflects the unknown or uncertainty in understanding how the models perform predictions. And as a developer, you normally seek to train a model that will give you the highest accuracy and do not consider how the model achieves accurate predictions.
We can view biological data flow as a linear flow chart, where one step leads to another step, and so on, moving from DNA to RNA, protein, cell, organism, genetic variant, and ultimately disease. Creating a data-driven model to interpret this linear and highly interactive chain of events calls for a highly composable data analysis approach. Each individual step needs to be modeled, and then the cross connections among all the steps also need to be considered if we are to capture what’s happening within these biological layers. This makes it super complex.
To gain a global perspective, we started by looking at whether we could predict if a gene was expressed or not expressed based on measurements on the DNA. You can think of this as a series of switches. In that analogy, each measurement on DNA and chromatin is a different switch to gene regulation. We looked at all the chromatin profile measurements (such as transcription factors, histone modification, DNA seq, etc.) to see what combination of flipping those switches turned on a “lightbulb” or not. Did it lead to gene expression or not?
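The switch analogy can be made concrete with a toy example. The sketch below is not DeepChrome. It is a deliberately simple logistic model, with made-up weights and signal values, that treats each histone mark as a switch and predicts whether a gene is expressed. The mark names are real histone modifications chosen for illustration; the weights, bias, and input numbers are assumptions and are not taken from the work described here.

```python
import math

# Hypothetical weights: a positive weight behaves like an activating switch,
# a negative weight like a repressing one. These numbers are illustrative only.
WEIGHTS = {"H3K4me3": 1.5, "H3K36me3": 0.8, "H3K27me3": -1.2}
BIAS = -1.0


def predict_expressed(signals):
    """Return (probability, label) for one gene.

    `signals` maps each mark name to its average signal near the gene.
    """
    z = BIAS + sum(WEIGHTS[mark] * signals[mark] for mark in WEIGHTS)
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability, probability >= 0.5


# Two made-up genes: one with activating marks high, one with a repressive mark high.
active_gene = {"H3K4me3": 2.0, "H3K36me3": 1.0, "H3K27me3": 0.1}
silent_gene = {"H3K4me3": 0.1, "H3K36me3": 0.2, "H3K27me3": 2.0}

p_active, on_active = predict_expressed(active_gene)
p_silent, on_silent = predict_expressed(silent_gene)
print(f"active gene: p={p_active:.3f}, expressed={on_active}")
print(f"silent gene: p={p_silent:.3f}, expressed={on_silent}")
```

A model like this answers the "lightbulb on or off" question, but a deep network trained on many bins and marks does not expose its weights so directly, which is the interpretability gap discussed next.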
We started with a pretty simple model called DeepChrome. And it was great for showing us if the gene was expressed, but it couldn’t answer this question: what contributes more and what contributes less? This motivated us to take this classic deep learning computation and to create AttentiveChrome.
AttentiveChrome considers the properties of biological inputs. We aimed to model each mark by itself, then model all the cross connections. By taking it beyond the one-level model we were able to magnify all the steps at a local level, and then take the model to a global level, eventually leading to the final output.
Using this approach, we can interpret which position was important and which base pairs had a role in gene regulation. We can see which switches were most important for that gene to be turned on or off. That’s why we call it interpretable deep learning design.
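The two-level idea described above can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the AttentiveChrome implementation: the scoring functions are simple hand-set scalings rather than trained networks, nothing is learned, and the names `two_level_attention`, `bin_scale`, `mark_scale`, `mark_A`, and `mark_B` are hypothetical. It only shows how one attention level can weight positions within each mark while a second level weights the marks themselves.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(values, scores):
    """Weighted average of `values`, with weights given by softmax(scores)."""
    weights = softmax(scores)
    summary = sum(w * v for w, v in zip(weights, values))
    return summary, weights


def two_level_attention(signal, bin_scale, mark_scale):
    """signal maps each mark name to a list of per-position (bin) values.

    Returns (gene_summary, mark_attention, bin_attention).
    """
    marks = list(signal)
    bin_attention = {}
    summaries = []
    for mark in marks:
        bins = signal[mark]
        # Level 1: which positions matter for this mark?
        summary, alpha = attend(bins, [bin_scale[mark] * v for v in bins])
        bin_attention[mark] = alpha
        summaries.append(summary)
    # Level 2: which marks matter for this gene?
    mark_scores = [mark_scale[m] * s for m, s in zip(marks, summaries)]
    gene_summary, beta = attend(summaries, mark_scores)
    return gene_summary, dict(zip(marks, beta)), bin_attention


# Made-up input: mark_A has a sharp peak at position 2, mark_B is flat.
signal = {"mark_A": [0.1, 0.2, 3.0, 0.1], "mark_B": [0.5, 0.5, 0.5, 0.5]}
scales = {"mark_A": 1.0, "mark_B": 1.0}

gene_summary, mark_attn, bin_attn = two_level_attention(signal, scales, scales)
print("mark-level attention:", mark_attn)
print("bin-level attention for mark_A:", bin_attn["mark_A"])
```

Reading the two sets of weights side by side is what makes the design interpretable: the bin-level weights point to the positions that mattered for each mark, and the mark-level weights indicate which marks contributed most to the prediction for that gene.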
Where do you see this topic headed in the next 5–10 years?
I think the next step will be to conduct benchmarking. Everyone has their own approach right now. We need to see what tool works best for genomic data. We saw a similar thing with computer vision. There were a lot of different groups and approaches. Benchmarking allowed a “winner” to emerge. I feel like quantitative benchmarking with biological data is super important. It will help us truly change the industry.
Where can people go for more information?
I’ve assembled a lot of online resources, both on AttentiveChrome and on the challenges of using deep learning for biological data. See the links below for further information.