Cancer Data Science Pulse
Next Generation Artificial Intelligence: New Models Help Unleash the Power of AI
Meet the people who are breaking new ground in the data science field! Whether it’s a new tool, a new model, or a completely new way of using data, we capture the paradigm shifts of today that are helping to drive the cancer research discoveries of tomorrow.
Today, we’re featuring Svitlana Volkova, Ph.D., chief scientist at Pacific Northwest National Laboratory (PNNL). She’ll describe how she’s using “foundation models,” a term coined by the Stanford Institute for Human-Centered Artificial Intelligence, to give scientists and analysts a new tool for unleashing the power of artificial intelligence (AI).
According to Dr. Volkova, these multi-purpose models represent a paradigm shift in AI. A foundation model is trained on large amounts of data in a self-supervised manner, without task-specific labels. The result is a single model that can be rapidly adapted to many different tasks, none of which it was explicitly built for, with only minimal fine-tuning. Such an approach is particularly relevant for researchers looking to take a multi-omic, integrative-data approach to cancer research.
Can you describe how foundation models are different from other AI models?
More and more, we’re finding new ways to augment human intelligence through the use of AI, or machine learning. We’re currently using this technology to analyze massive amounts of complex data to find patterns, forecast trends, make predictions, and more.
Our unique human-centered partnership with AI hinges on our ability to develop models that enhance our own cognitive performance, including how we learn, reason, and make decisions. The better our models, the better our ability to augment these human tasks.
Until just a few years ago, we would take data, train a model on those data, and then apply the model to one specific task, such as object recognition or text translation. This is what we refer to as “narrow” AI.
Now, we’re moving beyond the narrow approach to a much broader one, called a “foundation model.” This all-encompassing tool can be applied to many tasks with minimal adjustment. For text applications, these might include summarization, machine translation, knowledge translation, reasoning, etc. In short, it allows us to apply a single model to a wide variety of tasks.
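To make the contrast with narrow AI concrete, here is a purely illustrative sketch, not any real foundation model: a frozen shared “backbone” (here just a random feature projection standing in for expensive pretraining) is reused across tasks, and only a small task-specific head is fine-tuned for each one. All names and sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed feature extractor plays the role of the pretrained backbone.
# In a real foundation model this would be a large network trained on
# massive unlabeled data; here it is a random tanh projection.
W_backbone = rng.normal(size=(16, 128))

def backbone(x):
    """Frozen features shared by every downstream task."""
    return np.tanh(x @ W_backbone)

def fine_tune_head(X, y, lr=0.2, steps=1000):
    """Fit only a small task-specific linear head (logistic regression)
    on the frozen features -- the 'minimal fine-tuning' step."""
    feats = backbone(X)
    w = np.zeros(feats.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-feats @ w))   # predicted probabilities
        w -= lr * feats.T @ (p - y) / len(y)   # gradient step on log loss
    return w

def predict(w, X):
    return (backbone(X) @ w > 0).astype(int)

# Two different downstream tasks reuse the same backbone unchanged.
X = rng.normal(size=(200, 16))
y_task_a = (X[:, 0] > 0).astype(int)   # toy task A
y_task_b = (X[:, 3] > 0).astype(int)   # toy task B

w_a = fine_tune_head(X, y_task_a)
w_b = fine_tune_head(X, y_task_b)
acc_a = (predict(w_a, X) == y_task_a).mean()
acc_b = (predict(w_b, X) == y_task_b).mean()
print(acc_a, acc_b)
```

Only the tiny head is trained per task; the backbone is paid for once, which is the economic argument for foundation models in miniature.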
People may ask, “Why and how is a foundation model better?” Some might even argue that it’s detrimental to develop a model like this, as it impacts the environment and takes a lot of resources and time (often several months) to train. Yet, we’re seeing that the benefits, especially in terms of accuracy and generalizability, greatly offset these costs.
Compared with narrow AI, the accuracy of foundation models is far superior. Foundation models also are highly generalizable. This is a key attribute and speaks to one of the biggest problems with narrow AI. When you develop one model for one task, the model often can’t be easily moved to a new data set, task, or domain. That’s been a consistent problem with AI—one that especially impacts our ability to share findings, reproduce results, and move science forward.
What’s the advantage of using foundation models in cancer research?
There are a lot of opportunities for foundation models in cancer research. Because these models are expensive to develop and train, they’ve primarily been limited to industry and aren’t common in the biomedical field. However, recent advances in hardware and model development allow us to train models faster and with much lower cost.
Two good examples come from OpenAI and Google’s DeepMind. These open-ended algorithms are built for sustainability: the model keeps learning and solving new tasks, seemingly forever. DeepMind’s multitasking Gato model, for instance, handles more than 600 tasks with a single model of just over 1 billion parameters.
Most importantly, foundation models are showing outstanding performance as they’re deployed across different modalities. For example, models like BERT and GPT-2 have completely transformed the field of natural language processing. Likewise, models such as AlexNet have been game changers in enhancing our ability to interpret images.
We’re now exploring opportunities to apply these large-scale models to domain-specific cancer data, especially multi-omics data.
There’s already been some success. In a recent paper in PNAS, investigators trained a very large model (nearly a billion parameters) on 250 million protein sequences. The researchers then showed how that protein model could be adapted to many tasks, including ones previously handled by narrow AI, such as predicting biological structure and function. This work is at the cutting edge of the field.
Are there risks that are unique to a foundation-model approach?
There are risks, but they really aren’t limited to foundation models; similar risks apply even to narrow models. All models are data driven, so it’s possible to introduce bias. Likewise, if a model isn’t developed in a way that’s responsible and fair, it’s harder to trust it or hold it accountable.
What do I mean by responsible? This speaks to the model’s experimental setup. That is, the community needs to be able to reproduce the model and understand how it was built, so each step that goes into the application needs to be made publicly available. I strongly believe in reproducible research.
Fairness is another aspect, one that’s especially important in the biomedical domain. What if all our protein data come from a single population, completely omitting underrepresented minorities or ignoring environmental factors? For the most accurate results, we need a fair distribution of data, which is usually hard to get.
In testing the model, we need to be sure we can trust it. It’s not enough to meet previous baselines. That is, we can’t simply show that our model performs similarly to other models in the field today on a certain data set. We have to test robustness. We need to use the model on data that fall outside of our original data sets to see how it performs. We need to use adversarial attacks to probe the model’s behavior. Then, we need to look critically at the outcomes.
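The two probes described here, evaluating on data outside the original distribution and mounting adversarial attacks, can be sketched in a few lines. This is a toy logistic-regression “model,” not any real foundation model, and the shift and attack sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Train a tiny logistic-regression model in-distribution.
X = rng.normal(size=(500, 10))
w_true = rng.normal(size=10)
y = (X @ w_true > 0).astype(int)

w = np.zeros(10)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - y) / len(y)

def accuracy(w, X, y):
    return ((X @ w > 0).astype(int) == y).mean()

# Probe 1: out-of-distribution data -- same labels, shifted inputs.
X_shift = X + 1.0

# Probe 2: adversarial attack (FGSM-style) -- nudge every input in the
# direction that most hurts the model's decision for its true label.
eps = 1.0
attack_dir = np.sign(np.outer(2 * y - 1, w))  # per-sample worst direction
X_adv = X - eps * attack_dir

acc_clean = accuracy(w, X, y)
acc_shift = accuracy(w, X_shift, y)
acc_adv = accuracy(w, X_adv, y)
print(acc_clean, acc_shift, acc_adv)
```

A model that looks strong on its own test set can degrade sharply under these probes, which is exactly why meeting previous baselines is not enough.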
In the end, there’s no universal hammer or miracle model that can solve every task, every time. It’s important to report the limitations right from the start so model users understand what the model can and cannot do. Setting clear expectations is key to building accountability.
We need full transparency if we’re to move model development and deployment forward.
You mention that foundation models rely on large quantities of data. Are there problems associated with using so much data?
Throughout my career, I’ve worked with massive amounts of data, including human-generated social media data. Most of the models adopted by the AI community reflect how we learn as humans. If you’re exposed to only one thing, you can only learn so much. Imagine you’re sitting in your house and exposed only to your living room. You’ll have only a narrow perspective of the world. Then, imagine that you travel to the nearest town, then to a major city, and finally overseas. Suddenly, your horizons are broadened considerably. As a learner, much like our model, that expanded travel exposes you to many more “data points.” This is what underlies our ability to improve the model’s accuracy and generalizability.
Of course, there’s also a downside to this exposure. For example, a foundation model trained on internet data will, in turn, reflect all the human biases it encounters, including inaccuracies and distortions. The model learns what it is exposed to unless we intervene. As a result, we’ve learned we need to put our models on a leash.
To relate that to our example here, when we “leash our model,” it’s similar to going to the city, but following a specific guidebook that takes you only to the places you want to go.
There are multiple ways we can put a model on a leash—either by guiding learning as it occurs or by adjusting the model after the fact. No one wants to use a model that learns any kind of bias. We want to build technology that’s better than us.
Speaking of building technology, where do you see foundation models headed in the next 5–10 years?
Today, AI is at a stage where it’s proving useful for augmenting the work done by scientists and analysts. That is, AI helps humans perform certain cognitive acts like reasoning, making decisions, or gaining knowledge from large amounts of data. Yet AI is still limited; it hasn’t made the leap to the next level to become at least semi-autonomous and operational.
In the next few years, we should see more of this autonomy—not full autonomy of course, but semi-autonomous models that could actually drive laboratory automation. Humans will still need to be the “guidebooks,” leading and interacting with the model, but this partnership is already showing promise. AlphaFold is one example, giving scientists a tool to help predict protein structure and function.
Other foundation models will help streamline the discovery process. For example, it’s nearly impossible to stay completely up to date on all of today’s scientific literature. No human can read and interpret the significance of those findings in a timely and meaningful way. In the future, foundation models will be able to glean the latest literature, generate recommendations, and inform the scientist’s next steps.
In 10 years, I think we’ll take this even further, and models will be a part of the discovery lifecycle. The model will read the existing knowledge and generate recommendations for scientific hypotheses and experiments. The scientists will be able to focus on implementing what the model finds, hopefully with limited intervention.
Can you tell us what led you to study foundation models?
I was born in Ukraine, and I came to the United States 13 years ago on a Fulbright Scholarship. While completing my master’s at Kansas State University, I first heard the term “machine learning.” At the time, I was working on extracting biomedical information from open-source data, including scientific literature, web pages, reports, etc. Back then, I was particularly excited about machine learning and natural language processing using narrow AI models.
I went on to earn my doctorate at Johns Hopkins University (JHU), working with open-source data and machine learning models. (This was pre-deep learning.) Working in JHU’s Human Language Technology Center of Excellence and Center for Language and Speech Processing, I began to build models to help study human behavior online. Although not directly related to the biomedical domain, this work was similar in that we examined the full range of the human experience: people’s interests, properties, actions, preferences, emotions, etc. In a way, this mirrors the multifaceted data we see in cancer research.
Today, my work continues at PNNL, where I’m leading an internal investment called Mega AI. Mega AI focuses on developing foundation models of scientific knowledge to augment our ability to perceive and reason at scales previously unimagined. To date, we have developed small-scale but sustainable 7-billion-parameter models from the scientific literature for chemistry and the climate sciences. This work stands in contrast to some of the larger models today, such as a 2-trillion-parameter model developed in China, or Google’s, not far behind at 1.7 trillion parameters. At PNNL, we’re committed to making smaller, sustainable foundation models that focus specifically on science and security applications and can be put into practice much faster.
Dr. Volkova was recently featured in CBIIT’s Data Science Seminar Series. Visit the website for more information on the Data Science Seminar speakers and other events. A library of previously recorded Seminar Series webinars also is available.