Natural language processing
Viruses lead a rather repetitive existence. They enter a cell, hijack its machinery to turn it into a viral copy machine, and those copies head on to other cells armed with instructions to do the same. So it goes, over and over again. But somewhat often, amidst this repeated copy-pasting, things get mixed up. Mutations arise in the copies. Sometimes, a mutation means an amino acid doesn’t get made and a vital protein doesn’t fold—so into the dustbin of evolutionary history that viral version goes. Sometimes the mutation does nothing at all, because different sequences that encode the same proteins make up for the error. But every once in a while, mutations go perfectly right. The changes don’t affect the virus’s ability to exist; instead, they produce a helpful change, like making the virus unrecognizable to a person’s immune defenses. When that allows the virus to evade antibodies generated from past infections or from a vaccine, that mutant variant of the virus is said to have “escaped.”
Scientists are always on the lookout for signs of potential escape. That’s true for SARS-CoV-2, as new strains emerge and scientists investigate what genetic changes could mean for a long-lasting vaccine. (So far, things are looking okay.) It’s also what confounds researchers studying influenza and HIV, which routinely evade our immune defenses. So in an effort to see what’s possibly to come, researchers create hypothetical mutants in the lab and see if they can evade antibodies taken from recent patients or vaccine recipients. But the genetic code offers too many possibilities to test every evolutionary branch the virus might take over time. It’s a matter of keeping up.
Last winter, Brian Hie, a computational biologist at MIT and a fan of the lyric poetry of John Donne, was thinking about this problem when he alighted upon an analogy: What if we thought of viral sequences the way we think of written language? Every viral sequence has a sort of grammar, he reasoned: a set of rules it needs to follow in order to be that particular virus. When mutations violate that grammar, the virus reaches an evolutionary dead end. In virology terms, it lacks “fitness.” And like language, the sequence could be said to have a kind of semantics from the immune system’s perspective. There are some sequences the immune system can interpret (and thus stop the virus with antibodies and other defenses) and some that it can’t. So a viral escape could be seen as a change that preserves the sequence’s grammar but changes its meaning.
The analogy had a simple, almost too simple, elegance. But to Hie, it was also practical. In recent years, AI systems have gotten very good at modeling principles of grammar and semantics in human language. Researchers do this by training a system with data sets of billions of words, arranged in sentences and paragraphs, from which the system derives patterns. In this way, without being told any specific rules, the system learns where the commas should go and how to structure a clause. It can also be said to intuit the meaning of certain sequences (words and phrases) based on the many contexts in which they appear throughout the data set. It’s patterns, all the way down. That’s how the most advanced language models, like OpenAI’s GPT-3, can learn to produce perfectly grammatical prose that manages to stay reasonably on topic.
One advantage of this idea is that it’s generalizable. To a machine learning model, a sequence is a sequence, whether it’s arranged in sonnets or amino acids. According to Jeremy Howard, an AI researcher at the University of San Francisco and a language model expert, applying such models to biological sequences can be fruitful. With enough data from, say, genetic sequences of viruses known to be infectious, the model will implicitly learn something about how infectious viruses are structured. “That model will have a lot of sophisticated and complex knowledge,” he says. Hie knew this was the case. His graduate advisor, computer scientist Bonnie Berger, had previously done similar work with another one of her lab's members, using AI to predict protein folding patterns.
So this spring, Berger's lab tried out Hie’s analogy, and the results are out today in Science. At first, the team had been interested in influenza and HIV, which are both notorious for evading vaccines. But when they began their lab work in March, sequences from the novel coronavirus were becoming available, so they decided to add those in as well. For all three viruses, they homed in on sequences for the proteins the viruses use to enter cells and replicate, explains Bryan Bryson, a professor of biological engineering at MIT and a coauthor of the research. These also happen to be primary immune system and vaccine targets. They’re the places where antibodies latch on, preventing the virus from entering a cell and marking it for destruction. (For SARS-CoV-2, that’s the spike protein.) For each of the viruses, the MIT team trained a language model using the genetic sequence data instead of the usual paragraphs and sentences.
Then they checked on what the model learned about the sequences. Sequences deemed to have similar “meanings” should infect the same hosts, the researchers reasoned. The genetic language of a swine flu would be semantically more similar to another swine flu than to a flu that normally infects humans. They were pleased to see that this was the case, and also to find that certain strains that had spilled from one species to another in the real world, like avian flu in 1918 and 2009, were scored as semantically similar. Then they checked the grammar. How well did a sequence’s “grammar” score correspond to how viable a virus was in real-world conditions? The researchers gathered data from past research quantifying the fitness of various mutants (how well they bound to or replicated in cells) for all three viruses, and then examined how grammatical the model believed those sequences to be. Grammaticality seemed to be a good proxy for their fitness.
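To make the idea of a “grammar” score concrete, here is a minimal sketch in Python. It is not the MIT team’s model, which the article describes only as a language model trained on genetic sequence data. This toy version simply counts which amino acids tend to follow which in a handful of made-up strings, then scores a new string by how familiar its transitions look. The function names and the training strings are assumptions for illustration only.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids


def train_bigram(sequences):
    """Count adjacent residue pairs across the training sequences."""
    pair_counts = Counter()
    first_counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts


def grammaticality(seq, model):
    """Average log-probability per transition, with add-one smoothing."""
    pair_counts, first_counts = model
    total = 0.0
    for a, b in zip(seq, seq[1:]):
        p = (pair_counts[(a, b)] + 1) / (first_counts[a] + len(ALPHABET))
        total += math.log(p)
    return total / (len(seq) - 1)


# Made-up toy strings, NOT real viral sequences.
toy_training = ["MKTAYIAKQR", "MKTAYLAKQR", "MKTVYIAKQR"]
model = train_bigram(toy_training)

# One string that closely resembles the training data, one scrambled string.
familiar = grammaticality("MKTAYIAKQK", model)
scrambled = grammaticality("QWKMRYTAIK", model)
print(familiar > scrambled)
```

A string whose transitions were common in the training data scores higher than a scrambled one, which is the rough sense in which a model can find a sequence more or less “grammatical” without being told any rules.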
But Bryson and Hie wanted to know if combining the two proxies could predict viral escape. When they compared their model’s predictions to prior known instances of actual viral escape, the influenza model was the most predictive. That wasn’t surprising, because the data set they used to train the model was particularly large, including years’ worth of influenza sequences and a wealth of mutations known to sneak past the human immune system. For SARS-CoV-2, they checked their predictions against escape mutants that had been artificially derived, passed through antibody-rich serum until the selection pressure produced mutants that could evade the antibodies. (In other words, not anything we currently need to worry about in the real world.) The correlation was looser. The model flagged most of the true escape mutants, but it also flagged sequences that weren’t escape mutants at all.
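One plausible way to combine the two proxies is to rank candidate mutants on each score and add the ranks, so that a mutant rises to the top only if it both changes “meaning” and stays “grammatical.” The Python sketch below shows that idea under stated assumptions: the article does not say how the researchers combined their scores, the `beta` weight is a hypothetical knob, and the mutant names and numbers are invented.

```python
def rank(values):
    """Return the rank of each value (0 = lowest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks


def escape_priority(candidates, beta=1.0):
    """Order candidates by semantic-change rank plus beta times grammaticality rank."""
    names = list(candidates)
    sem = rank([candidates[n]["semantic_change"] for n in names])
    gram = rank([candidates[n]["grammaticality"] for n in names])
    combined = {n: sem[i] + beta * gram[i] for i, n in enumerate(names)}
    # Highest combined rank first: large change in "meaning" AND still "grammatical".
    return sorted(names, key=lambda n: combined[n], reverse=True)


# Invented scores for illustration only; not results from the study.
toy = {
    "mutant_A": {"semantic_change": 0.9, "grammaticality": -1.2},
    "mutant_B": {"semantic_change": 0.8, "grammaticality": -4.0},
    "mutant_C": {"semantic_change": 0.1, "grammaticality": -1.0},
    "mutant_D": {"semantic_change": 0.2, "grammaticality": -5.0},
}
ranking = escape_priority(toy)
print(ranking[0], ranking[-1])
```

Using ranks rather than raw scores sidesteps the fact that the two numbers live on different scales. In this toy data, the mutant that scores well on both measures lands first and the one that scores poorly on both lands last.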
Still, it’s a start that could give virologists a better grip on where natural mutations are headed. “This is a phenomenal way of narrowing down the entire universe of potential mutant viruses,” says Benhur Lee, a microbiologist at Mount Sinai’s Icahn School of Medicine who wasn’t involved in the work. The predictions are only as good as the data that goes into them, he adds. And as the researchers note, that means the model misses certain nuances, because escape is not always solely a function of the mutations the virus acquires. HIV is a good example. Sometimes, the sequence doesn’t change, and viral proteins are still recognized by antibodies, but those proteins are shielded by a type of sugary compound called a glycan.
Lee points out that the AI predictions are good for telling researchers what they already know. The model correctly identified, for example, the two parts of the SARS-CoV-2 spike that researchers believe are more inclined to accumulate escape mutations, and another section that’s more stable, and thus a better antibody target. But it remains to be seen whether its predictions can provide truly novel insights. One area where the paper’s authors believe computational models will be most useful is in identifying so-called “combinatorial mutations” that involve many changes built on each other. But the models will likely need much more data before they produce good leads for lab scientists like Lee.
The next step, which will begin this Friday with Bryson’s collaborators in another lab, will involve creating some of the predicted SARS-CoV-2 mutants in the lab and seeing how they fare against antibodies in serum taken from recovered and vaccinated individuals. They’ll be using what’s known as a pseudotyped virus, which can test how well the antibodies neutralize a particular variation of the virus but is not dangerously infectious. They’ll also test a few sequences picked up in efforts to sequence viral samples from Covid-19 patients that the model suggested were more primed for escape than others, Bryson says.
The lab members are wondering whether their analogy may apply in other situations. Could a similar model predict if an immune system will grow intolerant of a particular cancer treatment, or how a tumor mutation might evolve to evade the body’s controls? With the right data, Bryson’s lab would like to try it out. “A good analogy can go a long way,” he says.
Updated 1/14/21 at 4:00pm PT to clarify that the research described in the article took place in Bonnie Berger's lab at MIT.