A next-generation AI protein folder that could help science? Meta is on to something

Meta AI researchers say they have developed the largest protein folding model of its kind to date, and it is capable of predicting the structure of more than 600 million proteins.

The team published the 15-billion-parameter ESM-2 transformer-based model and a database of its protein structure predictions, called the ESM Metagenomic Atlas, on Tuesday. This database includes forms of proteins that have not yet been observed by scientists.

Proteins are complex biological molecules that contain up to 20 types of amino acids and perform all kinds of biological functions in organisms. Crucially, they fold into intricate 3D structures, whose shape is vital to their function; knowing their shape helps scientists understand how they work, and from that helps them find ways to mimic, alter, or counteract that behavior.

Unfortunately, you can’t take the amino acid formula and immediately figure out the final structure. You can do simulations or experimentation to find out, but that takes a lot of time. These days, you can give the right machine learning software the chemical composition of a protein and the model will predict the structure quickly and accurately, relatively speaking.

In fact, DeepMind demonstrated this with its AlphaFold model, which won the biennial CASP international protein-folding competition in 2020. Given an input string of amino acids, AlphaFold and other machine learning software can generate the corresponding three-dimensional structure.

Since then, London-based DeepMind researchers have improved their system to predict the structure of more than 200 million proteins known to science. Meta’s latest ESM system has gone further, predicting hundreds of millions more after being trained on millions of protein sequences.

A preprint paper by the Meta team (Lin et al) explaining the design of ESM-2 is available online. Interestingly, according to the researchers, the system is actually a large language model made to “learn evolutionary patterns and generate accurate end-to-end structure predictions directly from a protein’s sequence.” AlphaFold, for one, is not a language model and uses a different approach.

As the boffins point out in their paper, these large language models can be used for much more than handling human languages: “Modern language models containing tens or hundreds of billions of parameters develop skills such as translating languages, common sense reasoning, and math problem solving, all without explicit supervision.

“These observations raise the possibility that language models trained on protein sequences may show a parallel form of emergence.”

The result is ESM-2, which, although a language model, has been taught to predict the physical shape of a protein from a text string representing its amino acids.
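To make the "text string" idea concrete, here is a minimal sketch of how a protein sequence can be written as letters and turned into integer tokens, the kind of input a language model consumes. It assumes a toy vocabulary of just the 20 standard one-letter amino acid codes; it is not the actual ESM-2 tokenizer, and the example sequence is made up for illustration.

```python
# Toy vocabulary: the 20 standard amino acids by one-letter code.
# This is an illustrative assumption, not the real ESM-2 tokenizer.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_IDS = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence: str) -> list[int]:
    """Map a protein sequence string to a list of integer token ids."""
    sequence = sequence.upper()
    unknown = sorted(set(sequence) - set(AMINO_ACIDS))
    if unknown:
        raise ValueError(f"unrecognized residue codes: {unknown}")
    return [TOKEN_IDS[aa] for aa in sequence]


def decode(token_ids: list[int]) -> str:
    """Map integer token ids back to the one-letter sequence string."""
    return "".join(AMINO_ACIDS[i] for i in token_ids)


if __name__ == "__main__":
    example = "MKTAYIAKQR"  # hypothetical sequence, for illustration only
    tokens = encode(example)
    print(tokens)
    print(decode(tokens))
```

A real model would add special tokens and feed these ids through a transformer; the sketch only shows why a protein can be treated like a sentence.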

ESM-2 is the largest model of its kind, and apparently predicts structures faster than similar systems; it’s up to 60 times faster than previous state-of-the-art systems like AlphaFold or Rosetta, which can take more than ten minutes to generate an output, according to Meta.

The model was used to create the ESM Metagenomic Atlas, predicting more than 600 million structures from the MGnify90 protein database in just two weeks using 2,000 GPUs. On a single Nvidia V100 GPU, it takes only 14.2 seconds to predict the structure of a protein consisting of 384 amino acids. According to the paper, Meta’s system mostly, but not completely, matches AlphaFold in accuracy; its speed is the key advantage, as that allows it to predict far more proteins.
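For a rough sense of scale, the Atlas figures can be turned into an average cost per structure. The sketch below is back-of-the-envelope arithmetic only: it assumes exactly 600 million structures, exactly 14 days, and all 2,000 GPUs busy the whole time, none of which the article states precisely. The article also does not say which GPUs were used for the Atlas run, so the result is not directly comparable with the 14.2-second V100 figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def total_gpu_seconds(gpus: int, days: int) -> int:
    """Total GPU time available, assuming every GPU runs for the full period."""
    return gpus * days * SECONDS_PER_DAY


def gpu_seconds_per_structure(structures: int, gpus: int, days: int) -> float:
    """Average GPU-seconds spent per predicted structure."""
    return total_gpu_seconds(gpus, days) / structures


if __name__ == "__main__":
    # Assumed round numbers taken from the article's "more than 600 million",
    # "two weeks" and "2,000 GPUs".
    avg = gpu_seconds_per_structure(structures=600_000_000, gpus=2_000, days=14)
    print(f"{avg:.3f} GPU-seconds per structure on average")
```

Under those assumptions the average works out to roughly four GPU-seconds per structure.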

“With today’s state-of-the-art computational tools, predicting structures for hundreds of millions of protein sequences in a practical timeframe could take years, even using the resources of a major research institution. To make predictions at the scale of metagenomics, an advance in the speed of prediction is fundamental,” said the owner of Facebook.

Meta hopes that ESM-2 and the ESM Metagenomic Atlas will advance science by helping researchers study evolutionary history or tackle disease and climate change. “To further expand this work, we are studying how language models can be used to design new proteins and contribute to solving challenges in health, disease and the environment,” the company concluded. ®
