Blog

  • AlphaFold 2 Through the Context of BERT


    Meghan Heintz

    Understanding AI applications in bio for machine learning engineers

    Photo by Google DeepMind on Unsplash

    AlphaFold 2 and BERT were both developed in the cradle of Google’s deeply lined pockets, albeit by different departments and at different times: BERT came out of Google AI in 2018, while AlphaFold 2 debuted from DeepMind at CASP14 in 2020. They represented huge leaps forward in state-of-the-art models for natural language processing (NLP) and biology respectively. For BERT, this meant topping the leaderboard on benchmarks like GLUE (General Language Understanding Evaluation) and SQuAD (Stanford Question Answering Dataset). For AlphaFold 2 (hereafter just referred to as AlphaFold), it meant achieving near-experimental accuracy in predicting 3D protein structures. In both cases, these advancements were largely attributed to the transformer architecture and its self-attention mechanism.

    I expect most machine learning engineers have a cursory understanding of how BERT, or Bidirectional Encoder Representations from Transformers, works with language, but only a vague, metaphorical understanding of how the same architecture is applied to the field of biology. The purpose of this article is to explain the concepts behind the development and success of AlphaFold through the lens of how they compare and contrast with BERT.

    Forewarning: I am a machine learning engineer and not a biologist, just a curious person.

    BERT Primer

    Before diving into protein folding, let’s refresh our understanding of BERT. At a high level, BERT is trained on two objectives: masked token prediction and next-sentence prediction.

    Example masked token prediction where “natural” was the masked token in the target sentence. (All images, unless otherwise noted, are by the author)
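    To make the first objective concrete, here is a minimal sketch of masked token prediction using the Hugging Face transformers library. The checkpoint name is the standard public one, but treat the snippet as illustrative; it is not BERT’s original training code.

    ```python
    # Illustrative sketch: query a pretrained BERT for a masked token.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model="bert-base-uncased")

    # BERT scores candidate tokens for the [MASK] position using both
    # the left and right context.
    for prediction in unmasker("Language is a [MASK] phenomenon."):
        print(prediction["token_str"], round(prediction["score"], 3))
    ```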

    BERT falls into the sequence model family. Sequence models are a class of machine learning models designed to handle and make sense of sequential data where the order of the elements matters. Members of the family include Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Transformers. As a Transformer model (like its more famous relative, GPT), a key unlock for BERT was that training could be parallelized. RNNs and LSTMs process sequences sequentially, which slows down training and limits the applicable hardware. Transformer models use the self-attention mechanism, which processes the entire sequence in parallel and allows training to leverage modern GPUs and TPUs, which are optimized for parallel computing.

    Processing the entire sequence at once not only decreased training time but also improved embeddings by modeling the contextual relationships between words. This allows the model to better understand dependencies, regardless of their position in the sequence. A classic example illustrates this concept: “I went fishing by the river bank” and “I need to deposit money in the bank.” To readers, bank clearly represents two distinct concepts, but previous models struggled to differentiate them. The self-attention mechanism in transformers enables the model to capture these nuanced differences. For a deeper dive into this topic, I recommend watching this Illustrated Guide to Transformers Neural Network: A step by step explanation.

    Example sentences where previous NLP models would have failed to differentiate the two meanings of “bank” (the river bank versus the financial institution).

    One reason RNNs and LSTMs struggle is that they are unidirectional, i.e., they process a sentence from left to right. So if the sentence were rewritten as “At the bank, I need to deposit money,” then “money” would no longer clarify the meaning of “bank,” because it arrives after it. The self-attention mechanism eliminates this fragility by allowing each word in the sentence to “attend” to every other word, both before and after it, making the model “bidirectional.”
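    We can see this bidirectional, contextual behavior directly. The sketch below, assuming the Hugging Face transformers and PyTorch packages, extracts BERT’s embedding for “bank” from each of the two example sentences and compares them; the exact similarity value will vary, but the two vectors are measurably different.

    ```python
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def bank_embedding(sentence: str) -> torch.Tensor:
        """Return BERT's contextual embedding for the token 'bank'."""
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        return hidden[tokens.index("bank")]

    river = bank_embedding("I went fishing by the river bank.")
    money = bank_embedding("I need to deposit money in the bank.")

    # The same surface word gets a different vector in each context.
    similarity = torch.nn.functional.cosine_similarity(river, money, dim=0)
    print(f"cosine similarity between the two 'bank' vectors: {similarity.item():.2f}")
    ```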

    AlphaFold and BERT Comparison

    Now that we’ve reviewed the basics of BERT, let’s compare it to AlphaFold. Like BERT, AlphaFold is a sequence model. However, instead of processing words in sentences, AlphaFold’s inputs are amino acid sequences and multiple sequence alignments (MSAs), and its output/prediction is the 3D structure of the protein.

    Let’s review what these inputs and outputs are before learning more about how they are modeled.

    First input: Amino Acid Sequences

    Amino acid sequences are embedded into high-dimensional vectors, similar to how text is embedded in language models like BERT.
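    As a toy sketch of what that embedding step looks like (the vocabulary, embedding size, and peptide below are invented for illustration and are not AlphaFold’s actual featurization):

    ```python
    import torch

    # The 20 standard amino acids, one-letter codes (toy vocabulary).
    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    aa_to_idx = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

    # Toy lookup table mapping each residue type to a 64-dimensional vector.
    embed = torch.nn.Embedding(num_embeddings=len(AMINO_ACIDS), embedding_dim=64)

    sequence = "MKTAYIAKQR"  # hypothetical short peptide
    ids = torch.tensor([aa_to_idx[aa] for aa in sequence])
    vectors = embed(ids)
    print(vectors.shape)  # torch.Size([10, 64]): one vector per residue
    ```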

    Reminder from your high school biology class: the specific sequence of amino acids that make up a protein is determined by mRNA, which is transcribed from the instructions in DNA. As the amino acids are linked together, they interact with one another through various chemical bonds and forces, causing the protein to fold into a unique three-dimensional structure. This folded structure is crucial for the protein’s function, as its shape determines how it interacts with other molecules and performs its biological roles. Because the 3D structure is so important for determining a protein’s function, “protein folding” has been a major research problem for the last half-century.

    Bio 101 reminder on the relationship between DNA, mRNA, and Amino Acid Sequences

    Before AlphaFold, the only reliable way to determine how an amino acid sequence would fold was experimental validation, using techniques like X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM). Though accurate, these methods are time-consuming, labor-intensive, and expensive.

    So what is an MSA (multiple sequence alignment) and why is it another input into the model?

    Second input: Multiple Sequence Alignments (MSAs), represented as matrices in the model

    Amino acid sequences contain the necessary instructions to build a protein, but they also include some less important or more variable regions. Comparing this to language, I think of these less important regions as the “stop words” of protein folding instructions. To determine which regions of the sequence are the analogous stop words, MSAs are constructed from homologous (evolutionarily related) sequences of proteins with similar functions, arranged as a matrix in which the target sequence is the first row.

    Similar regions of the sequences are thought to be “evolutionarily conserved” (parts of the sequence that stay the same). Highly conserved regions across species are structurally or functionally important (like active sites in enzymes). My imperfect metaphor here is to think about lining up sentences from Romance languages to identify shared important words. However, this metaphor doesn’t fully explain why MSAs are so important for predicting the 3D structure. Conserved regions are so critical because they allow us to detect co-evolution between amino acids. If two residues tend to mutate in a coordinated way across different sequences, it often means they are physically close in the 3D structure and interact with each other to maintain protein stability. This kind of evolutionary relationship is difficult to infer from a single amino acid sequence but becomes clear when analyzing an MSA.

    An imperfect metaphor for MSAs: Like comparing similar words in Romance languages (e.g., “branches”: ramas, branches, rami, ramos, ramuri, branques), MSAs align sequences to reveal evolutionary connections, tracing shared origins through small variations.
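    To ground the idea, here is a toy sketch over a fabricated four-sequence alignment: a crude per-column conservation score, plus a crude mutual-information score for spotting columns that co-vary. Real pipelines use far more sophisticated statistics and much deeper alignments.

    ```python
    import numpy as np
    from collections import Counter

    # Fabricated toy MSA: each row is a homologous sequence, the first row
    # is the target. Real MSAs have hundreds or thousands of rows.
    msa = ["MKVLA", "MKILA", "MRVLG", "MKVFA"]
    cols = list(zip(*msa))  # columns of the alignment

    # Conservation: fraction of sequences agreeing with the modal residue.
    for i, col in enumerate(cols):
        top_count = Counter(col).most_common(1)[0][1]
        print(f"column {i}: conservation = {top_count / len(col):.2f}")

    # Co-variation: a crude mutual-information score between two columns.
    def mutual_information(a, b):
        n = len(a)
        pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
        return sum(
            (c / n) * np.log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
            for (x, y), c in pab.items()
        )

    print(f"MI(col 1, col 3) = {mutual_information(cols[1], cols[3]):.2f}")
    ```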

    Here is another place where the comparison between natural language processing and protein folding diverges: MSAs must be constructed, and researchers often manually curate them for optimal results. Biologists use tools like BLAST (Basic Local Alignment Search Tool) to search with their target sequences for “homologs,” or similar sequences. If you’re studying humans, this could mean finding sequences from other mammals, vertebrates, or more distant organisms. The sequences are then manually selected, considering things like comparable lengths and similar functions, because including too many sequences with divergent functions degrades the quality of the MSA. This is a HUGE difference from how training data is collected for natural language models, which are trained on huge swaths of data hoovered up from anywhere and everywhere. Biology models, by contrast, need highly skilled and conscientious dataset curators.
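    For a flavor of the homolog-search step, here is a minimal sketch using Biopython’s interface to NCBI BLAST. The query sequence is hypothetical, and the call requires network access to NCBI, so treat it as illustrative:

    ```python
    # Minimal sketch: submit a protein query to NCBI BLAST via Biopython.
    from Bio.Blast import NCBIWWW, NCBIXML

    query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical target sequence

    # Search the non-redundant protein database for homologs (slow; runs remotely).
    result_handle = NCBIWWW.qblast("blastp", "nr", query)
    record = NCBIXML.read(result_handle)

    # Candidate homologs a biologist would then curate by length and function.
    for alignment in record.alignments[:5]:
        print(alignment.title, alignment.hsps[0].expect)
    ```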

    What is being predicted/output?

    In BERT, the prediction or target is the masked token or next sentence. For AlphaFold, the target is the 3D structure of the protein, represented as the 3D coordinates of protein atoms, which defines the spatial arrangement of amino acids in a folded protein. Each set of 3D coordinates is collected experimentally, reviewed, and stored in the Protein Data Bank (PDB). Recently solved structures serve as a validation set for evaluation.

    The output of AlphaFold is the 3D structure of a protein, which consists of the x, y, z coordinates of the atoms that make up the protein’s amino acids.
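    To see what these targets actually look like, here is a minimal sketch using Biopython to download one solved structure from the PDB and read out its atom coordinates; the PDB ID 1CRN (crambin, a small plant protein) is just an arbitrary example:

    ```python
    from Bio.PDB import PDBList, PDBParser

    # Download an experimentally solved structure from the Protein Data Bank.
    path = PDBList().retrieve_pdb_file("1CRN", file_format="pdb", pdir=".")
    structure = PDBParser(QUIET=True).get_structure("1CRN", path)

    # Each atom carries the x, y, z coordinates AlphaFold learns to predict.
    for atom in list(structure.get_atoms())[:5]:
        print(atom.get_name(), atom.get_coord())
    ```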

    How are the inputs and outputs tied together?

    Both the target sequence and the MSA are processed independently through a series of transformer blocks, using the self-attention mechanism to generate embeddings. The MSA embedding captures evolutionary relationships, while the target sequence embedding captures local context. These contextual embeddings are then fed into downstream layers to predict pairwise interactions between amino acids, ultimately inferring the protein’s 3D structure.

    Within each sequence, the pairwise residue representation (the relationship or interaction between two amino acids within a protein sequence) predicts spatial distances and orientations between residues, which are critical for modeling how distant parts of the protein come into proximity when folded. The self-attention mechanism allows the model to account for both local and long-range dependencies within the sequence and the MSA. This is important because when a sequence is folded, residues that are far from each other in the sequence may end up close to each other spatially.
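    A toy sketch of that geometric idea, with fabricated C-alpha coordinates: residues four positions apart in the sequence can still sit within contact distance in 3D. The 8-angstrom cutoff below is a common convention for defining a residue contact.

    ```python
    import numpy as np

    # Fabricated C-alpha coordinates (angstroms) for a 5-residue chain
    # that bends back on itself.
    coords = np.array([
        [0.0, 0.0, 0.0],
        [3.8, 0.0, 0.0],
        [7.6, 0.0, 0.0],
        [7.6, 3.8, 0.0],
        [3.8, 3.8, 0.0],
    ])

    # Pairwise Euclidean distance matrix, shape (5, 5).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Residues 0 and 4 are far apart in sequence but in contact in space.
    contact_map = (dist < 8.0).astype(int)
    print(contact_map)
    ```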

    The loss function for AlphaFold is considerably more complex than the BERT loss function. BERT faces no spatial or geometric constraints, and its loss function is much simpler because it only needs to predict missing words or sentence relationships. In contrast, AlphaFold’s loss function involves multiple aspects of protein structure (distance distributions, torsion angles, 3D coordinates, etc.), and the model optimizes for both geometric and spatial predictions. This component-heavy loss function ensures that AlphaFold accurately captures the physical properties and interactions that define the protein’s final structure.
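    For intuition only, here is an invented multi-term loss in PyTorch. The terms and weights are illustrative, not AlphaFold’s actual loss (which uses components such as a distogram loss and the frame-aligned point error), but they show how several structural objectives can be combined into one scalar to optimize:

    ```python
    import torch
    import torch.nn.functional as F

    def toy_structure_loss(pred_dist, true_dist, pred_torsion, true_torsion,
                           pred_coords, true_coords):
        """Invented multi-term structural loss; not AlphaFold's actual loss."""
        # Penalize errors in predicted pairwise residue distances.
        distance_loss = F.mse_loss(pred_dist, true_dist)
        # Torsion angles are periodic, so compare their sines and cosines.
        torsion_loss = F.mse_loss(
            torch.stack([pred_torsion.sin(), pred_torsion.cos()]),
            torch.stack([true_torsion.sin(), true_torsion.cos()]),
        )
        # Penalize deviation of the final 3D coordinates.
        coord_loss = F.mse_loss(pred_coords, true_coords)
        # Weighted sum; the weights here are arbitrary.
        return 1.0 * distance_loss + 0.5 * torsion_loss + 1.0 * coord_loss
    ```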

    While there is essentially no meaningful post-processing required for BERT predictions, AlphaFold’s predicted 3D coordinates undergo energy minimization and geometric refinement based on the physical principles of proteins. These steps ensure that predicted structures are physically viable and biologically functional.

    Conclusion

    • AlphaFold and BERT both benefit from the transformer architecture and the self-attention mechanism, which improve contextual embeddings and enable faster training on GPUs and TPUs.
    • AlphaFold has a much more complex data preparation process than BERT. Curating MSAs from experimentally derived data is harder than vacuuming up a large corpus of text!
    • AlphaFold’s loss function must account for spatial and geometric constraints, making it much more complex than BERT’s.
    • AlphaFold predictions require post-processing to confirm that they are physically viable, whereas BERT predictions do not.

    Thank you for reading this far! I’m a big believer in cross-functional learning and I believe as machine learning engineers we can learn more by challenging ourselves to learn outside our immediate domains. I hope to continue this series on Understanding AI Applications in Bio for Machine Learning Engineers throughout my maternity leave. ❤


    AlphaFold 2 Through the Context of BERT was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


  • Fourth watchOS 11.1, tvOS 18.1, and visionOS 2.1 have arrived


    Continuing its beta testing process for this series, Apple has issued fourth developer builds of watchOS 11.1, tvOS 18.1, and visionOS 2.1.

    watchOS 11 introduces better cycle tracking, more customizable Activity Rings, and more personalization.

    This period of beta releases is a bit unusual, as Apple is providing builds in two distinct groups. This is caused by the beta testing of iOS 18.1, iPadOS 18.1, and macOS Sequoia 15.1 to hammer out issues with Apple Intelligence.

    Since those three entered testing far earlier, the betas are split into two groups, with the Apple Intelligence-infused builds separate from the others that don’t have it.

    Continue Reading on AppleInsider | Discuss on our Forums


  • Apple’s six developer betas land for iOS 18.1, iPadOS 18.1, macOS Sequoia 15.1


    Apple’s testing of Apple Intelligence continues, with the sixth developer betas of iOS 18.1, iPadOS 18.1, and macOS Sequoia 15.1 now available.

    Examples of Apple Intelligence at work.

    The sixth developer betas of iOS 18.1, iPadOS 18.1, and macOS Sequoia 15.1 arrive after the third builds of visionOS 2.1, tvOS 18.1, and watchOS 11.1, which arrived on October 1. The fifth and second builds, respectively, landed on September 23, while the fourth and first were introduced on September 17.

    The difference in build counts is due to Apple beta testing tvOS 18, watchOS 11, and visionOS 2 at the same time as the Apple Intelligence-infused iOS 18.1, iPadOS 18.1, and macOS Sequoia 15.1.

    Continue Reading on AppleInsider | Discuss on our Forums


  • Zombie-horror ‘Resident Evil 2’ heads to Mac on Dec 31


    Capcom’s “Resident Evil 2” is on the way, with the zombie-based game lurching into the Mac App Store on December 31.

    Resident Evil 2 [Mac App Store]

    The Resident Evil franchise has been slowly moving onto Apple’s ecosystem, with Resident Evil 4 and Resident Evil 7: Biohazard having already arrived on Mac and Apple’s other platforms in 2024. There is one more fright-fest on the way, in the form of Resident Evil 2.

    Briefly mentioned as on the way over the summer, Resident Evil 2 has since surfaced in the Mac App Store. It is listed as available for preorder, with an expected release date of December 31, 2024.

    Continue Reading on AppleInsider | Discuss on our Forums


  • Apple iPads are back down to as low as $199 for Prime Big Deals Days


    Amazon’s iPad deals for Prime Day offer discounts of up to $200 off and prices as low as $199.

    Prime Day iPad deals are in effect now. [Apple/AppleInsider]

    Prime Day officially starts at midnight Pacific Time on Oct. 8, but Amazon is getting a head start by discounting iPads today. The entire lineup is on sale, ranging from the budget-friendly $199 9th Gen to $200 off select iPad Pros. Fresh Apple Pencil deals are in effect as well to pair with a new or existing iPad.

    Below is a curated list of the top offers:

    Continue Reading on AppleInsider


  • Congo government plans to crack down on buyers like Apple for conflict minerals

    The Democratic Republic of Congo (DRC) is considering legal action against tech companies such as Apple to reduce the amount of conflict minerals sourced from its eastern provinces.

    Mine with acid lake | Credit: dimitrisvetsikas1969 on Pixabay

    The Eastern Congo is the world’s biggest supplier of tantalum, a conductive metal used in devices like the iPhone. Because of this, more than 100 militias have sought to control the tantalum trade.

    In 2024, rebel group M23 took control of the largest tantalum mine, Rubaya. Congo, the US, and United Nations experts say Rwanda has sent thousands of troops to Congo to back the M23 — though Rwanda denies the allegations.

    Continue Reading on AppleInsider | Discuss on our Forums


  • SmartThings adds Matter 1.3, eufy S3 Cam Pro launches, & more on HomeKit Insider


    On this episode of the HomeKit Insider Podcast, eufy launches a new Apple Home camera system, Sonos makes a pledge, and more gear is released.

    HomeKit Insider Podcast

    To kick off the news recap, we start with the latest from Sonos. After adding alarms back to the app in its most recent update, the company has now made a new commitment to quality.

    Its CEO laid out a multi-point promise to ensure the highest level of quality for its products going forward. This includes a new quality ombudsperson, regular reports, and internal checks.

    Continue Reading on AppleInsider | Discuss on our Forums


  • Best Prime Day Fire TV deals to shop in October 2024

    October Prime Day is just one day away, so it’s a great time to buy a new TV for a discount, especially if you’re interested in Amazon’s Fire TV brand.


  • The 30+ best computer monitor deals for October Prime Day

    We’re seeing discounts on the top computer monitors on the eve of Amazon’s October Prime Day sale. Below is a detailed list of all of the best monitor deals we found.


  • Best Prime Day laptop deals to shop in October 2024

    Amazon’s October Prime Day starts tomorrow, but we’ve got our eyes on early deals live now, including sales on Apple MacBooks and laptops from Asus, Lenovo, Microsoft, and more.
