Transformers — Attention is all you need!

Why did the Transformer model go to therapy?

Because it had trouble paying attention to its own emotions and kept getting lost in self-attention!

Introduction

The Transformer model is a groundbreaking deep learning architecture that has had a profound impact on the field of natural language processing (NLP) and machine translation. It was introduced in 2017 in a seminal paper titled “Attention is All You Need” by Vaswani et al. Since then, it has become the preferred model for various sequence-to-sequence tasks, surpassing traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs).

At its core, the Transformer model employs a self-attention mechanism to capture interdependencies between different positions in a sequence. Unlike RNNs, which process tokens one step at a time, the Transformer can process the entire sequence at once. This parallelization greatly improves training and inference efficiency, making it particularly adept at handling lengthy sequences.

The self-attention mechanism allows the Transformer to assign varying weights to different elements in a sequence when computing its representation. This enables the model to focus on the most relevant information by assigning attention scores based on each element’s relevance to others. Consequently, the Transformer can capture long-range dependencies and learn intricate relationships between sequence elements.

Positional encoding is another crucial aspect of the Transformer model. Since the model lacks an inherent understanding of element order in a sequence, positional encoding is incorporated into input embeddings. This encoding provides the model with information about the relative and absolute positions of elements within the sequence.

The Transformer model comprises an encoder and a decoder. In tasks such as machine translation, the encoder processes the input sentence, while the decoder generates the output sentence in the target language. Both the encoder and decoder consist of multiple layers of self-attention and feed-forward neural networks. This layering facilitates the model’s ability to capture increasingly abstract representations as information flows through the network.

Thanks to its capacity to handle long-range dependencies, parallelize computations, and effectively capture contextual information, the Transformer model has garnered widespread adoption in various NLP applications. It has achieved state-of-the-art results in machine translation, text summarization, question answering, sentiment analysis, and other tasks. The Transformer’s versatility, scalability, and powerful attention mechanism have solidified its position as a fundamental component of contemporary deep learning architectures in NLP.

Let’s first understand Self Attention —

Self-attention, also known as intra-attention, is a core component of the Transformer model, enabling it to capture long-range dependencies and contextual relationships between different positions in a sequence. In the Transformer it is implemented with scaled dot-product attention and is employed within both the encoder and decoder sections of the architecture.

Here’s an in-depth breakdown of the self-attention mechanism in the Transformer:

  1. Key, Query, and Value: The self-attention mechanism involves three learnable linear transformations: the key (K), query (Q), and value (V) matrices. These transformations are applied to the input sequence's embeddings, generating corresponding key, query, and value vectors. Each position in the sequence has its own key, query, and value vectors.
  2. Calculating Attention Scores: The attention scores determine the importance of each position in the sequence relative to other positions. They are computed by taking the dot product between the query vectors and the corresponding key vectors. The dot products are scaled by the square root of the dimension of the key vectors to avoid overly large or small values.
  3. Attention Weights: The attention scores are transformed into attention weights through a softmax function, which normalizes the scores and ensures they form a valid probability distribution. The attention weights represent the relative importance of each position in the sequence for the current position.
  4. Weighted Sum of Values: The attention weights are then used to take a weighted sum of the value vectors, so that values with higher attention weights contribute more to the final representation of the current position. This weighted sum is the attended output, capturing the contextual information and relevant dependencies. (A small code sketch of these four steps follows this list.)
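To make the four steps above concrete, here is a minimal NumPy sketch of scaled dot-product self-attention for a single sequence. The projection matrices and toy dimensions are random, illustrative values, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence.

    X: (seq_len, d_model) input embeddings
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    """
    Q = X @ W_q                      # queries, one per position (step 1)
    K = X @ W_k                      # keys
    V = X @ W_v                      # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaled dot-product scores (step 2)
    weights = softmax(scores)        # attention weights, each row sums to 1 (step 3)
    return weights @ V               # weighted sum of values (step 4)

# toy example: 4 positions, d_model = d_k = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```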

By utilizing self-attention, the Transformer model can effectively capture global dependencies and model interactions between distant positions in the sequence. The attention mechanism enables the model to focus on different parts of the sequence dynamically, assigning higher weights to the most relevant positions for each query.

The self-attention mechanism, with its ability to model complex relationships and dependencies, plays a pivotal role in various natural language processing tasks. It allows the model to effectively process and represent sequential data, leading to improved performance in machine translation, sentiment analysis, text generation, and other tasks.

Fig 1.1 Transformer Architecture

As we can see, the Transformer is basically an Encoder + a Decoder,

So let’s understand each individually —

Encoder

Fig 1.2 — Encoder (The Part in the Red Box)

The part in the red box is basically the encoder. It has the following components —

  1. Input Embedding: The input embedding is the initial step in the Transformer model. It transforms the input sequence into continuous vector representations. Each element in the sequence, such as a word or character, is mapped to a high-dimensional vector using an embedding matrix. This embedding helps the model capture the semantic meaning and relationships between different elements in the sequence.
  2. Positional Encoding: Since the Transformer model lacks inherent knowledge of the element order in a sequence, positional encoding is introduced. It provides information about the relative and absolute positions of the elements. Positional encoding involves adding specific encoding vectors to the input embeddings. These vectors carry positional information, enabling the model to distinguish between the positions of elements in the sequence (a short code sketch follows this list).
  3. Multi-head Attention: Multi-head attention is a critical component of the Transformer model that captures dependencies and relationships between different elements in the input sequence. It attends to different parts of the sequence simultaneously, using multiple sets of query, key, and value matrices. Each attention head learns to focus on different aspects of the sequence, enhancing the model’s ability to capture various types of information. Within the multi-head attention mechanism, three linear transformations generate query (Q), key (K), and value (V) matrices from the input representations. These matrices compute attention scores, indicating the importance of each element in relation to others. The output of the multi-head attention is obtained by weighted summation of the values, where the weights are determined by the attention scores.
  4. Feed-Forward Neural Network (FFNN): The feed-forward neural network is a crucial component within each layer of the Transformer model. It processes the output of the multi-head attention mechanism. The feed-forward network consists of two linear transformations with a non-linear activation function, typically a Rectified Linear Unit (ReLU), applied in between. This allows the model to capture complex interactions and create higher-level representations of the sequence. Residual connections and layer normalization are applied to the output of the feed-forward neural network, similar to other sub-layers in the Transformer model. Residual connections enable improved information flow and address vanishing gradient problems, while layer normalization stabilizes training by normalizing the representations. We will look at the FFNN in more detail later in this post.
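As a concrete illustration of the positional encoding mentioned in item 2, here is a small NumPy sketch of the sinusoidal encoding used in the original paper; the sequence length and model dimension below are arbitrary toy values:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from "Attention is All You Need".

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even dimensions: (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices get sine
    pe[:, 1::2] = np.cos(angles)   # odd indices get cosine
    return pe

# the encodings are simply added to the input embeddings:
# embeddings = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(sinusoidal_positional_encoding(seq_len=10, d_model=16).shape)  # (10, 16)
```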

Decoder

Fig 1.3 — Decoder (The Part in the Red Box)

The part in the red box is basically the Decoder. Its components largely mirror those of the Encoder.

So basically, it has just one additional component compared to the Encoder, and that is Masked Multi-Head Attention.

Masked Multi-Head Attention —

Within the Transformer model’s decoder, a modified version of multi-head attention known as “masked multi-head attention” is employed. It allows the model to attend selectively to previous positions in the target sequence during training, ensuring the autoregressive nature of the predictions.

The masked multi-head attention in the decoder consists of the following components:

  1. Masking: Prior to calculating attention scores, a masking mechanism is applied to restrict the decoder’s access to future positions in the target sequence during training. This restriction ensures that the model generates predictions based only on the previously generated positions, maintaining the autoregressive property.
  2. Query: The query inputs for the masked multi-head attention are derived from the decoder's representation of the current target position. These query vectors are used to compute attention scores over the previously generated positions of the target sequence.
  3. Key and Value: The key and value inputs are likewise obtained from the decoder's own representations of the target sequence, making masked multi-head attention a self-attention over the (partially generated) output. Attending to the encoder's output happens in a separate encoder-decoder attention sub-layer, whose keys and values come from the encoder and let the decoder focus on relevant positions in the source sequence.

Similar to the standard multi-head attention, masked multi-head attention computes attention scores by comparing the query vectors with the key vectors. These attention scores determine the importance of each earlier target position for the current target position. The weighted sum of the value vectors, using the attention scores as weights, yields the attended output.
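Here is a minimal NumPy sketch of how the look-ahead mask can be applied to the attention scores before the softmax; the large negative constant and the toy dimensions are illustrative choices:

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean mask that is True wherever position i would attend to a future position j > i."""
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal (look-ahead) mask."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores[causal_mask(len(Q))] = -1e9                  # block attention to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V

# toy check: with 3 positions, position 0 can only attend to itself
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(masked_self_attention(Q, K, V).shape)  # (3, 4)
```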

Fig 1.4 — Feed Forward Neural Network

The feed-forward neural network —

It is a vital element within each layer of the Transformer model, present in both the encoder and decoder. It plays a crucial role in processing the representations generated by the self-attention mechanism and capturing complex interactions within the sequence.

Here are the key aspects and operations involved in the feed-forward neural network within the Transformer:

  1. Structure: The feed-forward neural network comprises two linear transformations with a non-linear activation function applied in between. These linear transformations are fully connected layers that project the input representations into a higher-dimensional space and then back to the original dimension.
  2. Non-linear Activation: Typically, a Rectified Linear Unit (ReLU) activation function is applied after the first linear transformation in the feed-forward neural network. This introduces non-linearity to the network, enabling it to model complex patterns and relationships within the data.
  3. Position-wise Operation: The feed-forward neural network operates independently on each position within the sequence. This position-wise operation ensures that the network can capture local patterns and dependencies within the sequence while being agnostic to the overall order of elements.
  4. Parameters: The feed-forward neural network consists of learnable parameters, including weights and biases, associated with linear transformations. These parameters are optimized during the training process to minimize the discrepancy between the model’s predictions and the target outputs.
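Here is a minimal NumPy sketch of the structure described above. The dimensions below are toy values (for reference, the original paper uses d_model = 512 and d_ff = 2048):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Position-wise feed-forward network: two linear layers with a ReLU in between.

    X: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model).
    The same weights are applied independently at every position.
    """
    hidden = np.maximum(0, X @ W1 + b1)   # first linear projection + ReLU
    return hidden @ W2 + b2               # project back to d_model

# toy dimensions
d_model, d_ff, seq_len = 8, 32, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(position_wise_ffn(X, W1, b1, W2, b2).shape)  # (5, 8)
```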

The feed-forward neural network enhances the representations generated by the self-attention mechanism, introducing non-linear transformations and enabling the model to capture complex relationships between elements in the sequence. By operating on each position independently, the network can capture local patterns and generate higher-level representations of the sequence.

To facilitate information flow and stabilize training, residual connections and layer normalization are typically applied after the feed-forward neural network. Residual connections allow for the direct flow of information from the input to the output of the network, aiding in gradient propagation. Layer normalization helps normalize the representations, improving stability and performance.

So that’s some overview of the Transformer Neural Network.

There are many applications for it, which I will post about in future blogs. So stay tuned.

With this, we come to the end of this blog. Did you like what you read? Would you like to know more about me? Well, guess what, I'm on LinkedIn, so go ahead and reach out!

With ❤,
Jatin S.

Image Credit — https://www.tensorflow.org/