Summary Transformer (deep learning architecture) - Wikipedia en.wikipedia.org
7,267 words - html page
One Line
Google's transformer architecture converts text into numerical tokens and contextualizes them in parallel with multi-head attention, enabling efficient training and a wide range of applications.
Key Points
- The transformer is a deep learning architecture developed by Google, based on the multi-head attention mechanism, and used in natural language processing, computer vision, audio, and multi-modal processing
- Transformers typically undergo self-supervised learning involving unsupervised pretraining followed by supervised fine-tuning, and have had great success in various NLP tasks such as machine translation, document summarization, and named entity recognition
- The transformer architecture has been implemented in standard deep learning frameworks like TensorFlow and PyTorch, and consists of key components like tokenizers, embedding layers, transformer layers, and un-embedding layers
- Advancements in the transformer architecture include the development of FlashAttention-2 for efficient GPU implementation, Random Feature Attention and Performer for linear-time attention computation, and speculative decoding for more efficient token generation
- Transformers can be adapted for modalities beyond text, such as computer vision and speech recognition, by treating input data as a sequence of tokens, and have seen significant advancements in terms of efficiency, adaptability, and performance
Summaries
19 word summary
Google's transformer tokenizes text and contextualizes the tokens with multi-head attention; lacking recurrent units, it trains faster. It's used in various applications.
60 word summary
The transformer, developed by Google, converts text into numerical tokens and contextualizes them with multi-head attention; having no recurrent units, it requires less training time than earlier architectures. It is used in natural language processing, computer vision, audio, and multi-modal processing, leading to the development of pre-trained systems like GPTs and BERT. Predecessors of the attention mechanism were added to gated recurrent neural networks before transformers.
138 word summary
The transformer, developed by Google, converts text into numerical tokens and contextualizes them with multi-head attention; having no recurrent units, it requires less training time than earlier architectures. It is used in natural language processing, computer vision, audio, and multi-modal processing, leading to the development of pre-trained systems like GPTs and BERT. Predecessors of the attention mechanism were added to gated recurrent neural networks before transformers. The plain transformer architecture had convergence issues, but normalizing layers before multiheaded attention solved this. Transformers typically undergo self-supervised learning and have been successful in NLP, machine translation, document summarization, and video understanding. They have also been implemented in standard deep learning frameworks and can be adapted for modalities beyond text, such as computer vision and speech recognition. Key advancements include FlashAttention-2, Random Feature Attention, Performer, speculative decoding, Vision transformers, Conformer, Whisper, Perceivers, and diffusion transformers.
403 word summary
The transformer is a deep learning architecture developed by Google and based on the multi-head attention mechanism. It converts text to numerical representations called tokens, contextualizes each token within a context window with other tokens, and has no recurrent units, requiring less training time than previous architectures. The transformer architecture is used in natural language processing, computer vision, audio, and multi-modal processing, and has led to the development of pre-trained systems such as generative pre-trained transformers (GPTs) and BERT.
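The contextualization step described above can be sketched with scaled dot-product attention. This is a minimal NumPy illustration, not the source's code: the weight matrices, dimensions, and random inputs are all invented for the example, and multi-head attention would run several such maps in parallel and concatenate their outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Contextualize each token embedding against every other token
    in the window: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq, seq) pairwise similarities
    return softmax(scores) @ V        # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8       # illustrative sizes
X = rng.normal(size=(seq_len, d_model))               # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextualized vector per token
```

Because every token attends to every other token in one matrix product, the whole window is processed in parallel, which is what removes the sequential bottleneck of recurrent units.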
Before transformers, predecessors of the attention mechanism were added to gated recurrent neural networks, such as LSTMs and GRUs. In 1992, the fast weight controller was proposed as an alternative to recurrent neural networks that can learn "internal spotlights of attention". In 2016, highly parallelizable decomposable attention was successfully combined with a feedforward network. The plain transformer architecture had difficulty converging; Xiong et al. solved this by normalizing layers before, rather than after, the multiheaded attention.
Transformers typically undergo self-supervised learning involving unsupervised pretraining followed by supervised fine-tuning. The transformer has had great success in natural language processing (NLP), machine translation, document summarization, and video understanding. It has also been successful in other fields, such as computer vision and protein folding applications.
The transformer model has been implemented in standard deep learning frameworks such as TensorFlow and PyTorch. All transformers share the same primary components: tokenizers, a single embedding layer, transformer layers, and an un-embedding layer. Transformers may use positional encoding methods other than the sinusoidal encoding, such as RoPE (rotary positional embedding). ALiBi (Attention with Linear Biases) is a positional encoding that is plugged directly into the attention mechanism.
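The sinusoidal positional encoding mentioned above has a closed form that is easy to sketch. This is an illustrative NumPy version of the original paper's formula (the sizes are made up for the example):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encoding from the original transformer:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(seq_len)[:, None]            # (seq, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return pe

pe = sinusoidal_positions(16, 8)
print(pe.shape)  # (16, 8)
print(pe[0])     # position 0: all sine terms 0, all cosine terms 1
```

These vectors are simply added to the token embeddings, giving the otherwise order-blind attention mechanism a notion of position.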
Key advancements in the field include the development of FlashAttention-2, which achieves up to 230 TFLOPs/s on A100 GPUs, and the introduction of Random Feature Attention and Performer. Additionally, speculative decoding has been proposed as a method to use spare compute power by computing several tokens in parallel, allowing for more efficient generation of tokens.
Transformers can be adapted for modalities beyond text, such as computer vision and speech recognition. Vision transformers break down input images into patches and treat them like tokens in a standard transformer. Conformer and Whisper follow a similar pattern for speech recognition, turning the speech signal into a spectrogram and treating it like an image. Perceivers can learn from large amounts of heterogeneous data, while diffusion transformers facilitate the use of the transformer architecture for diffusion-based image production.
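The patch-based tokenization used by vision transformers can be sketched in a few lines. This NumPy example is illustrative only (image size, patch size, and the helper name are assumptions); a real vision transformer would additionally project each flattened patch through a learned linear layer.

```python
import numpy as np

def image_to_patches(img, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles
    and flatten each tile into one 'token' vector, the way vision
    transformers tokenize images."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    tiles = img.reshape(H // patch, patch, W // patch, patch, C)
    tiles = tiles.transpose(0, 2, 1, 3, 4)        # (rows, cols, p, p, C)
    return tiles.reshape(-1, patch * patch * C)   # (num_patches, patch_dim)

img = np.arange(32 * 32 * 3, dtype=float).reshape(32, 32, 3)
tokens = image_to_patches(img, patch=16)
print(tokens.shape)  # (4, 768): four 16x16x3 patches as token vectors
```

From here on, the patch vectors are treated exactly like text tokens: positional encodings are added and the sequence is fed to a standard transformer.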
700 word summary
The transformer is a deep learning architecture developed by Google and based on the multi-head attention mechanism. It converts text to numerical representations called tokens, contextualizes each token within a context window with other tokens, and has no recurrent units, requiring less training time than previous architectures. The transformer architecture is used in natural language processing, computer vision, audio, and multi-modal processing, and has led to the development of pre-trained systems such as generative pre-trained transformers (GPTs) and BERT.
Before transformers, predecessors of the attention mechanism were added to gated recurrent neural networks, such as LSTMs and GRUs. In 1992, the fast weight controller was proposed as an alternative to recurrent neural networks that can learn "internal spotlights of attention". In 2016, highly parallelizable decomposable attention was successfully combined with a feedforward network. The plain transformer architecture had difficulty converging; Xiong et al. solved this by normalizing layers before, rather than after, the multiheaded attention.
Transformers typically undergo self-supervised learning involving unsupervised pretraining followed by supervised fine-tuning. Pretraining is typically done on a larger dataset than fine-tuning, due to the limited availability of labeled training data. The transformer has had great success in natural language processing (NLP), for example in machine translation, document summarization, document generation, named entity recognition (NER), biological sequence analysis, writing computer code based on requirements expressed in natural language, and video understanding. It has also been successful in other fields, such as computer vision and protein folding applications.
The transformer model has been implemented in standard deep learning frameworks such as TensorFlow and PyTorch. All transformers share the same primary components: tokenizers, a single embedding layer, transformer layers, and an un-embedding layer. Transformers may use positional encoding methods other than the sinusoidal encoding, such as RoPE (rotary positional embedding). ALiBi (Attention with Linear Biases) is a positional encoding that is plugged directly into the attention mechanism.
In conclusion, the transformer architecture has revolutionized deep learning by providing a more efficient and effective way to process and analyze data in various fields such as natural language processing, computer vision, and audio processing. It has paved the way for the development of pre-trained systems and has been implemented in standard deep learning frameworks. The transformer's success in various applications demonstrates its potential for real-world use and its ability to perform a wide variety of tasks in natural language processing.
The scaled dot-product (softmax) attention formula is the core of the Transformer architecture. ALiBi allows pretraining on short context windows followed by fine-tuning on longer ones; it can be combined with any positional encoder and is plugged directly into the attention mechanism. Relative Position Encoding is similar to ALiBi, but more generic, using a Toeplitz matrix. FlashAttention is an algorithm that implements the transformer attention mechanism efficiently on a GPU, performing matrix multiplications in blocks to minimize data copying between GPU caches. FlashAttention-2 adds improvements in work partitioning and parallelism, achieving up to 230 TFLOPs/s on A100 GPUs. Multi-Query Attention modifies the multiheaded attention mechanism to increase inference speed without degrading model quality or training speed.
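For reference, the attention formula mentioned above, and the linear bias that ALiBi adds inside the softmax, can be written as follows (the per-head slope $s$ is a fixed constant, and the form of $B$ is a sketch of the idea rather than the paper's exact notation):

```latex
\mathrm{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\qquad
\mathrm{ALiBi}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + s \cdot B\right) V,
\quad B_{ij} = -(i - j)
```

Because the bias depends only on the distance between positions, not on learned parameters, it extrapolates naturally to context lengths longer than those seen in pretraining.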
Key advancements in the field include the development of FlashAttention-2, which achieves up to 230 TFLOPs/s on A100 GPUs, and the introduction of Random Feature Attention and Performer, which use Fourier random features to compute attention matrices in linear time. Additionally, speculative decoding has been proposed as a method to use spare compute power by computing several tokens in parallel, allowing for more efficient generation of tokens.
Transformers can be adapted for modalities beyond text, such as computer vision and speech recognition. Vision transformers break down input images into patches and treat them like tokens in a standard transformer. Conformer and Whisper follow a similar pattern for speech recognition, turning the speech signal into a spectrogram and treating it like an image. Perceivers can learn from large amounts of heterogeneous data, while diffusion transformers facilitate the use of the transformer architecture for diffusion-based image production.
In conclusion, the Transformer architecture has seen significant advancements in recent years, particularly in terms of efficiency, adaptability to different modalities, and improvements in performance. These advancements have the potential to further enhance the capabilities of transformers in various applications, including natural language processing, computer vision, and speech recognition.
837 word summary
The transformer is a deep learning architecture developed by Google and based on the multi-head attention mechanism. It converts text to numerical representations called tokens, contextualizes each token within a context window with other tokens, and has no recurrent units, requiring less training time than previous architectures. The transformer architecture is used in natural language processing, computer vision, audio, and multi-modal processing, and has led to the development of pre-trained systems such as generative pre-trained transformers (GPTs) and BERT.
Before transformers, predecessors of the attention mechanism were added to gated recurrent neural networks, such as LSTMs and GRUs. In 1992, the fast weight controller was proposed as an alternative to recurrent neural networks that can learn "internal spotlights of attention". In 2016, highly parallelizable decomposable attention was successfully combined with a feedforward network. The plain transformer architecture had difficulty converging; Xiong et al. solved this by normalizing layers before, rather than after, the multiheaded attention.
Transformers typically undergo self-supervised learning involving unsupervised pretraining followed by supervised fine-tuning. Pretraining is typically done on a larger dataset than fine-tuning, due to the limited availability of labeled training data. The transformer has had great success in natural language processing (NLP), for example in machine translation, document summarization, document generation, named entity recognition (NER), biological sequence analysis, writing computer code based on requirements expressed in natural language, and video understanding. It has also been successful in other fields, such as computer vision and protein folding applications.
The transformer model has been implemented in standard deep learning frameworks such as TensorFlow and PyTorch. All transformers share the same primary components: tokenizers, a single embedding layer, transformer layers, and an un-embedding layer. Transformers may use positional encoding methods other than the sinusoidal encoding, such as RoPE (rotary positional embedding). ALiBi (Attention with Linear Biases) is a positional encoding that is plugged directly into the attention mechanism.
The original transformer uses the ReLU activation function; other activation functions, such as SwiGLU, were developed later. In large language models the terminology differs somewhat from the original Transformer paper: "encoder only" refers to full encoder, full decoder; "encoder-decoder" refers to full encoder, autoregressive decoder; and "decoder only" refers to autoregressive encoder, autoregressive decoder.
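The SwiGLU variant mentioned above gates one linear projection with a Swish of another. A minimal NumPy sketch, with invented weights and sizes (real implementations also fold this into a larger feed-forward block):

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish / SiLU: x * sigmoid(beta * x)
    return x / (1.0 + np.exp(-beta * x))

def swiglu(x, W, V):
    """SwiGLU feed-forward gate: Swish(xW) * (xV).
    Replaces the single ReLU(xW) nonlinearity of the original block."""
    return swish(x @ W) * (x @ V)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                        # 4 token vectors
W, V = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
print(swiglu(x, W, V).shape)  # (4, 16)
```

The gating lets the network modulate each hidden unit smoothly instead of hard-clipping negatives as ReLU does.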
In conclusion, the transformer architecture has revolutionized deep learning by providing a more efficient and effective way to process and analyze data in various fields such as natural language processing, computer vision, and audio processing. It has paved the way for the development of pre-trained systems and has been implemented in standard deep learning frameworks. The transformer's success in various applications demonstrates its potential for real-world use and its ability to perform a wide variety of tasks in natural language processing.
The scaled dot-product (softmax) attention formula is the core of the Transformer architecture. ALiBi allows pretraining on short context windows followed by fine-tuning on longer ones; it can be combined with any positional encoder and is plugged directly into the attention mechanism. Relative Position Encoding is similar to ALiBi, but more generic, using a Toeplitz matrix. FlashAttention is an algorithm that implements the transformer attention mechanism efficiently on a GPU, performing matrix multiplications in blocks to minimize data copying between GPU caches. FlashAttention-2 adds improvements in work partitioning and parallelism, achieving up to 230 TFLOPs/s on A100 GPUs. Multi-Query Attention modifies the multiheaded attention mechanism to increase inference speed without degrading model quality or training speed.
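The shape change behind Multi-Query Attention is simple to sketch: many query heads share a single key/value head, shrinking the key/value cache at inference time. This NumPy example is illustrative only; all sizes and names are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_query_attention(Q, K, V):
    """Multi-Query Attention: several query heads, one shared K/V head.
    Q: (heads, seq, d); K and V: (seq, d), shared by all heads."""
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # (heads, seq, seq)
    return softmax(scores) @ V      # shared V broadcasts over heads

rng = np.random.default_rng(0)
heads, seq, d = 8, 5, 16
Q = rng.normal(size=(heads, seq, d))
K, V = rng.normal(size=(seq, d)), rng.normal(size=(seq, d))
print(multi_query_attention(Q, K, V).shape)  # (8, 5, 16)
```

Standard multi-head attention would cache `heads` copies of K and V per token; here only one copy is cached, which is where the inference speedup comes from.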
Speculative decoding uses spare compute power by computing several tokens in parallel, allowing for more efficient generation of tokens. Sub-quadratic transformers reduce the computational load for long inputs: Reformer uses locality-sensitive hashing and reversible layers, while ETC/BigBird uses sparse attention patterns. Attention-free transformers reduce memory size while retaining the advantages of a transformer. Random Feature Attention uses Fourier random features to compute the attention matrix in linear time. Performer uses the same Random Feature Attention, but with the random features orthogonalized by the Gram-Schmidt process.
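The accept/reject loop of speculative decoding can be illustrated with a greedy toy version. Everything here is an assumption for the sketch: the integer "tokens", the stand-in `target_next`/`draft_next` functions, and the function name itself; real systems verify the draft tokens in one batched forward pass of the large model and handle sampling, not just greedy decoding.

```python
def speculative_decode(target_next, draft_next, prompt, k, max_new):
    """Toy greedy speculative decoding: a cheap draft model proposes k
    tokens; the target model checks them and keeps the longest agreeing
    prefix, plus one corrected token on the first disagreement."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # Draft proposes k tokens autoregressively (cheap model).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # Target verifies the k positions (done in parallel in practice).
        for i in range(k):
            t = target_next(seq + proposal[:i])
            if t != proposal[i]:
                seq.extend(proposal[:i] + [t])  # accept prefix, correct once
                break
        else:
            seq.extend(proposal)                # target agreed with all k
    return seq[:len(prompt) + max_new]

# Toy deterministic next-token functions over integer tokens:
target = lambda s: (sum(s) + 1) % 7
draft  = lambda s: (sum(s) + 1) % 7 if len(s) % 3 else sum(s) % 7
out = speculative_decode(target, draft, prompt=[1, 2], k=4, max_new=6)
print(out)
```

Under greedy decoding the output is identical to running the target model alone; the saving is that several target-model checks replace several sequential target-model generations.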
Transformers can be adapted for modalities beyond text, such as computer vision and speech recognition. Vision transformers break down input images into patches and treat them like tokens in a standard transformer. Conformer and Whisper follow a similar pattern for speech recognition, turning the speech signal into a spectrogram and treating it like an image. Perceivers can learn from large amounts of heterogeneous data, while diffusion transformers facilitate the use of the transformer architecture for diffusion-based image production.
Key advancements in the field include the development of FlashAttention-2, which achieves up to 230 TFLOPs/s on A100 GPUs, and the introduction of Random Feature Attention and Performer, which use Fourier random features to compute attention matrices in linear time. Additionally, speculative decoding has been proposed as a method to use spare compute power by computing several tokens in parallel, allowing for more efficient generation of tokens.
In conclusion, the Transformer architecture has seen significant advancements in recent years, particularly in terms of efficiency, adaptability to different modalities, and improvements in performance. These advancements have the potential to further enhance the capabilities of transformers in various applications, including natural language processing, computer vision, and speech recognition.