Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
1 Introduction

Recurrent neural networks, long short-term memory [12] and gated recurrent [7] neural networks in particular, have been firmly established as state-of-the-art approaches in sequence modeling and transduction problems such as language modeling and machine translation [29, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [31, 21, 13].
∗Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.

†Work performed while at Google Brain.

‡Work performed while at Google Research.
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [18] and conditional computation [26], while also improving model performance in the case of the latter. The fundamental constraint of sequential computation, however, remains.
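To make this bottleneck concrete, the following minimal sketch (in Python/NumPy, not code from the paper; the tanh cell and the weight names W_h, W_x, b are illustrative assumptions) computes $h_t = f(h_{t-1}, x_t)$ with an explicit loop. Each step depends on the previous hidden state, so the positions of a single sequence cannot be processed in parallel.

```python
import numpy as np

def rnn_forward(inputs, W_h, W_x, b):
    """Minimal vanilla-RNN forward pass (illustrative sketch only).

    Each hidden state h_t depends on h_{t-1}, so the loop over positions
    is strictly sequential within a single training example.
    """
    hidden_size = W_h.shape[0]
    h = np.zeros(hidden_size)
    states = []
    for x_t in inputs:                        # sequential over positions t
        h = np.tanh(W_h @ h + W_x @ x_t + b)  # h_t = f(h_{t-1}, x_t)
        states.append(h)
    return np.stack(states)

# Toy usage: a sequence of 5 input vectors of dimension 3, hidden size 4.
rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 3))
W_h = rng.normal(size=(4, 4)) * 0.1
W_x = rng.normal(size=(4, 3)) * 0.1
b = np.zeros(4)
print(rnn_forward(seq, W_h, W_x, b).shape)  # (5, 4)
```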
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 16]. In all but a few cases [22], however, such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
2 Background
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [20], ByteNet [15] and ConvS2S [8], all of which use convolutional neural networks as a basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [11]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in Section 3.2.
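As a rough illustration of this scaling argument (the kernel widths and layer counts below are placeholder assumptions, not measurements of any implementation), one can count how many stacked operations are needed to connect two positions a given distance apart under each scheme:

```python
import math

def ops_to_relate(distance, model):
    """Illustrative count of stacked operations needed to relate two
    positions `distance` apart; constants are placeholders for the sketch."""
    if model == "conv_s2s":      # stacked convolutions, kernel width k: linear in distance
        k = 3
        return math.ceil(distance / (k - 1))
    if model == "bytenet":       # dilated convolutions: logarithmic in distance
        k = 3
        return max(1, math.ceil(math.log(distance, k)))
    if model == "transformer":   # self-attention relates any two positions in one step
        return 1
    raise ValueError(model)

for d in (4, 64, 1024):
    print(d, [ops_to_relate(d, m) for m in ("conv_s2s", "bytenet", "transformer")])
```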
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 22, 23, 19].
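For concreteness, here is a minimal single-head sketch in NumPy (the projection matrices W_q, W_k, W_v and the toy dimensions are illustrative assumptions): queries, keys, and values are all projections of the same sequence, and each output position is an attention-weighted average over every position of that sequence. The scaled dot-product form used here is the one defined in Section 3.2.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over one sequence
    (minimal sketch). Every output position attends to all positions of
    the same sequence, so any two positions are related in one step."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v             # project the sequence itself
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise compatibility, shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                              # shape (n, d_v)

# Toy usage: sequence of 5 tokens, model dimension 8, head dimension 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 4)
```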
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [28].
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [14, 15] and [8].