R-Transformer: Recurrent Neural Network Enhanced Transformer, Wang et al., 2019

Paper, Tags: #nlp, #architectures

Previous work

RNNs have two main problems:

  • They struggle to capture very long-term dependencies because of vanishing and exploding gradients.
  • Because they are sequential, the computation is difficult, if not impossible, to parallelize, which increases training time.

Transformer, Vaswani et al., 2017 and TCN, Bai et al., 2018:

  • non-recurrent sequence models
  • built on convolution operations (TCN) and the attention mechanism (Transformer)
  • allow information to flow across arbitrary distances without intermediate loss
  • multi-head attention: every position is directly connected to every other position in a sequence
  • capture long-term dependencies

Issues with the multi-head attention mechanism:

  • it ignores local structures in sequences
  • it loses the sequential information of positions
  • it relies heavily on position embeddings
    • which have limited effect
    • and require design effort

R-Transformer

Novel sequence learning model: a multi-layered architecture built on RNNs and the standard Transformer, which has the advantages of both and mitigates their drawbacks. It captures both local structures and global long-term dependencies in sequences without any position embeddings. Code is on GitHub.

The model consists of a stack of identical layers, each layer having 3 components organized hierarchically: a LocalRNN, multi-head attention and a position-wise feed-forward network.

1. LocalRNN

The Local Recurrent Neural Network (LocalRNN) reorganizes the original long sequence into many short sequences that contain only local information. It constructs a local window of fixed size for each target position.

Positions in each local window form a short sequence, from which the RNN learns a latent representation.

Locality in the sequence is thus captured explicitly. Because the local window slides along the sequence position by position, global sequential information is also incorporated. Short-term dependencies are captured, but not long-term ones.

The LocalRNN takes an input sequence and outputs a sequence of hidden representations that incorporate information of local regions.

Unlike the position embeddings used in Transformer and TCN, the LocalRNN fully captures the sequential information within each window.

It works similarly to a 1-D CNN, where each local window is processed by convolution operations; however, the convolution completely ignores the sequential information of positions within local windows.
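
The code in the paper's repository is the reference; below is only a minimal PyTorch-style sketch of the windowing idea. The class name `LocalRNN`, the zero left-padding, the GRU cell, and all sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LocalRNN(nn.Module):
    """Sketch: run an RNN over a fixed-size local window ending at each position."""

    def __init__(self, d_model, window_size):
        super().__init__()
        self.window_size = window_size
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        # Left-pad with zeros so the earliest positions also get a full window.
        pad = x.new_zeros(batch, self.window_size - 1, d_model)
        x_padded = torch.cat([pad, x], dim=1)
        # For every target position, take the window of its last `window_size` positions.
        windows = x_padded.unfold(1, self.window_size, 1)   # (batch, seq_len, d_model, window)
        windows = windows.permute(0, 1, 3, 2).reshape(batch * seq_len, self.window_size, d_model)
        # The RNN's final hidden state summarizes each short local sequence.
        _, h_n = self.rnn(windows)                           # (1, batch*seq_len, d_model)
        return h_n.squeeze(0).view(batch, seq_len, d_model)

local_rnn = LocalRNN(d_model=128, window_size=7)
h = local_rnn(torch.randn(2, 50, 128))                       # (2, 50, 128)
```

All windows are fed to the RNN in parallel as one large batch, which is what keeps the recurrence short (at most `window_size` steps) and easy to parallelize.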

2. Multi-head attention

A sub-layer on top of the LocalRNN that captures global long-term dependencies. It plays a role similar to the pooling operation in CNNs.

Multi-head attention is effective at learning long-term dependencies because it allows a direct connection between every pair of positions. Each position attends to all positions in the past and obtains a set of attention scores that are used to refine its representation.
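
A rough sketch of this sub-layer using PyTorch's built-in `nn.MultiheadAttention` with a causal mask, so each position attends only to itself and earlier positions; the helper name and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def causal_multihead_attention(h, mha):
    """Each position attends to itself and all earlier positions.

    h:   (batch, seq_len, d_model), e.g. the LocalRNN outputs
    mha: an nn.MultiheadAttention module created with batch_first=True
    """
    seq_len = h.size(1)
    # Boolean mask: True above the diagonal blocks attention to future positions.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=h.device), diagonal=1)
    out, _ = mha(h, h, h, attn_mask=mask)
    return out

# Illustrative sizes: 8 heads over 128-dimensional representations.
mha = nn.MultiheadAttention(embed_dim=128, num_heads=8, batch_first=True)
u = causal_multihead_attention(torch.randn(2, 50, 128), mha)   # (2, 50, 128)
```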

3. Position-wise fully connected feed-forward network sub-layer

It is applied to each position independently and identically, and transforms the features non-linearly.
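
As a reminder of what "position-wise" means in code, a tiny sketch with illustrative dimensions: `nn.Linear` acts on the last dimension, so the same two-layer MLP is applied to every position of a `(batch, seq_len, d_model)` tensor.

```python
import torch
import torch.nn as nn

# Position-wise feed-forward sub-layer: the same two-layer MLP is applied to
# each position independently and identically (128 and 512 are illustrative).
feed_forward = nn.Sequential(
    nn.Linear(128, 512),
    nn.ReLU(),
    nn.Linear(512, 128),
)

u = torch.randn(2, 50, 128)     # (batch, seq_len, d_model)
out = feed_forward(u)           # same shape; every position transformed the same way
```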

Overall architecture

  1. LocalRNN
  2. Layer normalization
  3. Multi-head attention
  4. Layer normalization
  5. Feed-forward network
  6. Layer normalization
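
Putting the three sub-layers together, a minimal sketch of one layer, reusing the `LocalRNN` sketch from above. The residual connections around each sub-layer are an assumption borrowed from the standard Transformer; the notes above only list the sub-layers and layer normalizations.

```python
import torch
import torch.nn as nn

class RTransformerBlock(nn.Module):
    """One layer: LocalRNN -> LayerNorm -> multi-head attention -> LayerNorm
    -> feed-forward -> LayerNorm (residual connections are assumed)."""

    def __init__(self, d_model=128, n_heads=8, window_size=7, d_ff=512):
        super().__init__()
        self.local_rnn = LocalRNN(d_model, window_size)      # sketch from section 1
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x):
        # 1-2. LocalRNN captures local structure, then layer normalization.
        h = self.norm1(x + self.local_rnn(x))
        # 3-4. Causal multi-head attention captures long-term dependencies.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        u = self.norm2(h + a)
        # 5-6. Position-wise feed-forward, then layer normalization.
        return self.norm3(u + self.ff(u))
```

The full model stacks several such identical layers, and no position embeddings are added at any point.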

Comparison with TCN

Sequential information

  • TCN: the locality in sequences is captured by convolution filters, but the sequential information is ignored
  • R-Transformer: the LocalRNN fully incorporates it through the sequential nature of the RNN

Long-term dependencies

  • TCN: uses dilated convolutions that operate on nonconsecutive positions, so it misses a considerable amount of information from a large portion of positions in each layer
  • R-Transformer: the attention considers every past position and takes much more information into account

Comparison with Transformer

Similar long-term memorization capacities.

Locality in sequences

  • Transformer: captures locality very vaguely with multi-head attention that operates on all positions
  • R-Transformer: explicitly and effectively captures the locality in sequences with LocalRNN

Sequential information

  • Transformer: needs position embeddings
    • which have limited effect
    • and require effort to design effectively
  • R-Transformer: doesn't need position embeddings

Experiments

1. Pixel-by-pixel MNIST

Take a 28x28x1 image and convert it to a 784x1 sequence. This reshaping puts pixels that are adjacent in the original image far apart from each other in the sequence, so the sequence models need to learn very long-term dependencies.
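
A tiny illustration of this transformation (row-major flattening is an assumption):

```python
import torch

image = torch.rand(28, 28)          # one grayscale MNIST image
sequence = image.reshape(784, 1)    # length-784 sequence of single pixel values
# Pixels that are vertically adjacent in the image end up 28 steps apart in the
# sequence, so the model has to capture dependencies over long distances.
```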

  • RNNs obtain poor results because they cannot memorize very long-term dependencies
  • Transformer and TCN obtain much better results
  • TCN is better than Transformer, presumably because the Transformer cannot model locality very well
  • R-Transformer is better than TCN

2. Polyphonic music modeling

Includes British and American folk tunes

  • LSTM and TCN perform better than Transformer, presumably because music tunes exhibit strong local structures
  • R-Transformer performs better than TCN because TCN ignores the sequential information within local structures

3. PennTreebank, language modeling

Character-level and word-level language modeling tasks, on a corpus of about 1 million words that is widely used to investigate sequence models.

  • Character-level language modeling: the model is required to predict the next character given a context
  • Word-level language modeling: models are required to predict the next word given the contextual words

Comparison

  • RNN: worse results
  • Transformer: slightly better, similar to the music case; language also exhibits strong local structures
  • TCN: better than Transformer
  • R-Transformer: better than TCN
  • LSTM: best results