Single Headed Attention RNN: Stop Thinking With Your Head, Merity, 2019

Paper, Repo, Tags: #nlp, #architectures

We present a single headed attention RNN, SHA-RNN, trained without hyperparameter optimization in 24h on a single GPU. The attention mechanism is readily extended to large contexts with minimal computation.

We need a Moore’s Law for machine learning that encourages a minicomputer future, not a mainframe one.

Model architecture

An upgrade of the AWD-LSTM from Merity et al., 2018:

  • Trainable embedding layer
  • One or more stacked single head attention recurrent (SHA-RNN) layers, each with:
    • Single head attention
    • A modified feedforward layer ("Boom" layer), similar to the feedforward layer in Transformers
  • Softmax classifier

The attention layer assumes no sequentiality in its construction. A single attention head is used because a) we don't know how many heads we're supposed to use, and b) why are we putting so much work into the attention layer? This simplified version requires no big matrix multiplications and makes memory management easier.
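Below is a minimal sketch, in PyTorch, of what such a single headed, causally masked attention layer could look like. The class name, the learned projection for the query only, and the reuse of the raw inputs as keys and values are my assumptions for illustration, not the paper's exact implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    """Single headed, causally masked scaled dot-product attention (sketch).

    Only the query uses a learned projection; keys and values reuse the
    input directly, so there are no large per-head weight matrices.
    """
    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), e.g. the outputs of the recurrent core
        q = self.query(x)
        scores = torch.bmm(q, x.transpose(1, 2)) * self.scale  # (B, T, T)
        # Causal mask: position t may only attend to positions <= t.
        T = x.size(1)
        mask = torch.ones(T, T, device=x.device).triu(1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return torch.bmm(attn, x)  # values are the (unprojected) inputs
```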

The Boom layer takes a vector, uses a matrix multiplication with a GeLU activation to produce a bigger vector, then breaks that vector into N chunks and sums them together, yielding a vector of the same size as the input.
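A sketch of the Boom layer as described above; the default expansion factor of 4 is an assumption for illustration, not a value taken from the paper.

```python
import torch
import torch.nn as nn

class Boom(nn.Module):
    """Boom feedforward layer (sketch).

    Project up by `expansion` with a GeLU, then split the wide vector
    into `expansion` chunks of the original size and sum them.
    """
    def __init__(self, d_model: int, expansion: int = 4):
        super().__init__()
        self.up = nn.Linear(d_model, d_model * expansion)
        self.act = nn.GELU()
        self.expansion = expansion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_model)
        z = self.act(self.up(x))                  # (..., d_model * expansion)
        chunks = z.chunk(self.expansion, dim=-1)  # expansion x (..., d_model)
        return torch.stack(chunks, dim=-1).sum(dim=-1)  # back to (..., d_model)
```

The chunk-and-sum step maps the wide vector back to the model dimension without a second large projection matrix, saving both parameters and computation.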

Experiments

The work has been conducted on the enwik8 dataset, a byte-level dataset consisting of the first 100M bytes of a Wikipedia XML dump. There is no need for preprocessing, and the inclusion of Wikipedia markup has linguistic advantages.
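A purely illustrative snippet for loading enwik8 at the byte level; the file path and the conventional 90M/5M/5M train/validation/test split are assumptions about the usual enwik8 protocol, not details quoted from this write-up.

```python
import numpy as np

# Read the raw dump as bytes; every byte is a token.
with open("enwik8", "rb") as f:  # hypothetical path to the unzipped dump
    data = np.frombuffer(f.read(100_000_000), dtype=np.uint8)

# Conventional split: first 90M bytes train, next 5M validation, last 5M test.
train = data[:90_000_000]
valid = data[90_000_000:95_000_000]
test = data[95_000_000:]
print(len(train), len(valid), len(test))  # 90000000 5000000 5000000
```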

The WikiText-2 and WikiText-103 datasets (Merity et al., 2017) are also used; they contain lightly processed Wikipedia articles with a closed vocabulary, comprising 2M and 103M words respectively.

Results and analysis

SHA-RNN on enwik8

No state-of-the-art results.

LSTMs are forced to have their hidden state expressed primarily through recurrence, limiting their expressiveness.

What if we had run this single headed attention experiment two years earlier? What if we as a field had spent our engineering efforts and experiment tokens on fast layer normalized block sparse LSTMs?

Discussion

When comparing language models trained on datasets with differing tokenizations, a conversion process has traditionally been used. Some papers have retokenized word level datasets, such as WikiText-103, using an invertible wordpiece tokenizer. The resulting perplexities are then renormalized according to the number of tokens in each test dataset.

There's a formula to convert a bits per character (BPC) model to word level perplexity. This formula assumes equal entropy across the constituent tokens and doesn't factor in information introduced during the language modelling process.
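In its usual form (my rendering of the standard conversion, not a quote from the paper), with $N_c$ characters and $N_w$ words in the test set,

$$
\mathrm{PPL}_{\mathrm{word}} = 2^{\,\mathrm{BPC} \cdot N_c / N_w}
$$

i.e. the average bits per word is taken to be BPC times the average word length in characters, which is exactly the equal-entropy assumption noted above.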

The distribution of entropy over characters is known to be non-uniform. Experiments have traditionally shown entropy concentrated on the first few characters of traditional word segmentations in character level models.

This non-uniformity is exacerbated by modern components such as the attention mechanism, which can effectively copy the occurrence of previously unknown sequences when correctly context switched. This rewards models which either allow for an early context switch to occur (i.e. at the first character of a token) or which break high entropy tokens into a sequence of lower entropy tokens.

For the task of language modeling I submit the SHA-RNN as an antidote to repeated and excessive Transformer exposure. Perhaps we were too quick to throw away the past era of models simply due to a new flurry of progress.

SHA-RNN achieves strong results with next to no hyperparameter tuning. Even if this architecture doesn't catch on, it still serves to show that the interaction between model performance and attention heads is not as clear as we might have guessed.