You Only Cache Once: Decoder-Decoder Architectures for Large Language Models

Approach

Performance

Harness Eval

Training with 1T Tokens:

| Model | Arc-c | Arc-e | BoolQ | Hellaswag | OBQA | PIQA | Winogrande | SciQ | Avg |
|---|---|---|---|---|---|---|---|---|---|
| OpenLLaMA-3B-v2 | 0.339 | 0.676 | 0.657 | 0.700 | 0.260 | 0.767 | 0.629 | 0.924 | 0.619 |
| StableLM-base-alpha-3B-v2 | 0.324 | 0.673 | 0.646 | 0.686 | 0.264 | 0.760 | 0.621 | 0.921 | 0.612 |
| StableLM-3B-4E1T | --- | 0.666 | --- | --- | --- | 0.768 | 0.632 | 0.914 | --- |
| YOCO-3B | 0.379 | 0.731 | 0.645 | 0.689 | 0.298 | 0.763 | 0.639 | 0.924 | 0.634 |

Training with 1.6T Tokens:

| Model | Arc-c | Arc-e | BoolQ | Hellaswag | OBQA | PIQA | Winogrande | SciQ | Avg |
|---|---|---|---|---|---|---|---|---|---|
| StableLM-3B-4E1T | --- | 0.688 | --- | --- | --- | 0.762 | 0.627 | 0.913 | --- |
| YOCO-3B | 0.396 | 0.733 | 0.644 | 0.698 | 0.300 | 0.764 | 0.631 | 0.921 | 0.636 |
| YOCO-3B-1M | 0.413 | 0.747 | 0.638 | 0.705 | 0.300 | 0.773 | 0.651 | 0.932 | 0.645 |

Needle In A Haystack

Multi-Needle Eval

| Model | Size | N=1 | N=2 | N=4 | N=8 |
|---|---|---|---|---|---|
| GPT-4-128K | -- | 1.00 | 1.00 | 0.98 | 1.00 |
| MiniCPM-128K | 2.4B | 1.00 | 1.00 | 0.54 | 0.56 |
| ChatGLM3-128K | 6B | 0.94 | 0.72 | 0.52 | 0.44 |
| YaRN-Mistral-128K | 7B | 0.02 | 0.12 | 0.08 | 0.20 |
| LWM-1M-text | 7B | 1.00 | 0.90 | 0.76 | 0.62 |
| YOCO-3B-1M | 3B | 0.98 | 0.98 | 0.84 | 0.56 |

Setup

To install the required packages, use the following command:

pip install -r requirements.txt

Besides the packages above, Apex and Flash-Attention should be installed separately by following their official installation guides.

Harness Eval

To evaluate models with Harness-Eval, use the script in scripts/eval_task.sh:

cd fairseq/
TASK='harness_boolq'

torchrun --master-port=29505 --nproc_per_node=1 validate.py \
    --data-dir ../harness_data/ \
    --criterion harness_eval \
    --task harness_eval \
    --batch-size 4 \
    --eval-data ${TASK}  \
    --log-format simple  --log-interval 10 \
    --bf16 \
    --tokenizer-pad-to-multiple 8 \
    --arch yoco_3b_new --tiktoken-model cl100k_base \
    --load-ckpt /path_to_ckpt/YOCO-3B-1M/checkpoint.pth \
    --yoco-model /path_to_ckpt/YOCO-3B-1M \
    --tokens-per-sample 4096

Needle In A Haystack Evaluation

Our model uses city-number pairs for long-sequence evaluation. To get results at a given maximal length, use the script in scripts/eval_needle.sh:

cd fairseq/
torchrun --master-port=29504 --nproc_per_node=1 validate.py \
    --task pseudo \
    --criterion needle_haystack \
    --batch-size 1 \
    --max-epoch 1 \
    --no-save \
    --tiktoken-model cl100k_base \
    --bf16 \
    --arch yoco_3b_new \
    --load-ckpt /path_to_ckpt/YOCO-3B-1M/checkpoint.pth \
    --yoco-model /path_to_ckpt/YOCO-3B-1M \
    --tokens-per-sample 1048576 --interval 1048576

To run Multi-Needle experiments, replace --criterion needle_haystack with --criterion multi_needle --needle-num {num}.

Pretraining From Scratch

To support distributed training, our implementation reads data iteratively with infinibatch. The overall data directory should be organized as follows:

Data/
├── json/
│   ├── train.json
│   └── CC.json
│   └── StarCoder.json
│   └── ...
├── shard/
│   ├── CC/
│   │   ├── 00000.jsonl
│   │   ├── 00001.jsonl
│   │   └── ...
│   └── StarCoder/
│       ├── 00000.jsonl
│       ├── 00001.jsonl
│       └── ...

We recommend that each sharded data file contain no more than 10K lines, with one JSON dict per line. Each jsonl file, such as Data/shard/CC/00000.jsonl, should look like this (a small sketch for producing such shards follows the example):

{"text": "File 1 is here..."}
{"text": "File 2 is here..."}
...
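
As a reference, here is a minimal Python sketch (not part of this repository) for packing raw text files into shards of at most 10K lines; the raw-corpus path, the one-document-per-.txt-file assumption, and the shard naming are placeholders chosen to match the layout above:

import json
from pathlib import Path

# Placeholder paths chosen to match the directory layout above.
RAW_DIR = Path("/path_to_data/raw/CC")          # assumption: one document per .txt file
SHARD_DIR = Path("/path_to_data/Data/shard/CC")
MAX_LINES = 10_000                              # recommended upper bound per shard

def write_shards():
    SHARD_DIR.mkdir(parents=True, exist_ok=True)
    shard_id, line_count = 0, 0
    out = open(SHARD_DIR / f"{shard_id:05d}.jsonl", "w", encoding="utf-8")
    for txt_file in sorted(RAW_DIR.glob("*.txt")):
        # Rotate to a new shard once the current one reaches MAX_LINES lines.
        if line_count >= MAX_LINES:
            out.close()
            shard_id, line_count = shard_id + 1, 0
            out = open(SHARD_DIR / f"{shard_id:05d}.jsonl", "w", encoding="utf-8")
        out.write(json.dumps({"text": txt_file.read_text(encoding="utf-8")}) + "\n")
        line_count += 1
    out.close()

if __name__ == "__main__":
    write_shards()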

Then, for each source, a JSON file lists the paths of all its jsonl files. Take Data/json/CC.json as an example:

[
    "/path_to_data/Data/shard/CC/00000.jsonl",
    "/path_to_data/Data/shard/CC/00001.jsonl",
    ...
]
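
Such an index does not have to be written by hand. A minimal sketch, assuming the Data/ layout shown above (the root path is a placeholder):

import glob
import json

# Placeholder root path following the directory layout above.
source = "CC"
shard_paths = sorted(glob.glob(f"/path_to_data/Data/shard/{source}/*.jsonl"))

with open(f"/path_to_data/Data/json/{source}.json", "w") as f:
    json.dump(shard_paths, f, indent=4)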

Finally, train.json records the name and sampling weight of every source:

[
    {
        "name": "CC",
        "weight": 0.5
    },
    {
        "name": "StarCoder",
        "weight": 0.2
    },
    ...
]
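
The weight field is the relative sampling ratio of each source. As a convenience, the sketch below (again an assumption, not repository code) writes train.json and checks that every listed source has a matching per-source index file; the names and weights are the example values from above:

import json
import os

DATA_ROOT = "/path_to_data/Data"  # placeholder root, matching the layout above

# Example sources and sampling weights, copied from the snippet above.
sources = [
    {"name": "CC", "weight": 0.5},
    {"name": "StarCoder", "weight": 0.2},
]

# Sanity check: every source in train.json should have a per-source index file.
for src in sources:
    index_path = os.path.join(DATA_ROOT, "json", f"{src['name']}.json")
    assert os.path.exists(index_path), f"missing index file: {index_path}"

with open(os.path.join(DATA_ROOT, "json", "train.json"), "w") as f:
    json.dump(sources, f, indent=4)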

To launch pretraining, run the script in scripts/train.sh:

cd fairseq/
torchrun --nproc-per-node=1 train.py /path_to_data \
    --save-interval-updates 5000 \
    --no-epoch-checkpoints \
    --arch yoco_base \
    --criterion cross_entropy \
    --task gpt \
    --tokens-per-sample 2048 \
    --tokenizer-pad-to-multiple 8 \
    --pad-to-max-len \
    --optimizer adam --adam-betas "(0.9, 0.95)" \
    --adam-eps 1e-06 \
    --clip-norm 2.0 \
    --lr 0.00015 \
    --lr-scheduler polynomial_decay \
    --warmup-updates 50 \
    --weight-decay 0.05 \
    --batch-size 1  \
    --model-parallel-size 1 \
    --update-freq 1 \
    --batch-read-ahead 1000 \
    --total-num-update 300000 \
    --log-format simple --log-interval 10 \
    --disable-validation \
    --tiktoken-model cl100k_base \
    --bf16 # bf16 is encouraged in pre-training