You Only Cache Once: Decoder-Decoder Architectures for Large Language Models

Approach

Performance

Harness Eval

Training with 1T Tokens:

| Model | Arc-c | Arc-e | BoolQ | Hellaswag | OBQA | PIQA | Winogrande | SciQ | Avg |
|---|---|---|---|---|---|---|---|---|---|
| OpenLLaMA-3B-v2 | 0.339 | 0.676 | 0.657 | 0.700 | 0.260 | 0.767 | 0.629 | 0.924 | 0.619 |
| StableLM-base-alpha-3B-v2 | 0.324 | 0.673 | 0.646 | 0.686 | 0.264 | 0.760 | 0.621 | 0.921 | 0.612 |
| StableLM-3B-4E1T | --- | 0.666 | --- | --- | --- | 0.768 | 0.632 | 0.914 | --- |
| YOCO-3B | 0.379 | 0.731 | 0.645 | 0.689 | 0.298 | 0.763 | 0.639 | 0.924 | 0.634 |

Training with 1.6T Tokens:

| Model | Arc-c | Arc-e | BoolQ | Hellaswag | OBQA | PIQA | Winogrande | SciQ | Avg |
|---|---|---|---|---|---|---|---|---|---|
| StableLM-3B-4E1T | --- | 0.688 | --- | --- | --- | 0.762 | 0.627 | 0.913 | --- |
| YOCO-3B | 0.396 | 0.733 | 0.644 | 0.698 | 0.300 | 0.764 | 0.631 | 0.921 | 0.636 |
| YOCO-3B-1M | 0.413 | 0.747 | 0.638 | 0.705 | 0.300 | 0.773 | 0.651 | 0.932 | 0.645 |

Needle In A Haystack

Multi-Needle Eval

| Model | Size | N=1 | N=2 | N=4 | N=8 |
|---|---|---|---|---|---|
| GPT-4-128K | -- | 1.00 | 1.00 | 0.98 | 1.00 |
| MiniCPM-128K | 2.4B | 1.00 | 1.00 | 0.54 | 0.56 |
| ChatGLM3-128K | 6B | 0.94 | 0.72 | 0.52 | 0.44 |
| YaRN-Mistral-128K | 7B | 0.02 | 0.12 | 0.08 | 0.20 |
| LWM-1M-text | 7B | 1.00 | 0.90 | 0.76 | 0.62 |
| YOCO-3B-1M | 3B | 0.98 | 0.98 | 0.84 | 0.56 |

Setup

To install the required packages, use the following command:

pip install -r requirements.txt

Besides the packages above, Apex and Flash-Attention should be installed separately by following their official installation guides.

Harness Eval

To evaluate models with Harness-Eval, use the script in scripts/eval_task.sh:

cd fairseq/
TASK='harness_boolq'

torchrun --master-port=29505 --nproc_per_node=1 validate.py \
    --data-dir ../harness_data/ \
    --criterion harness_eval \
    --task harness_eval \
    --batch-size 4 \
    --eval-data ${TASK}  \
    --log-format simple  --log-interval 10 \
    --bf16 \
    --tokenizer-pad-to-multiple 8 \
    --arch yoco_3b_new --tiktoken-model cl100k_base \
    --load-ckpt /path_to_ckpt/YOCO-3B-1M/checkpoint.pth \
    --yoco-model /path_to_ckpt/YOCO-3B-1M \
    --tokens-per-sample 4096

Needle In A Haystack Evaluation

Our model uses city-number pairs for long-sequence evaluation. To get results at a given maximal length, use the script in scripts/eval_needle.sh:

cd fairseq/
torchrun --master-port=29504 --nproc_per_node=1 validate.py \
    --task pseudo \
    --criterion needle_haystack \
    --batch-size 1 \
    --max-epoch 1 \
    --no-save \
    --tiktoken-model cl100k_base \
    --bf16 \
    --arch yoco_3b_new \
    --load-ckpt /path_to_ckpt/YOCO-3B-1M/checkpoint.pth \
    --yoco-model /path_to_ckpt/YOCO-3B-1M \
    --tokens-per-sample 1048576 --interval 1048576

To run Multi-Needle experiments, replace --criterion needle_haystack with --criterion multi_needle --needle-num {num}.

Pretraining From Scratch

To support distributed training, our implementation reads data iteratively with infinibatch. The overall data directory should be organized as follows:

Data/
├── json/
│   ├── train.json
│   └── CC.json
│   └── StarCoder.json
│   └── ...
├── shard/
│   ├── CC/
│   │   ├── 00000.jsonl
│   │   ├── 00001.jsonl
│   │   └── ...
│   └── StarCoder/
│       ├── 00000.jsonl
│       ├── 00001.jsonl
│       └── ...

We recommend that each sharded data file contain no more than 10K lines, with one JSON dict per line. Each jsonl file, such as Data/shard/CC/00000.jsonl, should look like this (a small sketch for producing such shards follows the example):

{"text": "File 1 is here..."}
{"text": "File 2 is here..."}
...
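
As a reference, here is a minimal Python sketch (not part of this repository) for packing raw text files into shards of at most 10K lines; the raw-corpus path, the one-document-per-.txt-file assumption, and the shard naming are placeholders chosen to match the layout above:

import json
from pathlib import Path

# Placeholder paths chosen to match the directory layout above.
RAW_DIR = Path("/path_to_data/raw/CC")          # assumption: one document per .txt file
SHARD_DIR = Path("/path_to_data/Data/shard/CC")
MAX_LINES = 10_000                              # recommended upper bound per shard

def write_shards():
    SHARD_DIR.mkdir(parents=True, exist_ok=True)
    shard_id, line_count = 0, 0
    out = open(SHARD_DIR / f"{shard_id:05d}.jsonl", "w", encoding="utf-8")
    for txt_file in sorted(RAW_DIR.glob("*.txt")):
        # Rotate to a new shard once the current one reaches MAX_LINES lines.
        if line_count >= MAX_LINES:
            out.close()
            shard_id, line_count = shard_id + 1, 0
            out = open(SHARD_DIR / f"{shard_id:05d}.jsonl", "w", encoding="utf-8")
        out.write(json.dumps({"text": txt_file.read_text(encoding="utf-8")}) + "\n")
        line_count += 1
    out.close()

if __name__ == "__main__":
    write_shards()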

Then, for each source, a JSON file lists the paths of all its jsonl files. Take Data/json/CC.json as an example:

[
    "/path_to_data/Data/shard/CC/00000.jsonl",
    "/path_to_data/Data/shard/CC/00001.jsonl",
    ...
]
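
Such an index does not have to be written by hand. A minimal sketch, assuming the Data/ layout shown above (the root path is a placeholder):

import glob
import json

# Placeholder root path following the directory layout above.
source = "CC"
shard_paths = sorted(glob.glob(f"/path_to_data/Data/shard/{source}/*.jsonl"))

with open(f"/path_to_data/Data/json/{source}.json", "w") as f:
    json.dump(shard_paths, f, indent=4)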

Finally, train.json records the name and sampling weight of every source:

[
    {
        "name": "CC",
        "weight": 0.5
    },
    {
        "name": "StarCoder",
        "weight": 0.2
    },
    ...
]
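
The weight field is the relative sampling ratio of each source. As a convenience, the sketch below (again an assumption, not repository code) writes train.json and checks that every listed source has a matching per-source index file; the names and weights are the example values from above:

import json
import os

DATA_ROOT = "/path_to_data/Data"  # placeholder root, matching the layout above

# Example sources and sampling weights, copied from the snippet above.
sources = [
    {"name": "CC", "weight": 0.5},
    {"name": "StarCoder", "weight": 0.2},
]

# Sanity check: every source in train.json should have a per-source index file.
for src in sources:
    index_path = os.path.join(DATA_ROOT, "json", f"{src['name']}.json")
    assert os.path.exists(index_path), f"missing index file: {index_path}"

with open(os.path.join(DATA_ROOT, "json", "train.json"), "w") as f:
    json.dump(sources, f, indent=4)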

To launch pretraining, run the script in scripts/train.sh:

cd fairseq/
torchrun --nproc-per-node=1 train.py /path_to_data \
    --save-interval-updates 5000 \
    --no-epoch-checkpoints \
    --arch yoco_base \
    --criterion cross_entropy \
    --task gpt \
    --tokens-per-sample 2048 \
    --tokenizer-pad-to-multiple 8 \
    --pad-to-max-len \
    --optimizer adam --adam-betas "(0.9, 0.95)" \
    --adam-eps 1e-06 \
    --clip-norm 2.0 \
    --lr 0.00015 \
    --lr-scheduler polynomial_decay \
    --warmup-updates 50 \
    --weight-decay 0.05 \
    --batch-size 1  \
    --model-parallel-size 1 \
    --update-freq 1 \
    --batch-read-ahead 1000 \
    --total-num-update 300000 \
    --log-format simple --log-interval 10 \
    --disable-validation \
    --tiktoken-model cl100k_base \
    --bf16 # bf16 is encouraged in pre-training