9 changes: 5 additions & 4 deletions train/tr10-13B-ml/tr10-13B.slurm
@@ -46,7 +46,7 @@ GLOBAL_BATCH_SIZE=2048

NLAYERS=40
NHIDDEN=5120
NHEADS=32
Member Author:
I don't know why we chose 32. We seem to have updated the NHIDDEN value to 5120 because it was divisible by 128, and 5120 // 128 = 40.

https://huggingface.slack.com/archives/C01NHER1JLS/p1627034738272600?thread_ts=1626827659.189400&cid=C01NHER1JLS

cc @VictorSanh @stas00 @mryab (People who were involved in the original post)
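
For reference, a quick shell check of the divisibility arithmetic above (a sketch using only numbers already in this thread):

NHIDDEN=5120
echo $((NHIDDEN % 128))   # 0, i.e. divisible by 128
echo $((NHIDDEN / 128))   # 40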

Contributor:
FWIW, 530B training used:

NLAYERS=105
NHIDDEN=20480
NHEADS=128

So the same proportion as NHEADS=32 with NHIDDEN=5120.
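
Both configurations give the same per-head dimension, assuming the usual head_dim = NHIDDEN / NHEADS split:

echo $((5120 / 32))     # 160 (this 13B config)
echo $((20480 / 128))   # 160 (the 530B config)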

Contributor (@stas00, Nov 23, 2021):

Also, @TevenLeScao shared elsewhere a research paper showing that many heads turn out to be quite redundant anyway.

I'm not sure whether there is research comparing head size vs. number of heads in terms of performance.

NHEADS=40
SEQ_LEN=2048
VOCAB_SIZE=150000

@@ -57,13 +57,14 @@ OPTIMIZER_ARGS=" \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1e-8 \
--lr 6e-5 \
--lr 6e-4 \
--min-lr 6e-6 \
--lr-decay-style cosine \
--lr-decay-samples 126_953_125 \
Contributor:

You removed this one w/o any commentary?

Contributor:

The original tr1-13B said:

We need lr-decay in samples, so tokens2samples = 260B / 2048 = 126_953_125
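
A sketch of that conversion as shell arithmetic (260B tokens at SEQ_LEN=2048):

TRAIN_TOKENS=260000000000
SEQ_LEN=2048
echo $((TRAIN_TOKENS / SEQ_LEN))   # 126953125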

Member Author:

I was looking at setting it by default to the entire number of samples we have:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/fd1e1da967c74e598acfc011031474663ef5845e/megatron/training.py#L341

We have been using this in arch/scaling.

However, I've just re-read the GPT-3 paper and they decay over 260B ... so I'm not sure here. cc @TevenLeScao

Contributor:

Thank you for the note, Thomas. It's crucial that we leave a note trail; otherwise we have no idea why some config was added or removed.

--lr-warmup-samples 216_320 \
--clip-grad 1.0 \
--weight-decay 1e-1 \
--hidden-dropout 0.0 \
--attention-dropout 0.0 \
Comment on lines +69 to +70
Member Author:

https://arxiv.org/abs/2010.11934 showed a strong performance loss when using dropout (Table 4). Though that was an enc/dec architecture, there's probably no reason it would benefit our dec-only arch. We are currently evaluating this at the 1B3 scale: https://huggingface.co/bigscience/tr3o-1B3-pile-no-dropout-logs

"

EXIT_OPTS=" \
@@ -80,7 +81,7 @@ GPT_ARGS=" \
--micro-batch-size $MICRO_BATCH_SIZE \
--rampup-batch-size 16 16 6_000_000 \
--global-batch-size $GLOBAL_BATCH_SIZE \
--train-samples 300_000_000 \
--train-samples $((3000000000 / $SEQ_LEN + 1)) \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path $TOKENIZER_NAME \
--loss-scale 12 \
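
For reference, a sketch of what the new --train-samples expression expands to, using the token count exactly as written in the diff and SEQ_LEN=2048:

SEQ_LEN=2048
TRAIN_TOKENS=3000000000                  # literal from the diff line above
echo $((TRAIN_TOKENS / SEQ_LEN + 1))     # 1464844 (integer division, plus one)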