Llama3 conversion scripts 🦙 #174
@TJ-Solergibert, thanks for the PR. Have you tried continual pretraining or finetuning of a llama3 checkpoint converted to Nanotron? I encountered some exploding gradient issues in my experience (not in your PR).
Hi @xrsrke , After your comments about exploding gradient issues I've run the following:
So I haven't experienced any problem. Let me know if I should look into anything more!
Toni
PS: We could upload Nanotron Llama3 checkpoints to the Hub, right?
Nice PR. When loading llama3 from HF to nanotron, I had to change the rotary embedding (31c12e8), otherwise the generation was not good.
Hi, I just took care of the "training case". As you can see, there are 2 RotaryEmbedding layers: self.rotary_embedding & self.flash_rotary_embedding. The first one is only used in the "inference case", while the second is only used in the training case. The interleaved thing is just for the [...] For training, the [...]
src/nanotron/config/config.py
Outdated
run: Name of the run
step: Global step (updated when we save the checkpoint)
consumed_train_samples: Number of samples consumed during training (should be actually just step*batch_size)
ignore_sanity_checks: Whether to ignore sanity checks
"""
project: str
entity: Optional[str] = None
- entity: Optional[str] = None
+ wandb_entity: Optional[str] = None
run_train.py
Outdated
@@ -143,17 +143,17 @@ def get_dataloader_from_data_stage(
elif isinstance(data.dataset, NanosetDatasetsArgs):
# Get tokenizer cardinality
tokenizer = AutoTokenizer.from_pretrained(trainer.config.tokenizer.tokenizer_name_or_path)
token_dtype = np.int32 if len(tokenizer) > np.iinfo(np.uint16).max + 1 else np.uint16
token_size = 4 if len(tokenizer) > np.iinfo(np.uint16).max + 1 else 2
Could we add a config option to specify the byte size of a token instead of this?
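The selection rule in the diff above, combined with the suggested config option, could look roughly like the sketch below. It is only an illustration: the function name `token_size_bytes` and the `override` parameter are hypothetical and not part of Nanotron, and the threshold simply restates `np.iinfo(np.uint16).max + 1` (65536) in plain Python.

```python
from typing import Optional

UINT16_CAPACITY = 2**16  # same value as np.iinfo(np.uint16).max + 1


def token_size_bytes(vocab_size: int, override: Optional[int] = None) -> int:
    """Return the number of bytes used to store one token id.

    `override` stands in for a hypothetical config option. When it is None,
    the size is inferred from the vocabulary size, as in the diff above.
    """
    inferred = 4 if vocab_size > UINT16_CAPACITY else 2
    if override is None:
        return inferred
    if override not in (2, 4):
        raise ValueError("token size must be 2 or 4 bytes")
    if override < inferred:
        raise ValueError(f"{override} bytes cannot hold {vocab_size} distinct token ids")
    return override


print(token_size_bytes(32_000))              # inferred for a small vocabulary
print(token_size_bytes(128_256))             # inferred for a large vocabulary
print(token_size_bytes(32_000, override=4))  # explicit value from the config
```

An explicit option like this would still need the validation step, since a 2-byte setting with a vocabulary above 65536 entries would silently corrupt token ids.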
src/nanotron/data/collator.py
Outdated
@dataclasses.dataclass
class NanosetDataCollatorForCLM:
Why not reuse DataCollatorForCLM ?
src/nanotron/trainer.py
Outdated
@@ -276,7 +276,8 @@ def pre_training(self, *args, **kwargs):
if dist.get_rank(self.parallel_context.world_pg) == self.logger_ranks[0] and wandb is not None:
wandb.init(
project=self.config.general.project,
name=f"{current_time}_{self.config.general.run}",
name=f"{current_time}_{self.config.general.project}_{self.config.general.run}",
entity=self.config.general.entity,
If we change this to wandb_entity, don't forget to change it here as well.
eb68e41 to 3e169c5
Sorry, there were 68 commits that somehow ended up here, IDK how 😅. All your comments refer to those commits. The conflicts are related to the [...] Let me know if there is still any issue!
Add some instructions for downloading the weights?
da6281d to 0afd7b7
Hi @hjc3613, from my first comment:
Toni
Thanks for your explanation. Actually, I want to train llama 70B into a 1.58bit model using this method: https://github.com/huggingface/nanotron/pull/180/files
Hello,
In this PR, I include the scripts to convert the checkpoints of Llama3 8B & 70B to Nanotron. Although there are still some details to be polished, the current status is as follows:
All conversions are carried out in BFLOAT16 and on the CPU, but we will need at least one GPU because the ParallelContext requires it. The 8B model fits on a GPU with 80GB, but the 70B model does not. Even so, in ALL conversions, we will set DP=PP=TP=1. I have confirmed that Nanotron supports changing the TP topology, although while waiting for GPUs in my cluster, I developed a fancy script with broadcast, scatter, and gathers to perform the conversion with TP>1. I have also tried a dummy finetune with TP=2 from the TP=1 8B converted checkpoint to store it back with TP=2, checked the results in Nanotron (correct, results below), and then converted it back to HF with the result still being correct. I think I have tried all the possible cases.
Included
- convert_hf_to_nanotron.py to convert the weights from the HF checkpoint to Nanotron
- convert_nanotron_to_hf.py to convert the weights from the Nanotron checkpoint to HF
- generate_hf_predictions.py to test the logits of the HF model with a prompt
- generate_nanotron_predictions.py to test the logits of the Nanotron model with a prompt
Results & Precision
It is impossible for the two models (HF & Nanotron) to produce exactly the same logits with a level of precision capable of passing the assert_close test. This is true both at the model level and at the layer level because, despite having the same parameters, the two models perform different operations. Different in the sense of...
- [...] qkv_proj and the projections are computed with a single GEMM, whereas in the HF implementation, it is done in three (even in the Meta model, it is done the same way, although they also have TensorParallelLayers). By changing the shape of the matrices, the result is different because, in the GEMM, the order of operations is non-deterministic, and in reduced 16-bit types, the difference becomes more noticeable when accumulating the result. The same happens in the MLP layer with gate_up_proj.
- [...] TritonRMSNorm), which produce results that are not exactly the same as those of the HF implementation.
I have a (somewhat catastrophic) notebook where the differences at each operation level are evident. But what is really important is not so much the logits but the predictions and their order. To verify this, I developed the generate_XXXX.py scripts that, from the same prompt for the desired tokens, print the 10 most probable predictions and print an accuracy value for the whole sequence. I chose a fixed prompt to (1) manually check that the predictions make sense and (2) compare across the different converted models. The following table shows the accuracy results for different configurations.
It is worth noting that:
- [...] AutoModelForCausalLM.from_pretrained() there is NO tensor parallelism, while in Nanotron there is.
Details
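On the accumulation-order point from Results & Precision: the toy sketch below shows how the same terms, summed in a different order, give a different result once every intermediate value is squeezed into 16 bits. It is not how Nanotron or HF compute anything. `to_bf16` and `bf16_sum` are hypothetical helpers, and truncating a float32 to its top 16 bits is a simplification of bfloat16 rounding (real kernels usually round to nearest and often accumulate in higher precision).

```python
import struct


def to_bf16(x: float) -> float:
    """Truncate a value to bfloat16-like precision (top 16 bits of its float32 form)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def bf16_sum(values) -> float:
    """Sum left to right, truncating after every addition."""
    total = 0.0
    for v in values:
        total = to_bf16(total + to_bf16(v))
    return total


terms = [256.0, 1.0, 1.0, 1.0, 1.0]

# Large term first: 256 + 1 is not representable with 8 significant bits,
# so each small term is dropped as soon as it is added.
large_first = bf16_sum(terms)

# Small terms first: they accumulate to 4 before meeting the large term.
small_first = bf16_sum(reversed(terms))

print(large_first, small_first)  # 256.0 260.0
```

A fused projection and three separate projections are mathematically the same sum, but a kernel is free to group the partial products differently for each shape, which is the kind of reordering this example imitates.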
This PR is built with the #168 FA2 kernel, which is the same as in the HF implementation.
After extensive reverse engineering, I found a critical point that was significantly different from the HuggingFace implementation: RoPE. After numerous tests, even transferring the RoPE from the HF implementation, it turns out that there are 2 fundamental parameters of the FlashRotaryEmbedding layer:
- interleaved: The default value in Nanotron is True, but it must be False.
- rope_theta: The default value is 10000.0, but for Llama3, it is 500000.0.
I have included both values in LlamaConfig, with the OLD values as defaults, although I propose at least changing the interleaved default to False.
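To illustrate what these two parameters change, here is a minimal pure-Python sketch of rotary embeddings applied to a single head vector. It is not the FlashRotaryEmbedding code: `rope_angles` and `apply_rope` are hypothetical names, and the pairing convention is an assumption based on the usual meaning of the flag, where interleaved=True rotates adjacent (even, odd) dimensions and interleaved=False rotates dimension i together with dimension i + head_dim/2 (the half-split layout used by the HF Llama implementation).

```python
import math


def rope_angles(position: int, head_dim: int, theta: float) -> list:
    """One rotation angle per dimension pair: position * theta^(-2i / head_dim)."""
    return [position * theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(x, position, theta=500000.0, interleaved=False):
    """Rotate the pairs of `x` by their position-dependent angles."""
    head_dim = len(x)
    half = head_dim // 2
    out = list(x)
    for i, angle in enumerate(rope_angles(position, head_dim, theta)):
        # interleaved=True pairs (2i, 2i + 1); interleaved=False pairs (i, i + half)
        a, b = (2 * i, 2 * i + 1) if interleaved else (i, i + half)
        cos, sin = math.cos(angle), math.sin(angle)
        out[a] = x[a] * cos - x[b] * sin
        out[b] = x[a] * sin + x[b] * cos
    return out


x = [1.0, 2.0, 3.0, 4.0]
print(apply_rope(x, position=1, interleaved=True))
print(apply_rope(x, position=1, interleaved=False))
print(rope_angles(1, 4, 10000.0), rope_angles(1, 4, 500000.0))
```

At position 0 the vector is unchanged in both modes. At any other position the two modes produce different vectors from the same weights, so a checkpoint laid out for one convention will not behave correctly under the other. A larger rope_theta lowers the rotation frequency of every pair except the first.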
In essence, to perform the conversions, we initialize the two implementations (HuggingFace & Nanotron) and copy the parameters layer by layer. After trying several methods to copy the weights, I opted for the copy_ method, because this way we preserve the ShardedInfo & TiedInfo of all the NanotronParameters.
The conversion from HF to Nanotron is fast, taking 2 and 16 minutes for the 8B and 70B models respectively. However, the conversion from Nanotron to HF extends to 5 and 51 minutes respectively. This is due to the initialization of the HF model (AutoModelForCausalLM.from_config()).
When converting to Nanotron, we also store the tokenizer (as in the HF models) and generate a config.yaml with the basic configurations and parameters to start training from the checkpoint. Additionally, the conversions include assertions on all parameters to ensure that we are copying the parameters correctly and making the process as explicit as possible for the conversion of future models.
TODO
- torch.no_grad() in conversions
- log_rank of Nanotron was not working correctly
- Add push_to_hub flag in the Nanotron to HF conversion script
Instructions
Each file has instructions in its header. I recommend the following commands to launch the evaluations and conversions.