Hi, for example, I am training a job using this yaml. How do I continue training if the job fails? Thanks.
You can add a load path as a trainer argument in that yaml to resume a job from an earlier checkpoint.
Something like this:
```yaml
trainer:
  _target_: composer.Trainer
  device: gpu
  max_duration: 850000ba
  eval_interval: 10000ba
  device_train_microbatch_size: 16
  run_name: ${name}
  seed: ${seed}
  load_path:  # Path to checkpoint to resume training from
  save_folder:  # Insert path to save folder or bucket
  save_interval: 10000ba
  save_overwrite: true
  autoresume: false
  fsdp_config:
    sharding_strategy: "SHARD_GRAD_OP"
```
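For illustration, a filled-in `load_path` might look like the sketch below. The bucket, run name, and checkpoint filename are placeholders, not values from this project; point them at whatever `save_folder` your failed run was writing to and pick the latest checkpoint it saved.

```yaml
  # Hypothetical values only: replace the URI and checkpoint name with the
  # save_folder and checkpoint actually written by your earlier run.
  load_path: s3://my-bucket/my-run/checkpoints/ep0-ba650000-rank0.pt
  save_folder: s3://my-bucket/my-run/checkpoints
```

With `load_path` set, Composer restores the trainer state from that checkpoint and continues training from the saved batch count rather than starting over.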