Resuming training for text-to-spec model does not resume with appropriate losses #534
A more detailed view of the overlap between first run and first resume. It seems, again, that the
While tracing the code during a resume from a checkpoint, we found that we should be getting this warning, because of this check in PyTorch Lightning:

if self.global_step % expected_steps != 0 and not is_resumable_loader:
    rank_zero_warn(
        "You're resuming from a checkpoint that ended before the epoch ended and your dataloader is"
        " not resumable. This can cause unreliable results if further training is done."
        " Consider using an end-of-epoch checkpoint or make your dataloader resumable by implementing"
        " the `state_dict` / `load_state_dict` interface.",
        category=PossibleUserWarning,
    )
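For reference, here is a minimal sketch of the interface that warning asks for, i.e. a dataloader Lightning will treat as resumable because it implements `state_dict` / `load_state_dict`. This is not EveryVoice code; the class name and counter are illustrative:

# Illustrative sketch of a "resumable" dataloader: Lightning's fit loop
# checks for the state_dict / load_state_dict methods mentioned above.
from torch.utils.data import DataLoader

class ResumableDataLoader(DataLoader):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._batches_served = 0  # batches yielded so far

    def __iter__(self):
        for batch in super().__iter__():
            self._batches_served += 1
            yield batch

    def state_dict(self):
        # Saved inside the checkpoint alongside the loop state.
        return {"batches_served": self._batches_served}

    def load_state_dict(self, state):
        # Restored on resume; a real implementation would also need to
        # fast-forward the underlying sampler to this position.
        self._batches_served = state["batches_served"]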
Test code:

#!/bin/bash
# vim:nowrap:
#
#SBATCH --job-name=resume-EV
#SBATCH --partition=gpu_a100
#SBATCH --account=nrc_ict__gpu_a100
#SBATCH --gres=gpu:1
##SBATCH --partition=standard
##SBATCH --account=nrc_ict
#SBATCH --qos=low
#SBATCH --time=720
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=16G
##SBATCH --comment="image=nrc/nrc_all_default_ubuntu-22.04-amd64_latest"
##SBATCH --comment="image=registry.maze-c.collab.science.gc.ca/sschpcs/generic-job:ubuntu22.04_master"
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
#SBATCH --open-mode=append
##SBATCH --mail-user==sam037
##SBATCH --mail-type=NONE
#SBATCH --signal=B:15@30
## Priority:
#$ low
## Resources:
#
#
function downsize {
    # Given a new-project, change some default values to make everyvoice faster to test.
    sed \
        -i \
        -e '/use_postnet/ s/: .*/: false/' \
        -e '/batch_size/ s/: .*/: 5/' \
        -e '/save_top_k_ckpts/ s/: .*/: 25/' \
        -e '/check_val_every_n_epoch/ s/: .*/: 1/' \
        -e '/val_check_interval/ s/: .*/: 1.0/' \
        config/*
}
ulimit -v unlimited
readonly step_size=1
readonly version=${1:?logger version name?}
# Load up EveryVoice's environment.
source /space/partner/nrc/work/dt/sgile/opt/miniconda3/bin/activate ""
conda activate EveryVoice.sl
{
    # Log some information for post-run debugging.
    head -n 31231231 $0        # Log this script in case we make modifications between runs
    ( set -o posix; set ) >&2  # What was the shell environment like?
    conda env export           # What was the conda environment like?
    #downsize
    head -n 12312312 config/*.yaml  # What were the configurations that were used?
    ( cd ~/sam037/git/EveryVoice && git diff main..HEAD; )
} &> $SLURM_JOB_NAME-$SLURM_JOB_ID.bookkeeping
echo "========== PREPROCESSING ==========" >&2
[[ -s preprocessed/filelist.psv ]] ||
    srun everyvoice preprocess config/everyvoice-text-to-spec.yaml --cpus $SLURM_CPUS_PER_TASK
echo "========== RESUMING TRAINING FEATURE PREDICTION ==========" >&2
set -o errexit
# --config-args training.attn_bin_loss_weight=0
# NOTE: if the finetune_checkpoint doesn't exist, training starts from scratch.
#for current_epoch in {0..15..3}; do
for current_epoch in {0..5..1}; do
    # TODO: record each run into its own log + global log
    previous_epoch=$(($current_epoch - 1))
    echo "Training up to $current_epoch"
    model_ckpt="logs_and_checkpoints/FeaturePredictionExperiment/$version/checkpoints/last.ckpt.$previous_epoch"
    sha1sum $model_ckpt || true
    srun everyvoice train text-to-spec \
        config/everyvoice-text-to-spec.yaml \
        --config-args training.finetune_checkpoint="$model_ckpt" \
        --config-args training.max_epochs=$((current_epoch+$step_size)) \
        --config-args training.logger.version=$version
    mv "logs_and_checkpoints/FeaturePredictionExperiment/$version/metrics.csv" \
        "logs_and_checkpoints/FeaturePredictionExperiment/$version/metrics.csv.$current_epoch" \
        || true
    ls -l logs_and_checkpoints/FeaturePredictionExperiment/$version/checkpoints/*
    cp "logs_and_checkpoints/FeaturePredictionExperiment/$version/checkpoints/last.ckpt" \
        "logs_and_checkpoints/FeaturePredictionExperiment/$version/checkpoints/last.ckpt.$current_epoch"
    sha1sum logs_and_checkpoints/FeaturePredictionExperiment/$version/checkpoints/*
done
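To check whether the losses actually line up across the resume boundaries this script creates, a sketch like the following can stitch the saved metrics.csv.N files back together. The glob pattern mirrors the paths above; the `step` column follows Lightning's CSVLogger defaults, and the loss column names depend on the model, so they are selected generically:

# Concatenate the per-run metrics files saved by the script above and
# line them up by global step to spot discontinuities at resume points.
import glob
import pandas as pd

pattern = "logs_and_checkpoints/FeaturePredictionExperiment/*/metrics.csv.*"
frames = []
for path in sorted(glob.glob(pattern)):
    df = pd.read_csv(path)
    df["source_file"] = path  # which run this row came from
    frames.append(df)

metrics = pd.concat(frames, ignore_index=True).sort_values("step")
loss_cols = [c for c in metrics.columns if "loss" in c]
# If resuming worked, each loss should continue smoothly across file
# boundaries instead of jumping back toward its initial value.
print(metrics[["step", "source_file", *loss_cols]].dropna(how="all", subset=loss_cols))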
Let's make sure we always save at the end of a training epoch by adding:

# This callback will always save the last checkpoint
# regardless of its performance.
last_ckpt_callback = ModelCheckpoint(
    save_top_k=1,
    save_last=True,
    every_n_train_steps=config.training.ckpt_steps,
    every_n_epochs=config.training.ckpt_epochs,
    enable_version_counter=True,
    save_on_train_epoch_end=True,
)
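For context, here is a minimal sketch of how a callback like this is typically wired into a Lightning Trainer; `model` and the Trainer arguments are placeholders, not EveryVoice's actual trainer code. The relevant Lightning behaviour is that optimizer, scheduler, and loop state are only restored when the checkpoint is passed to `fit` as `ckpt_path`; loading weights alone is not a full resume:

# Hypothetical wiring; `model` and `config` stand in for EveryVoice's own objects.
from pytorch_lightning import Trainer

trainer = Trainer(
    max_epochs=config.training.max_epochs,
    callbacks=[last_ckpt_callback],
)
# Passing ckpt_path restores weights *and* optimizer/scheduler/loop state,
# which is what a resume with continuous losses requires.
trainer.fit(model, ckpt_path="last.ckpt")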
To recreate: train a text-to-spec model, then add a `finetune_checkpoint` value and resume training (you might need to change another model value, like the max number of steps). The losses do not resume properly. Maybe related to #473 and #419.