I’m fine-tuning a 7B RAG LLM and running into some issues with training speed and CUDA memory constraints. Here are my training parameters:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="models/fine_tuned_2001",
    overwrite_output_dir=True,
    num_train_epochs=3,             # note: max_steps below takes precedence over epochs
    warmup_steps=20,
    logging_strategy="steps",
    logging_steps=10,
    evaluation_strategy="no",
    optim="adamw_torch",
    gradient_accumulation_steps=4,  # effective batch size = 1 x 4 = 4
    save_steps=100,
    save_total_limit=2,
    learning_rate=1e-5,
    per_device_train_batch_size=1,  # reduced to avoid CUDA OOM
    max_steps=1000,
    report_to="wandb",
)
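For reference, here is a variant of the same config with the throughput/memory flags I've been experimenting with. This is a sketch rather than my exact script: bf16, gradient_checkpointing, paged_adamw_8bit, and group_by_length are additions on my part, bf16 assumes an Ampere-or-newer GPU, and the paged 8-bit optimizer requires bitsandbytes to be installed.

from transformers import TrainingArguments

# Sketch: same schedule as above, plus memory/speed-oriented flags.
training_args = TrainingArguments(
    output_dir="models/fine_tuned_2001",
    overwrite_output_dir=True,
    warmup_steps=20,
    logging_strategy="steps",
    logging_steps=10,
    gradient_accumulation_steps=4,
    save_steps=100,
    save_total_limit=2,
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    max_steps=1000,
    report_to="wandb",
    bf16=True,                    # mixed precision: faster matmuls, smaller activations
    gradient_checkpointing=True,  # trade recompute for memory on ~10k-token sequences
    optim="paged_adamw_8bit",     # 8-bit optimizer states via bitsandbytes
    group_by_length=True,         # batch similar lengths to cut padding waste
)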
Setup & Issues:
Hardware: 22GB GPU
Input Length: MAX_LENGTH=10154 (the model takes the query, the answer, and the retrieved chunks as input).
Dataset: ~2K pairs.
Problem:
Training is extremely slow, around 1 minute per step, which means 1000 steps take roughly 16 hours.
I expected some slowdown from the long inputs, but this seems excessive.
Am I overlooking something? Any tips on improving training speed without exceeding memory limits?
Batch Size: I had to reduce per_device_train_batch_size to 1 due to CUDA OOM errors, and also lowered the LoRA settings to r=64, lora_alpha=16 (a sketch of the LoRA side of the setup follows below).
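For completeness, the LoRA side looks roughly like this. It is a sketch, not my exact code: the target_modules list and lora_dropout value are assumptions since I didn't paste my PEFT config above, "base-7b-model" is a placeholder for the actual checkpoint, and the 4-bit loading via bitsandbytes is one option I'm considering to free up memory on the 22GB card.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Sketch: load the 7B base in 4-bit (QLoRA-style) to leave memory headroom.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "base-7b-model",  # placeholder: substitute the actual checkpoint name
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,               # reduced from a larger rank to fit memory
    lora_alpha=16,
    lora_dropout=0.05,  # assumption: not stated above
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)
model = get_peft_model(model, lora_config)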