Issues: microsoft/DeepSpeed
Issues list
GPU mem doesn't release after delete tensors in optimizer.bit16groups
#6729
opened Nov 8, 2024 by
wheresmyhair
[BUG] any clue for MFU drop?
bug
Something isn't working
training
#6727
opened Nov 8, 2024 by
SeunghyunSEO
[BUG] [ROCm] Fine-tuning DeepSeek-Coder-V2-Lite-Instruct with 8 MI300X GPUs results in c10::DistBackendError
bug
Something isn't working
rocm
AMD/ROCm/HIP issues
training
#6725
opened Nov 8, 2024 by
nikhil-tensorwave
CUBLAS_STATUS_NOT_SUPPORTED
bug
Something isn't working
training
#6723
opened Nov 7, 2024 by
niebowen666
[BUG] Zero3 for torch.compile with compiled_autograd when running LayerNorm
bug
Something isn't working
training
#6719
opened Nov 6, 2024 by
yitingw1
[BUG] DeepSpeed accuracy issue for torch.compile if activation checkpoint function not compiler disabled
bug
Something isn't working
training
#6718
opened Nov 6, 2024 by
jerrychenhf
[BUG] The problem of using Deepspeed to start training
bug
Something isn't working
training
#6715
opened Nov 5, 2024 by
sanxiaojijiaben
[BUG] Issue with Zero Optimization for Llama-2-7b Fine-Tuning on Intel GPUs
bug
Something isn't working
training
#6713
opened Nov 5, 2024 by
molang66
"__nv_bfloat162" has already been defined
install
Installation and package dependencies
windows
#6709
opened Nov 4, 2024 by
wolfljj
[REQUEST] Some questions about deepspeed sequence parallel
enhancement
New feature or request
#6708
opened Nov 4, 2024 by
yingtongxiong
[BUG] NCCL Timeout When Pre-training "ds_train_bert_nvidia_data_bsz32k_seq512".
bug
Something isn't working
training
#6705
opened Nov 3, 2024 by
always-H
[REQUEST] Non-element-wise Optimizer Compatibility
enhancement
New feature or request
#6701
opened Nov 2, 2024 by
Triang-jyed-driung
How could I convert ZeRO-0 deepspeed weights into fp32 model checkpoint?
enhancement
New feature or request
#6699
opened Nov 1, 2024 by
liming-ai
[BUG] Universal Checkpoint Conversion: Resumed Training Behaves as If Model Initialized from Scratch
bug
Something isn't working
training
#6691
opened Oct 30, 2024 by
purefall
DeepSpeed windows install errors
install
Installation and package dependencies
windows
#6673
opened Oct 27, 2024 by
xiezhipeng-git