
LoRA or full fine-tuning: FSDP settings cause the training loss to become NaN. How can this be fixed? #114

@Owen1234560

Description


Has anyone run into the same situation?
Full fine-tuning with accelerate + FSDP: the loss is NaN. After changing fsdp_use_orig_params in the original FSDP config to False, the loss is no longer NaN, but training does not converge.
LoRA fine-tuning with accelerate + FSDP: the loss is NaN.
LoRA fine-tuning with accelerate + DDP: the loss decreases normally.
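Not part of the original report, just a hedged debugging sketch: PyTorch's built-in anomaly detection can point at the operation that first produces a non-finite gradient, which helps separate a data/precision problem from an FSDP-wrapping problem. Here `model` and `batch` are placeholders for the actual objects in the training loop.

import torch

# Enable only for a few debugging steps; anomaly detection slows training noticeably.
torch.autograd.set_detect_anomaly(True)

outputs = model(**batch)   # placeholder: the actual forward call of the training loop
loss = outputs.loss
loss.backward()            # raises and reports the op that produced the first NaN/Inf gradient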

Below is the original FSDP config:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_reshard_after_forward: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: BasicAVTransformerBlock
  fsdp_use_orig_params: true
  fsdp_version: 1
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
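A second hedged sketch, again not from the original report (install_nan_hooks is a hypothetical helper name, and model / batch are placeholders): registering forward hooks on every module reports the first module whose output is non-finite. If the NaN already shows up in the forward pass under FSDP but not under DDP, that hints at how the modules are wrapped or cast rather than at the data itself.

import torch
import torch.nn as nn

def install_nan_hooks(model: nn.Module):
    """Register forward hooks that raise on the first non-finite module output."""
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if isinstance(t, torch.Tensor) and not torch.isfinite(t).all():
                    raise RuntimeError(f"Non-finite output detected in module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage (placeholder objects from the training script):
# install_nan_hooks(model)
# loss = model(**batch).loss   # raises inside the first offending module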
