How to finetune on a single GPU #79

sxsxsx · 2023-12-19T16:23:01Z

cd finetune && deepspeed finetune_deepseekcoder.py --model_name_or_path $MODEL_PATH --data_path $DATA_PATH --output_dir $OUTPUT_PATH --num_train_epochs 3 --model_max_length 1024 --per_device_train_batch_size 16 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy "no" --save_strategy "steps" --save_steps 100 --save_total_limit 100 --learning_rate 2e-5 --warmup_steps 10 --logging_steps 1 --lr_scheduler_type "cosine" --gradient_checkpointing True --report_to "tensorboard" --deepspeed configs/ds_config_zero3.json --bf16 True

[2023-12-19 16:10:57,887] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
[2023-12-19 16:11:06,596] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-12-19 16:11:06,596] [INFO] [runner.py:570:main] cmd = /home/admin/miniconda3/envs/deepseek/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None finetune_deepseekcoder.py --model_name_or_path deepseek-ai/deepseek-coder-6.7b-instruct --data_path ../data/nickroshEvol-Instruct-Code-80k-v1/EvolInstruct-Code-80k.json --output_dir ./outputs --num_train_epochs 3 --model_max_length 1024 --per_device_train_batch_size 16 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 100 --save_total_limit 100 --learning_rate 2e-5 --warmup_steps 10 --logging_steps 1 --lr_scheduler_type cosine --gradient_checkpointing True --report_to tensorboard --deepspeed configs/ds_config_zero3.json --bf16 True
[2023-12-19 16:11:12,734] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
[2023-12-19 16:11:16,782] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2023-12-19 16:11:16,782] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-12-19 16:11:16,782] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-12-19 16:11:16,782] [INFO] [launch.py:163:main] dist_world_size=1
[2023-12-19 16:11:16,782] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
[2023-12-19 16:11:28,688] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-19 16:11:30,064] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-19 16:11:30,065] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
File "/workspace/workdir/tevs_multi_idc_10g_20220825163730/lyq/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 193, in <module>
train()
File "/workspace/workdir/tevs_multi_idc_10g_20220825163730/lyq/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 123, in train
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "<string>", line 123, in __init__
File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/transformers/training_args.py", line 1493, in __post_init__
and (self.device.type != "cuda")
File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/transformers/training_args.py", line 1941, in device
return self._setup_devices
File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in __get__
cached = self.fget(obj)
File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/transformers/training_args.py", line 1867, in _setup_devices
self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/accelerate/state.py", line 183, in __init__
dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 670, in init_distributed
cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 120, in __init__
self.init_process_group(backend, timeout, init_method, rank, world_size)
File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 146, in init_process_group
torch.distributed.init_process_group(backend,
File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
func_return = func(*args, **kwargs)
File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1148, in init_process_group
default_pg, _ = _new_process_group_helper(
File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1279, in _new_process_group_helper
backend_class = ProcessGroupNCCL(backend_prefix_store, group_rank, group_size, pg_options)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

yh-xu · 2024-02-04T09:09:37Z

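It seems PyTorch cannot see any GPU device in your environment. The CUDA initialization warning in your log says the NVIDIA driver is too old (found version 11070), so the NCCL backend fails with "no GPUs found". Updating the driver, or installing a PyTorch build compiled for your driver's CUDA version as the warning suggests, should be the first step.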
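
A quick way to confirm this before launching `deepspeed` again is a small check like the one below. It is only a sketch: the helper names `decode_cuda_driver_version` and `describe_gpu_visibility` are made up for illustration, and the decoding assumes the number in the warning follows CUDA's usual `1000 * major + 10 * minor` version encoding (which would make 11070 a CUDA 11.7 driver).

```python
import importlib.util


def decode_cuda_driver_version(raw: int) -> tuple[int, int]:
    """Split a CUDA version integer (assumed 1000 * major + 10 * minor) into (major, minor)."""
    return raw // 1000, (raw % 1000) // 10


def describe_gpu_visibility() -> str:
    """Summarize whether PyTorch can use a CUDA device; does not raise if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    if not torch.cuda.is_available():
        # torch.version.cuda is the CUDA version this PyTorch build was compiled for
        # (None for CPU-only builds); compare it with the driver version above.
        return (
            f"torch {torch.__version__} (built for CUDA {torch.version.cuda}) "
            "sees no usable GPU"
        )
    return (
        f"torch sees {torch.cuda.device_count()} GPU(s); "
        f"device 0 is {torch.cuda.get_device_name(0)}"
    )


if __name__ == "__main__":
    # 11070 is the value reported in the warning from the log above.
    print(decode_cuda_driver_version(11070))  # (11, 7)
    print(describe_gpu_visibility())
```

If the second line reports no usable GPU, the finetuning command will keep failing at `init_process_group` regardless of the DeepSpeed settings, so the driver or PyTorch version mismatch has to be resolved first.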
