How to fine-tune on a single GPU #79

Open
sxsxsx opened this issue Dec 19, 2023 · 1 comment

Comments

sxsxsx commented Dec 19, 2023

cd finetune && deepspeed finetune_deepseekcoder.py \
    --model_name_or_path $MODEL_PATH \
    --data_path $DATA_PATH \
    --output_dir $OUTPUT_PATH \
    --num_train_epochs 3 \
    --model_max_length 1024 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 100 \
    --learning_rate 2e-5 \
    --warmup_steps 10 \
    --logging_steps 1 \
    --lr_scheduler_type "cosine" \
    --gradient_checkpointing True \
    --report_to "tensorboard" \
    --deepspeed configs/ds_config_zero3.json \
    --bf16 True

[2023-12-19 16:10:57,887] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
[2023-12-19 16:11:06,596] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-12-19 16:11:06,596] [INFO] [runner.py:570:main] cmd = /home/admin/miniconda3/envs/deepseek/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None finetune_deepseekcoder.py --model_name_or_path deepseek-ai/deepseek-coder-6.7b-instruct --data_path ../data/nickroshEvol-Instruct-Code-80k-v1/EvolInstruct-Code-80k.json --output_dir ./outputs --num_train_epochs 3 --model_max_length 1024 --per_device_train_batch_size 16 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 100 --save_total_limit 100 --learning_rate 2e-5 --warmup_steps 10 --logging_steps 1 --lr_scheduler_type cosine --gradient_checkpointing True --report_to tensorboard --deepspeed configs/ds_config_zero3.json --bf16 True
[2023-12-19 16:11:12,734] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
[2023-12-19 16:11:16,782] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2023-12-19 16:11:16,782] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-12-19 16:11:16,782] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-12-19 16:11:16,782] [INFO] [launch.py:163:main] dist_world_size=1
[2023-12-19 16:11:16,782] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
[2023-12-19 16:11:28,688] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-19 16:11:30,064] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-19 16:11:30,065] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "/workspace/workdir/tevs_multi_idc_10g_20220825163730/lyq/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 193, in <module>
    train()
  File "/workspace/workdir/tevs_multi_idc_10g_20220825163730/lyq/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 123, in train
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 123, in __init__
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/transformers/training_args.py", line 1493, in __post_init__
    and (self.device.type != "cuda")
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/transformers/training_args.py", line 1941, in device
    return self._setup_devices
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/transformers/training_args.py", line 1867, in _setup_devices
    self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/accelerate/state.py", line 183, in __init__
    dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 670, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 120, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 146, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1279, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(backend_prefix_store, group_rank, group_size, pg_options)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
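
Note for readers of this log: two opaque values in it can be decoded, which may help narrow the problem down. The sketch below is not part of the DeepSeek-Coder repo, and the helper names are made up for illustration. It assumes the usual CUDA version encoding (1000 * major + 10 * minor) for the "found version 11070" warning, and that the launcher's --world_info argument is URL-safe base64 over JSON, which is consistent with the WORLD INFO DICT line printed right after it.

```python
import base64
import json


def decode_driver_version(raw: int) -> tuple[int, int]:
    """Split a CUDA driver version such as 11070 into (major, minor)."""
    # Assumed encoding: 1000 * major + 10 * minor.
    return raw // 1000, (raw % 1000) // 10


def decode_world_info(encoded: str) -> dict:
    """Decode the base64 JSON string passed to the launcher as --world_info."""
    return json.loads(base64.urlsafe_b64decode(encoded))


# Values copied from the log above.
print(decode_driver_version(11070))
print(decode_world_info("eyJsb2NhbGhvc3QiOiBbMF19"))
```

Under those assumptions the driver supports CUDA up to 11.7, and the launcher did request a single process on GPU slot 0 of localhost, so the single-GPU launch itself looks fine. The failure comes later, when PyTorch cannot initialize CUDA.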

yh-xu commented Feb 4, 2024

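It seems your environment has no GPU device that PyTorch can use. The warning in your log says the NVIDIA driver is too old for the installed PyTorch build (found version 11070), so CUDA initialization fails and NCCL finds no GPUs. As the warning suggests, either update the GPU driver or install a PyTorch version compiled for your driver's CUDA version.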
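
A quick way to confirm this before launching deepspeed again is to ask PyTorch directly what it can see. This is a minimal sketch, not something from the repo: `gpu_report` is a hypothetical helper name, and it only uses `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.device_count()`.

```python
def gpu_report() -> dict:
    """Summarise what PyTorch can see, without raising if torch is missing."""
    report = {
        "torch_installed": False,
        "built_for_cuda": None,   # CUDA version the PyTorch wheel was built for
        "cuda_available": False,
        "device_count": 0,
    }
    try:
        import torch
    except ImportError:
        return report

    report["torch_installed"] = True
    report["built_for_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    print(gpu_report())
```

If `cuda_available` is False and `device_count` is 0, the deepspeed run will most likely fail with the same NCCL error. Comparing `built_for_cuda` with the CUDA version that `nvidia-smi` reports for the driver should show whether the driver or the PyTorch build needs to change.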
