Commit 063575d: Create distributed_finetune.md
1 parent b171223

doc/distributed_finetune.md (1 file changed, +74 lines)

# 📢 Distributed Fine-tuning

We conducted a distributed fine-tuning experiment on our WizardLM using the original Llama-X project. Keeping the same hyperparameters as in the Fine-tuning section, we extended the experiment to multiple nodes.

To reproduce our experiments, we provide the steps and system configuration here.

## Steps
We assume you have three GPU nodes to be used for training, worker-0, worker-1, and worker-2, and that they can ssh into each other via private key. We assume worker-0 is the master node: it has an open port MASTER_PORT that worker-1 and worker-2 can reach directly, and an address MASTER_IP that the other nodes can access.

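The launch command at the end references MASTER_IP and MASTER_PORT as shell variables on worker-0. A minimal sketch of setting them (the values below are placeholders of ours, not part of the original setup) is:
```bash
# Placeholder values; use worker-0's actual reachable address and any open port.
export MASTER_IP=10.0.0.1
export MASTER_PORT=29500
```
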
On each worker, configure your environment using the instructions in Llama-X. All workers should use the same absolute paths for the data, output, and code folders, and their configurations should be exactly identical.

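One way to keep those paths identical (a sketch on our part, assuming rsync over ssh is available and that the placeholder paths match the ones used later) is to mirror the code and data from worker-0 to the other workers:
```bash
# Hypothetical paths; substitute your actual Llama-X code and data locations.
for host in worker-1 worker-2; do
    rsync -az /path/to/Llama-X/ "$host":/path/to/Llama-X/
    rsync -az /path/to/alpaca_evol_instruct_70k.json "$host":/path/to/alpaca_evol_instruct_70k.json
done
```
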
After that, we need to change the hostfile config (*/path/to/Llama-X/src/configs/hostfile*) on each node and add every worker to it, assuming 8 GPUs per worker:
```bash
worker-0 slots=8
worker-1 slots=8
worker-2 slots=8
```

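Before launching, it can help to confirm that worker-0 can reach every host in the hostfile without a password prompt and that each host really exposes 8 GPUs. A minimal check (assuming nvidia-smi is installed on every worker) might look like:
```bash
# Run from worker-0; each host should print its name and a GPU count of 8.
for host in worker-0 worker-1 worker-2; do
    ssh "$host" 'echo "$(hostname): $(nvidia-smi -L | wc -l) GPUs"'
done
```
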
Since NCCL communication problems can arise given the complexity of every cluster, we recommend this config:
```bash
NCCL_DEBUG=INFO
NCCL_ASYNC_ERROR_HANDLING=1
NCCL_BUFFSIZE=2097152
```
A good way is to write these variables into the "*.deepspeed_env*" file in the home folder of each node. You can refer to this source for how to do it: [multi-node-environment-variables](https://www.deepspeed.ai/getting-started/#multi-node-environment-variables)

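As a minimal sketch (assuming the variables above and the same home directory layout on every node), you could write the file once on worker-0 and copy it to the other workers:
```bash
# Write ~/.deepspeed_env on worker-0, then copy it to the other nodes.
cat > ~/.deepspeed_env <<'EOF'
NCCL_DEBUG=INFO
NCCL_ASYNC_ERROR_HANDLING=1
NCCL_BUFFSIZE=2097152
EOF
for host in worker-1 worker-2; do
    scp ~/.deepspeed_env "$host":~/
done
```
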
Finally, with everything set up, run this command on worker-0 (set --num_nodes to the number of workers you want to use from the hostfile, all three here). Enjoy your flight!
```bash
deepspeed --num_gpus 8 \
    --num_nodes 3 \
    --master_addr $MASTER_IP \
    --master_port $MASTER_PORT \
    --hostfile /path/to/Llama-X/src/configs/hostfile \
    train_freeform.py \
    --model_name_or_path /path/to/llama-7B/hf \
    --data_path /path/to/alpaca_evol_instruct_70k.json \
    --output_dir /path/to/wizardlm-7B/hf/ft \
    --num_train_epochs 3 \
    --model_max_length 2048 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 800 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --warmup_steps 2 \
    --logging_steps 2 \
    --lr_scheduler_type "cosine" \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/deepspeed_config.json \
    --fp16 True
```
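If you keep these values unchanged, the effective global batch size works out to per_device_train_batch_size × GPUs per node × nodes × gradient_accumulation_steps = 8 × 8 × 3 × 1 = 192 sequences per optimizer step.
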
## Troubleshooting

Here are some common problems you may see in the console output:

1. "Call to ibv_reg_mr failed"
2. "ib_plugin.c:670 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129"

If you have IB (InfiniBand) in your system, these errors can be triggered by a missing ulimit configuration. If you run your experiment in a Docker container, this option can be used to lift the ulimit restriction:
```bash
docker ... --ulimit memlock=-1
```
Alternatively, you can use the solution from the official NCCL documentation, [troubleshooting.html#infiniband](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#infiniband), and make sure you log in to each worker as the ROOT user, or use root privileges, to lift the limitation.

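As a sketch of that non-container route (our example, not part of the original write-up), the locked-memory limit can be raised system-wide on each worker by editing */etc/security/limits.conf* and logging in again:
```bash
# Requires root; raises the locked-memory (memlock) limit for all users.
echo '* soft memlock unlimited' | sudo tee -a /etc/security/limits.conf
echo '* hard memlock unlimited' | sudo tee -a /etc/security/limits.conf
ulimit -l   # should report "unlimited" in a fresh login shell
```
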
The other case is that you don't have IB and only have a normal network card in each worker; then you can use this config in *.deepspeed_env* to disable IB and communicate over the regular network:
```bash
NCCL_DEBUG=INFO
NCCL_P2P_DISABLE=1
NCCL_ASYNC_ERROR_HANDLING=1
NCCL_IB_DISABLE=1
NCCL_SOCKET_IFNAME=ens9f1
```
NCCL_SOCKET_IFNAME needs to be changed to your worker's actual network interface name; use *ifconfig* to find it.

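If *ifconfig* is not available, one alternative (assuming the iproute2 tools are installed) is to list interface names and their IPv4 addresses like this:
```bash
# Each line shows an interface name and its IPv4 address; pick the one on the training network.
ip -o -4 addr show | awk '{print $2, $4}'
```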
