# 📢 Distributed Fine-tuning
We conducted a distributed fine-tuning experiment on WizardLM using the original Llama-X project. Given the same hyperparameters as in the Fine-tuning section, we extended the experiment to multiple nodes.
To reproduce our experiments, we provide the steps and system configuration here.

## Steps
We assume you have three GPU nodes for training, worker-0, worker-1, and worker-2, and that they can SSH into each other via private key. We assume worker-0 is the master node: it has an open port MASTER_PORT that worker-1 and worker-2 can reach directly, and an address MASTER_IP that the other nodes can access.
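
For reference, here is a minimal sketch of setting up passwordless SSH (the host names match the assumption above; adjust them to your cluster):
```bash
# On worker-0: generate a key pair if one does not exist yet,
# then install the public key on every worker (including worker-0).
ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""
for w in worker-0 worker-1 worker-2; do
  ssh-copy-id "$w"
done
# Repeat from worker-1 and worker-2 so every node can reach every other node.
```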

On each worker, configure your environment using the instructions in Llama-X. All workers should use the same absolute paths for the data, output, and code folders, and their configurations should be exactly the same.
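
As an illustration only (the environment name and Python version below are assumptions; follow the actual Llama-X instructions), the per-worker setup might look like:
```bash
# Run identically on every worker, keeping the same absolute paths.
conda create -n llamax python=3.10 -y
conda activate llamax
cd /path/to/Llama-X            # same code folder on each worker
pip install -r requirements.txt
```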

After that, change the hostfile config (*/path/to/Llama-X/src/configs/hostfile*) on each node and add every worker to it, assuming 8 GPUs per worker:
```bash
worker-0 slots=8
worker-1 slots=8
worker-2 slots=8
```
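
DeepSpeed launches the remote processes over SSH (via pdsh by default), so a quick non-interactive connectivity check from worker-0 can save debugging time later; a sketch:
```bash
# Each worker should answer without a password prompt and report 8 GPUs.
for w in worker-0 worker-1 worker-2; do
  ssh -o BatchMode=yes "$w" 'hostname && nvidia-smi -L | wc -l'
done
```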

Since NCCL communication problems can arise given the complexity of each cluster, we recommend using this config:
```bash
NCCL_DEBUG=INFO
NCCL_ASYNC_ERROR_HANDLING=1
NCCL_BUFFSIZE=2097152
```
A good approach is to write these variables into the "*.deepspeed_env*" file in the home folder on each node. You can refer to [multi-node-environment-variables](https://www.deepspeed.ai/getting-started/#multi-node-environment-variables) for how to do this.
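
For example, the variables can be appended on each node like this (once is enough if your home folder is on shared storage):
```bash
# ~/.deepspeed_env holds simple KEY=VALUE lines, one variable per line.
cat >> ~/.deepspeed_env <<'EOF'
NCCL_DEBUG=INFO
NCCL_ASYNC_ERROR_HANDLING=1
NCCL_BUFFSIZE=2097152
EOF
```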

Finally, with everything set up, run this command on worker-0. Enjoy your flight!
```bash
deepspeed --num_gpus 8 \
    --num_nodes 3 \
    --master_addr $MASTER_IP \
    --master_port $MASTER_PORT \
    --hostfile /path/to/Llama-X/src/configs/hostfile \
    train_freeform.py \
    --model_name_or_path /path/to/llama-7B/hf \
    --data_path /path/to/alpaca_evol_instruct_70k.json \
    --output_dir /path/to/wizardlm-7B/hf/ft \
    --num_train_epochs 3 \
    --model_max_length 2048 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 800 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --warmup_steps 2 \
    --logging_steps 2 \
    --lr_scheduler_type "cosine" \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/deepspeed_config.json \
    --fp16 True
```
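
With 3 nodes × 8 GPUs, a per-device batch size of 8, and no gradient accumulation, this gives an effective global batch size of 192 sequences per step. Since the run reports to TensorBoard, you can watch the loss curves from worker-0 (the port choice here is arbitrary):
```bash
# The HF Trainer writes TensorBoard event files under the output dir.
tensorboard --logdir /path/to/wizardlm-7B/hf/ft --port 6006
```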

## Troubleshooting
Here are some common problems you may see in the console output:
1. "Call to ibv_reg_mr failed"
2. "ib_plugin.c:670 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129"

If you have IB (InfiniBand) in your system, these problems can be triggered by a missing ulimit configuration. If you run your experiment in a Docker container, this option can be used to lift the ulimit restriction:
```bash
docker ... --ulimit memlock=-1
```
Alternatively, you can apply the solution from the official NCCL documentation, [troubleshooting.html#infiniband](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#infiniband); make sure you log in to each worker as the root user, or use root privileges, to lift the limitation.
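
Outside Docker, the NCCL guide's fix amounts to raising the locked-memory limit system-wide; a sketch (requires root, and takes effect on the next login):
```bash
# Allow all users to lock unlimited memory, as the NCCL guide suggests.
cat <<'EOF' | sudo tee -a /etc/security/limits.conf
* soft memlock unlimited
* hard memlock unlimited
EOF
# Log in again, then verify on each worker:
ulimit -l    # should print "unlimited"
```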

Another issue arises when you don't have IB and only have a normal network card on each worker. In that case, you can use these settings in *.deepspeed_env* to disable IB and communicate over the regular network:
```bash
NCCL_DEBUG=INFO
NCCL_P2P_DISABLE=1
NCCL_ASYNC_ERROR_HANDLING=1
NCCL_IB_DISABLE=1
NCCL_SOCKET_IFNAME=ens9f1
```
NCCL_SOCKET_IFNAME needs to be changed to your worker's actual network interface name, which you can find with *ifconfig*.
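
For example (the interface name *ens9f1* above is just what our workers happened to use):
```bash
# List interfaces and their addresses; pick the one carrying
# inter-node traffic (usually the one with your cluster-internal IP).
ifconfig
# or, on systems without ifconfig:
ip -brief addr
```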