With RHELAI built from source on the 2nd of July, ilab train fails on two commands because of NumPy version conflicts.
Here is the full output (after adding set -x at the beginning of ilab-training-launcher):
ilab train
WARNING: You need at least 2 GPUs to load full precision models
+ [[ 7 -lt 6 ]]
+ NPROC_PER_NODE=1
+ EFFECTIVE_BATCH_SIZE=12
+ TRAIN_DEVICE=cuda
+ SAMPLE_SIZE=5000
+ NUM_EPOCHS=10
+ CONTAINER_DEVICE=nvidia.com/gpu=all
+ CONTAINER_NAME=quay.io/ai-lab/deepspeed-trainer:latest
++ pwd
+ SDG_OUTPUT_PATH=/root/my-project
+ SAVE_SAMPLES=4999
+ TESTING_DATA_PATH=/instructlab/generated
+ TRAINING_DATA_PATH=/instructlab/generated
+ DATASET_NAME=ilab-generated
+ CONTAINER_CACHE=/instructlab/cache
++ pwd
+ WORKDIR=/root/my-project
+ PODMAN_COMMAND=("podman" "run" "--device" "${CONTAINER_DEVICE}" "--security-opt" "label=disable" "--entrypoint" "" "-v" "${SDG_OUTPUT_PATH}":/instructlab "${CONTAINER_NAME}")
+ mkdir -p /root/my-project/training
+ podman run --device nvidia.com/gpu=all --security-opt label=disable --entrypoint '' -v /root/my-project:/instructlab quay.io/ai-lab/deepspeed-trainer:latest bash -c 'python /training/src/instructlab/training/ilab_to_sdg.py "/instructlab/generated" train "ilab-generated"; mv sdg_out.jsonl /instructlab/training/train.jsonl'
Converting /instructlab/generated/train_gpt-4-turbo_2024-07-04T16_26_57.jsonl
+ podman run --device nvidia.com/gpu=all --security-opt label=disable --entrypoint '' -v /root/my-project:/instructlab quay.io/ai-lab/deepspeed-trainer:latest bash -c 'python /training/src/instructlab/training/ilab_to_sdg.py "/instructlab/generated" test "ilab-generated"; mv sdg_out.jsonl /instructlab/training/test.jsonl'
Converting /instructlab/generated/test_gpt-4-turbo_2024-07-04T16_26_57.jsonl
+ podman run --device nvidia.com/gpu=all --security-opt label=disable --entrypoint '' -v /root/my-project:/instructlab quay.io/ai-lab/deepspeed-trainer:latest bash -c 'cat /training/sample-data/train_all_pruned_SDG.jsonl >> /instructlab/training/train.jsonl'
+ podman run --device nvidia.com/gpu=all --security-opt label=disable --entrypoint '' -v /root/my-project:/instructlab quay.io/ai-lab/deepspeed-trainer:latest bash -c 'python /training/src/instructlab/training/data_process.py --logging_level INFO --data_path /instructlab/training/train.jsonl --data_output_path=/instructlab/training --max_seq_len 4096 --model_name_or_path /instructlab/models/ibm/granite-7b-base'
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.
Traceback (most recent call last):
  File "/training/src/instructlab/training/data_process.py", line 9, in <module>
    from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast
  File "/usr/local/lib/python3.9/site-packages/transformers/__init__.py", line 26, in <module>
    from . import dependency_versions_check
  File "/usr/local/lib/python3.9/site-packages/transformers/dependency_versions_check.py", line 16, in <module>
    from .utils.versions import require_version, require_version_core
  File "/usr/local/lib/python3.9/site-packages/transformers/utils/__init__.py", line 33, in <module>
    from .generic import (
  File "/usr/local/lib/python3.9/site-packages/transformers/utils/generic.py", line 465, in <module>
    import torch.utils._pytree as _torch_pytree
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 1382, in <module>
    from .functional import *  # noqa: F403
  File "/usr/local/lib64/python3.9/site-packages/torch/functional.py", line 7, in <module>
    import torch.nn.functional as F
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/usr/local/lib64/python3.9/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
Traceback (most recent call last):
  File "/training/src/instructlab/training/data_process.py", line 13, in <module>
    from instructlab.training.config import DataProcessArgs
ModuleNotFoundError: No module named 'instructlab'
+ PODMAN_COMMAND=("podman" "run" "--rm" "-it" "--device" "${CONTAINER_DEVICE}" "--shm-size=10g" "--security-opt" "label=disable" "--net" "host" "-v" "${WORKDIR}:/instructlab" "--entrypoint" "" "-e" "HF_HOME=${CONTAINER_CACHE}" "${CONTAINER_NAME}")
+ mkdir -p training_output
+ podman run --rm -it --device nvidia.com/gpu=all --shm-size=10g --security-opt label=disable --net host -v /root/my-project:/instructlab --entrypoint '' -e HF_HOME=/instructlab/cache quay.io/ai-lab/deepspeed-trainer:latest torchrun --nnodes 1 --node_rank 0 --nproc_per_node 1 --rdzv_id 101 --rdzv_endpoint 0.0.0.0:8888 /training/main_ds.py --model_name_or_path /instructlab/models/ibm/granite-7b-base --data_path /instructlab/training/data.jsonl --output_dir=/instructlab/training_output --num_epochs=10 --effective_batch_size=12 --learning_rate=2e-5 --num_warmup_steps=385 --save_samples=4999 --log_level=INFO --sharding_strategy=HYBRID_SHARD --seed=19347
+ tee training_output/0.log
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 5, in <module>
    from torch.distributed.run import main
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 1382, in <module>
    from .functional import *  # noqa: F403
  File "/usr/local/lib64/python3.9/site-packages/torch/functional.py", line 7, in <module>
    import torch.nn.functional as F
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/usr/local/lib64/python3.9/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/usr/local/lib64/python3.9/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/usr/bin/python3: can't open file '/training/main_ds.py': [Errno 2] No such file or directory
[2024-07-04 19:56:29,389] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 28) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib64/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib64/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib64/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib64/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/training/main_ds.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-04_19:56:29
  host      : host.containers.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 28)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
+ echo
+ echo
+ [[ -d /root/my-project/training_output/hf_format ]]
+ echo 'Warning: No results were written!'
Warning: No results were written!
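To confirm which NumPy version the container image ships, something like the following could be run with the image's Python interpreter. This is only a minimal sketch: it uses just the standard library, the helper names numpy_major and needs_numpy_downgrade are hypothetical, and the "major version 2 or later" rule simply mirrors the 'numpy<2' hint in the message above.

```python
from importlib import metadata


def numpy_major(version: str) -> int:
    """Return the major component of a version string such as '2.0.0'."""
    return int(version.split(".")[0])


def needs_numpy_downgrade(version: str) -> bool:
    """True when the version is 2.x or later, i.e. it violates 'numpy<2'."""
    return numpy_major(version) >= 2


if __name__ == "__main__":
    try:
        # Read the installed version from package metadata, without importing numpy.
        installed = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        print("numpy is not installed in this environment")
    else:
        if needs_numpy_downgrade(installed):
            print(f"numpy {installed} found: modules built against 1.x may fail to load")
        else:
            print(f"numpy {installed} found: satisfies numpy<2")
```

If it reports a 2.x version, that would match the warning printed by both failing commands.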
Thanks!
ivanbaldo pushed a commit to ivanbaldo/rhelai-dev-preview that referenced this issue on Jul 9, 2024