[BUG] encounter error when running sok dlrm benchmark #461

Open
Orca-bit opened this issue Oct 21, 2024 · 3 comments

@Orca-bit

Describe the bug

  1. Create train.bin and test.bin following the HugeCTR DLRM sample. The md5sums match.
  2. Split the data using the SOK preprocessing script split_bin.py, replacing --slot_size_array with the list from train.py in the HugeCTR DLRM sample. All other arguments are left at their defaults. Do the default dtypes (int32) for label_raw_type, dense_raw_type and category_raw_type need to be changed?
  3. horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=./splited_dataset/ --global_batch=65536 --epochs=1 --lr=24
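
On the dtype question in step 2: a raw binary file carries no type information, so a reader that assumes a different element width than the writer used will see a different number of elements. The snippet below is only a standard-library illustration of that effect. It does not reproduce the SOK or HugeCTR file layout, which is not described in this issue, and it makes no claim about which dtype those files actually use.

```python
import struct

# Eight raw bytes, written as two little-endian 32-bit integers (1 and 2).
raw = struct.pack("<2i", 1, 2)

# Read back with the width they were written in: two elements.
as_int32 = struct.unpack("<2i", raw)

# Read back as 64-bit: one element, and its value is unrelated to 1 or 2.
as_int64 = struct.unpack("<1q", raw)

print(len(as_int32), as_int32)  # 2 (1, 2)
print(len(as_int64), as_int64)  # 1 (8589934593,)
```

If the raw types passed to split_bin.py did not match how train.bin and test.bin were written, element counts downstream could be off in a similar way.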

After running iteration 3790, the following errors occur. It looks like something is wrong with the dataset.

[1,6]<stderr>:Traceback (most recent call last):
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/main.py", line 146, in <module>
[1,6]<stderr>:    trainer.train(eval_in_last=False, early_stop=args.early_stop, epochs=args.epochs)
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/trainer.py", line 247, in train
[1,6]<stderr>:    auc = evaluate(self._model, self._test_dataset, self._auc_thresholds)
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/trainer.py", line 20, in evaluate
[1,6]<stderr>:    for idx, (samples, labels) in enumerate(dataset):
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/dataset.py", line 152, in __getitem__
[1,6]<stderr>:    return self._prefetch_queue.get().result()
[1,6]<stderr>:  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[1,6]<stderr>:    return self.__get_result()
[1,6]<stderr>:  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[1,6]<stderr>:    raise self._exception
[1,6]<stderr>:  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[1,6]<stderr>:    result = self.fn(*self.args, **self.kwargs)
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/dataset.py", line 205, in _get
[1,6]<stderr>:    tf.RaggedTensor.from_row_lengths(flat_values, row_lengths[i])
[1,6]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
[1,6]<stderr>:    raise e.with_traceback(filtered_tb) from None
[1,6]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/check_ops.py", line 485, in _binary_assert
[1,6]<stderr>:    raise errors.InvalidArgumentError(
[1,6]<stderr>:tensorflow.python.framework.errors_impl.InvalidArgumentError: Arguments to _from_row_partition do not form a valid RaggedTensor
[1,6]<stderr>:Condition x == y did not hold.
[1,6]<stderr>:First 1 elements of x:
[1,6]<stderr>:[8192]
[1,6]<stderr>:First 1 elements of y:
[1,6]<stderr>:[2]
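
One possible reading of the assertion above (an interpretation, not confirmed by the maintainers): tf.RaggedTensor.from_row_lengths requires the row lengths to sum to the number of flat values. With --global_batch=65536 over 8 processes, each rank handles 65536 / 8 = 8192 samples, which matches the 8192 in the message. If x is the sum of the row lengths and y is the number of flat values, the flat values buffer held only 2 elements for that batch, for example because a read returned less data than expected. The helper below is a hypothetical pure-Python version of the same invariant, meant for checking a batch before it is handed to TensorFlow. The name check_row_partition is not part of SOK or TensorFlow.

```python
def check_row_partition(num_flat_values, row_lengths):
    """Raise ValueError unless row_lengths sum to num_flat_values.

    Hypothetical helper: mirrors the consistency condition that
    tf.RaggedTensor.from_row_lengths enforces on its arguments.
    """
    expected = sum(row_lengths)
    if expected != num_flat_values:
        raise ValueError(
            f"row_lengths sum to {expected}, "
            f"but there are {num_flat_values} flat values"
        )
    return expected


# A consistent batch: three values split into rows of length 1 and 2.
check_row_partition(3, [1, 2])

# The shape of the failure in the log: 8192 rows of length 1, only 2 values.
try:
    check_row_partition(2, [1] * 8192)
except ValueError as err:
    print(err)
```

Logging the batch index and the two counts whenever this check fails would show whether the mismatch always occurs at the same position in the test split.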

@kanghui0204
Collaborator

Hi @Orca-bit, is this bug reproducible every time? If so, I will try to reproduce it and then provide you with an answer. I will also test the issue mentioned in #463.

@Orca-bit
Author

@kanghui0204 Yes, it is reproducible. By the way, could you share the md5sums of the SOK split datasets? I have already checked the md5sums of the HugeCTR datasets, i.e. train.bin, test.bin and val.bin.
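
For comparing the split files across machines, a small script along these lines would produce the same digests as the md5sum command. It is a sketch using only the standard library; the function name md5_of_file is illustrative, and the file paths are whatever is passed on the command line.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Print one "digest  path" line per file given as an argument.
    for path in sys.argv[1:]:
        print(md5_of_file(path), path)
```

Reading in chunks keeps memory use flat even for multi-gigabyte dataset files.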

@kanghui0204
Collaborator

Hi @Orca-bit, can you try setting prefetch to 0 to check whether the error still occurs?
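
As background for that suggestion: the traceback shows a future being taken from a prefetch queue and its exception re-raised by .result(). The class below is a hypothetical, simplified loader, not the benchmark's dataset.py, that shows the two modes side by side. With prefetch set to 0 each batch is loaded in the calling thread. With prefetch above 0, batches are loaded ahead of time on a worker thread, and any error surfaces only when the consumer asks for the result. If the data itself is bad, the same batch should fail in both modes.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class PrefetchingLoader:
    """Hypothetical loader: calls load_batch(i) for i in range(num_batches)."""

    def __init__(self, load_batch, num_batches, prefetch=0):
        self._load_batch = load_batch
        self._num_batches = num_batches
        self._prefetch = prefetch
        self._pool = ThreadPoolExecutor(max_workers=1) if prefetch > 0 else None

    def __iter__(self):
        if self._pool is None:
            # prefetch == 0: load synchronously, errors raise at the call site.
            for i in range(self._num_batches):
                yield self._load_batch(i)
            return

        pending = queue.Queue()
        next_index = 0
        # Start up to `prefetch` loads before the first batch is consumed.
        while next_index < min(self._prefetch, self._num_batches):
            pending.put(self._pool.submit(self._load_batch, next_index))
            next_index += 1

        for _ in range(self._num_batches):
            future = pending.get()
            # Keep the queue topped up while batches remain.
            if next_index < self._num_batches:
                pending.put(self._pool.submit(self._load_batch, next_index))
                next_index += 1
            # A failure in the worker thread is re-raised here.
            yield future.result()
```

If the error disappears with prefetch disabled, the prefetch path itself becomes the suspect; if it persists at the same batch, the split data is the more likely cause.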
