
[Bug] Error while running tools/train.py: _pickle.UnpicklingError: pickle data was truncated #71

Open
Mintinson opened this issue Sep 6, 2024 · 12 comments

@Mintinson

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment


System environment:
sys.platform: linux
Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 793778121
GPU 0: NVIDIA A100-PCIE-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.58
GCC: gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:

  • GCC 7.3

  • C++ Version: 201402

  • Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications

  • Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)

  • OpenMP 201511 (a.k.a. OpenMP 4.5)

  • LAPACK is enabled (usually provided by MKL)

  • NNPACK is enabled

  • CPU capability usage: AVX2

  • CUDA Runtime 11.3

  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37

  • CuDNN 8.2

  • Magma 2.5.2

  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.12.0
OpenCV: 4.10.0
MMEngine: 0.10.4

Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 793778121
Distributed launcher: none
Distributed training: False
GPU number: 1

Reproduces the problem - code sample

python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py --work-dir=work_dirs/mv-3ddet

Reproduces the problem - command or script

python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py --work-dir=work_dirs/mv-3ddet

Reproduces the problem - error message

09/06 03:16:31 - mmengine - WARNING - Failed to search registry with scope "embodiedscan" in the "loop" registry tree. As a workaround, the current "loop" registry in "mmengine" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "embodiedscan" is a correct scope, or whether the registry is initialized.
09/06 03:16:31 - mmengine - WARNING - euler-depth is not a meta file, simply parsed as meta information
Traceback (most recent call last):
  File "tools/train.py", line 133, in <module>
    main()
  File "tools/train.py", line 129, in main
    runner.train()
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1728, in train
    self._train_loop = self.build_train_loop(
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1520, in build_train_loop
    loop = LOOPS.build(
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/loops.py", line 44, in __init__
    super().__init__(runner, dataloader)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/base_loop.py", line 26, in __init__
    self.dataloader = runner.build_dataloader(
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1370, in build_dataloader
    dataset = DATASETS.build(dataset_cfg)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/dataset/dataset_wrapper.py", line 223, in __init__
    self.dataset = DATASETS.build(dataset)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 59, in __init__
    super().__init__(ann_file=ann_file,
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/dataset/base_dataset.py", line 247, in __init__
    self.full_init()
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/dataset/base_dataset.py", line 298, in full_init
    self.data_list = self.load_data_list()
  File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 342, in load_data_list
    data_info = self.parse_data_info(raw_data_info)
  File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 147, in parse_data_info
    info['ann_info'] = self.parse_ann_info(info)
  File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 238, in parse_ann_info
    occ_masks = mmengine.load(mask_filename)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/fileio/io.py", line 856, in load
    obj = handler.load_from_fileobj(f, **kwargs)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/fileio/handlers/pickle_handler.py", line 12, in load_from_fileobj
    return pickle.load(file, **kwargs)
_pickle.UnpicklingError: pickle data was truncated
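For context, this error is what pickle raises when the stream ends before all the bytes it announces have arrived, which usually points to a partially downloaded or partially extracted file. A minimal sketch, independent of EmbodiedScan, reproducing the failure mode:

```python
import pickle

# Simulate a partially written .pkl file: serialize a payload, then cut
# the byte stream in half, as an interrupted download or extraction would.
full = pickle.dumps({"visible_occupancy": b"\x00" * 1024})
half = full[: len(full) // 2]

got_truncation_error = False
try:
    pickle.loads(half)  # raises UnpicklingError: pickle data was truncated
except pickle.UnpicklingError:
    got_truncation_error = True

print("truncated load failed as expected:", got_truncation_error)
```

Since the file on disk is simply incomplete, no code change can recover it; the fix is to obtain an intact copy of the annotation file.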

Additional information

No response

@mxh1999
Collaborator

mxh1999 commented Sep 6, 2024

It looks like the annotation file you downloaded is corrupted; try downloading it again.

@Mintinson
Author

Thanks for your answer!

I re-downloaded the dataset from Google Drive and re-ran the script extract_occupancy_ann.py, which reported that everything was fine, but training still fails with the same error.

I also noticed that the README under the data folder lists json files starting with embodiedscan_infos, while the files extracted from Google Drive start with embodiedscan. Does this matter? Do I have to rename these files?

By the way, is the following warning normal? If not, what should I do to get rid of it?

09/06 03:16:31 - mmengine - WARNING - Failed to search registry with scope "embodiedscan" in the "loop" registry tree. As a workaround, the current "loop" registry in "mmengine" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "embodiedscan" is a correct scope, or whether the registry is initialized.
09/06 03:16:31 - mmengine - WARNING - euler-depth is not a meta file, simply parsed as meta information

@mxh1999
Collaborator

mxh1999 commented Sep 6, 2024

@Mintinson
Could you please provide the sample_idx of this scene?
Just replace

occ_masks = mmengine.load(mask_filename)

with

try:
    occ_masks = mmengine.load(mask_filename)
except Exception:
    print(info['sample_idx'])
    raise

This will help us localize the problem.

@mxh1999 mxh1999 self-assigned this Sep 6, 2024
@Mintinson
Author

Here is the output:

scannet/scene0031_00
Traceback (most recent call last):
 ...

and here is the structure of the corresponding scene:

location: data/scannet/scans/scene0031_00/

scene0031_00
├── occupancy
│   ├── occupancy.npy
│   └── visible_occupancy.pkl
├── scene0031_00_2d-instance-filt.zip
├── scene0031_00_2d-instance.zip
├── scene0031_00_2d-label-filt.zip
├── scene0031_00_2d-label.zip
├── scene0031_00.aggregation.json
├── scene0031_00.sens
├── scene0031_00.txt
├── scene0031_00_vh_clean_2.0.010000.segs.json
├── scene0031_00_vh_clean_2.labels.ply
├── scene0031_00_vh_clean_2.ply
├── scene0031_00_vh_clean.aggregation.json
├── scene0031_00_vh_clean.ply
└── scene0031_00_vh_clean.segs.json

1 directory, 15 files

location: data/scannet/scans/posed_images/scene0031_00/

scene0031_00
├── 00000.jpg
├── 00000.png
├── 00000.txt
├── 00010.jpg
├── ...
├── 02750.txt
├── depth_intrinsic.txt
└── intrinsic.txt

location: data/embodiedscan_occupancy/scannet/scene0031_00/

scene0031_00
├── occupancy.npy
└── visible_occupancy.pkl

@mxh1999
Collaborator

mxh1999 commented Sep 6, 2024

@Mintinson
Could you please check the sha256 hashes of visible_occupancy.pkl and occupancy.npy?
The hash of visible_occupancy.pkl is 405f14770ab2126e24282977d5f897d1b35569bfea3f60431d63351def49ef3a and the hash of occupancy.npy is da1b32fd3753626401446669f6df3edd3530783e784a5edee01e56c78eb6b5d1.
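A small helper for computing these checksums, using only the standard library (the paths in the usage comment are hypothetical, based on the directory listing above):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks so large
    .pkl/.npy files never need to fit in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical usage -- compare against the values quoted above:
# sha256_of("data/scannet/scans/scene0031_00/occupancy/visible_occupancy.pkl")
# sha256_of("data/scannet/scans/scene0031_00/occupancy/occupancy.npy")
```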

@Mintinson
Author

Thank you so much for your help! I checked the hash of visible_occupancy.pkl and found that it was indeed different from the hash of the visible_occupancy.pkl inside embodiedscan_occupancy. I deleted the occupancy folder in the raw data and ran the script again:

python embodiedscan/converter/extract_occupancy_ann.py --src data/embodiedscan_occupancy --dst data

This time the file has the correct hash! I'm not sure what went wrong the first time I extracted these annotations, but now train.py runs without reporting errors!

I would also like to ask how much memory this project needs: when I run train.py, it gets killed because it runs out of memory.

@mxh1999
Collaborator

mxh1999 commented Sep 7, 2024

The memory problem is caused by the design of the mmengine dataloader, which copies the annotation files num_gpu * num_workers times. We are working on fixing this.

As a quick workaround, see #29 for details.

@Mintinson
Author

I tried the above solution, but it didn't work. Is 125 GB of RAM enough, or do I need more? I'd like to know soon so that I can switch to a larger server in time.

@mxh1999
Collaborator

mxh1999 commented Sep 7, 2024

It usually uses ~140 GB of RAM on my server. Maybe you can try setting fewer dataloader workers in the config?
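Since each dataloader worker process holds its own copy of the annotations, lowering num_workers directly reduces RAM use. A hedged sketch of the relevant override, assuming the usual MMEngine dataloader layout (the actual EmbodiedScan config's keys may differ):

```python
# Assumed MMEngine-style dataloader section; each worker process copies
# the annotation data, so RAM scales roughly with num_workers.
train_dataloader = dict(
    batch_size=4,             # per-GPU batch size; lowering it also cuts GPU memory
    num_workers=2,            # fewer workers -> fewer annotation copies in RAM
    persistent_workers=True,  # keep workers alive between epochs
    # dataset=... omitted here; keep the existing dataset settings unchanged
)
```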

@Mintinson
Author

I will try that. Thank you for your timely help~

@Mintinson
Author

I would like to ask why this project takes up so much RAM. All the projects I have worked on before used less than 30 GB when loading data, so why does this one reach hundreds? Also, what are the GPU memory requirements for this project, so that I can allocate hardware resources in time?

@mxh1999
Collaborator

mxh1999 commented Sep 10, 2024

I apologize for the RAM usage problem. We are working on fixing it.
For GPU memory, the default setting of the EmbodiedScan detection task, e.g. mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py, requires ~20 GB of GPU memory. It can be reduced further by decreasing the batch size.

PS: The default setting uses ~600 GB of RAM in total. I'm sorry for the previous incorrect response.
