
[Bug] Error while running tools/train.py: _pickle.UnpicklingError: pickle data was truncated #71

Open
Mintinson opened this issue Sep 6, 2024 · 12 comments

@Mintinson

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment


System environment:
sys.platform: linux
Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 793778121
GPU 0: NVIDIA A100-PCIE-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.58
GCC: gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:

  • GCC 7.3

  • C++ Version: 201402

  • Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications

  • Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)

  • OpenMP 201511 (a.k.a. OpenMP 4.5)

  • LAPACK is enabled (usually provided by MKL)

  • NNPACK is enabled

  • CPU capability usage: AVX2

  • CUDA Runtime 11.3

  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37

  • CuDNN 8.2

  • Magma 2.5.2

  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.12.0
OpenCV: 4.10.0
MMEngine: 0.10.4

Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 793778121
Distributed launcher: none
Distributed training: False
GPU number: 1

Reproduces the problem - code sample

python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py --work-dir=work_dirs/mv-3ddet

Reproduces the problem - command or script

python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py --work-dir=work_dirs/mv-3ddet

Reproduces the problem - error message

09/06 03:16:31 - mmengine - WARNING - Failed to search registry with scope "embodiedscan" in the "loop" registry tree. As a workaround, the current "loop" registry in "mmengine" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "embodiedscan" is a correct scope, or whether the registry is initialized.
09/06 03:16:31 - mmengine - WARNING - euler-depth is not a meta file, simply parsed as meta information
Traceback (most recent call last):
  File "tools/train.py", line 133, in <module>
    main()
  File "tools/train.py", line 129, in main
    runner.train()
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1728, in train
    self._train_loop = self.build_train_loop(
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1520, in build_train_loop
    loop = LOOPS.build(
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/loops.py", line 44, in __init__
    super().__init__(runner, dataloader)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/base_loop.py", line 26, in __init__
    self.dataloader = runner.build_dataloader(
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1370, in build_dataloader
    dataset = DATASETS.build(dataset_cfg)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/dataset/dataset_wrapper.py", line 223, in __init__
    self.dataset = DATASETS.build(dataset)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 59, in __init__
    super().__init__(ann_file=ann_file,
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/dataset/base_dataset.py", line 247, in __init__
    self.full_init()
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/dataset/base_dataset.py", line 298, in full_init
    self.data_list = self.load_data_list()
  File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 342, in load_data_list
    data_info = self.parse_data_info(raw_data_info)
  File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 147, in parse_data_info
    info['ann_info'] = self.parse_ann_info(info)
  File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 238, in parse_ann_info
    occ_masks = mmengine.load(mask_filename)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/fileio/io.py", line 856, in load
    obj = handler.load_from_fileobj(f, **kwargs)
  File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/fileio/handlers/pickle_handler.py", line 12, in load_from_fileobj
    return pickle.load(file, **kwargs)
_pickle.UnpicklingError: pickle data was truncated
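For context, this error is what pickle raises when the stream ends before all the bytes it announces have arrived, which usually points to a partially downloaded or partially extracted file. A minimal sketch, independent of EmbodiedScan, reproducing the failure mode:

```python
import pickle

# Simulate a partially written .pkl file: serialize a payload, then cut
# the byte stream in half, as an interrupted download or extraction would.
full = pickle.dumps({"visible_occupancy": b"\x00" * 1024})
half = full[: len(full) // 2]

got_truncation_error = False
try:
    pickle.loads(half)  # raises UnpicklingError: pickle data was truncated
except pickle.UnpicklingError:
    got_truncation_error = True

print("truncated load failed as expected:", got_truncation_error)
```

Since the file on disk is simply incomplete, no code change can recover it; the fix is to obtain an intact copy of the annotation file.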

Additional information

No response

@mxh1999
Collaborator

mxh1999 commented Sep 6, 2024

It looks like the annotation file you downloaded is corrupted; try downloading it again.

@Mintinson
Author

Thanks for your answer!

I re-downloaded the dataset from Google Drive and re-ran the script extract_occupancy_ann.py, which reported that everything was fine, but training still fails with the same error.

I also noticed that the README under the data folder lists json files starting with embodiedscan_infos, while the files extracted from Google Drive start with embodiedscan. Does this matter? Do I have to rename these files?

By the way, is the following warning normal? If not, what should I do to get rid of it?

09/06 03:16:31 - mmengine - WARNING - Failed to search registry with scope "embodiedscan" in the "loop" registry tree. As a workaround, the current "loop" registry in "mmengine" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "embodiedscan" is a correct scope, or whether the registry is initialized.
09/06 03:16:31 - mmengine - WARNING - euler-depth is not a meta file, simply parsed as meta information

@mxh1999
Collaborator

mxh1999 commented Sep 6, 2024

@Mintinson
Could you please provide the sample_idx of this scene?
Just replace

occ_masks = mmengine.load(mask_filename)

with

try:
    occ_masks = mmengine.load(mask_filename)
except Exception:
    print(info['sample_idx'])
    raise

This will help us localize the problem.

@mxh1999 mxh1999 self-assigned this Sep 6, 2024
@Mintinson
Author

Here is the output:

scannet/scene0031_00
Traceback (most recent call last):
 ...

and here is the structure of the corresponding scene:

location: data/scannet/scans/scene0031_00/

scene0031_00
├── occupancy
│   ├── occupancy.npy
│   └── visible_occupancy.pkl
├── scene0031_00_2d-instance-filt.zip
├── scene0031_00_2d-instance.zip
├── scene0031_00_2d-label-filt.zip
├── scene0031_00_2d-label.zip
├── scene0031_00.aggregation.json
├── scene0031_00.sens
├── scene0031_00.txt
├── scene0031_00_vh_clean_2.0.010000.segs.json
├── scene0031_00_vh_clean_2.labels.ply
├── scene0031_00_vh_clean_2.ply
├── scene0031_00_vh_clean.aggregation.json
├── scene0031_00_vh_clean.ply
└── scene0031_00_vh_clean.segs.json

1 directory, 15 files

location: data/scannet/scans/posed_images/scene0031_00/

scene0031_00
├── 00000.jpg
├── 00000.png
├── 00000.txt
├── 00010.jpg
├── ...
├── 02750.txt
├── depth_intrinsic.txt
└── intrinsic.txt

location: data/embodiedscan_occupancy/scannet/scene0031_00/

scene0031_00
├── occupancy.npy
└── visible_occupancy.pkl

@mxh1999
Collaborator

mxh1999 commented Sep 6, 2024

@Mintinson
Could you please check the sha256 hashes of visible_occupancy.pkl and occupancy.npy?
The hash of visible_occupancy.pkl is 405f14770ab2126e24282977d5f897d1b35569bfea3f60431d63351def49ef3a and the hash of occupancy.npy is da1b32fd3753626401446669f6df3edd3530783e784a5edee01e56c78eb6b5d1.
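A small helper for computing these checksums, using only the standard library (the paths in the usage comment are hypothetical, based on the directory listing above):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks so large
    .pkl/.npy files never need to fit in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical usage -- compare against the values quoted above:
# sha256_of("data/scannet/scans/scene0031_00/occupancy/visible_occupancy.pkl")
# sha256_of("data/scannet/scans/scene0031_00/occupancy/occupancy.npy")
```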

@Mintinson
Author

Thank you so much for your help! I checked the hash of visible_occupancy.pkl and found that it was indeed different from the hash of the visible_occupancy.pkl inside embodiedscan_occupancy. I deleted the occupancy folder in the raw data and ran the script again:

python embodiedscan/converter/extract_occupancy_ann.py --src data/embodiedscan_occupancy --dst data

This time the file has the correct hash! I'm not sure what went wrong the first time I extracted these annotations, but now train.py runs without reporting errors!

I would also like to ask how much memory this project needs: when I run train.py, it gets killed because it runs out of memory.

@mxh1999
Collaborator

mxh1999 commented Sep 7, 2024

The memory problem is caused by the design of the mmengine dataloader, which copies the annotation files num_gpu * num_workers times. We are working on fixing this.

As a quick workaround, see #29 for details.

@Mintinson
Author

I tried the above solution, but it didn't work. Is 125 GB of RAM enough, or do I need more? I'd like to know soon so that I can switch to a larger server in time.

@mxh1999
Collaborator

mxh1999 commented Sep 7, 2024

It usually uses ~140 GB of RAM on my server. Maybe you can try setting fewer dataloader workers in the config?
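Since each dataloader worker process holds its own copy of the annotations, lowering num_workers directly reduces RAM use. A hedged sketch of the relevant override, assuming the usual MMEngine dataloader layout (the actual EmbodiedScan config's keys may differ):

```python
# Assumed MMEngine-style dataloader section; each worker process copies
# the annotation data, so RAM scales roughly with num_workers.
train_dataloader = dict(
    batch_size=4,             # per-GPU batch size; lowering it also cuts GPU memory
    num_workers=2,            # fewer workers -> fewer annotation copies in RAM
    persistent_workers=True,  # keep workers alive between epochs
    # dataset=... omitted here; keep the existing dataset settings unchanged
)
```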

@Mintinson
Author

I will try that. Thank you for your timely help~

@Mintinson
Author

I would like to ask why this project takes up so much RAM. All the projects I have worked on before used less than 30 GB when loading data, so why does this one reach hundreds? Also, what are the GPU memory requirements for this project, so that I can allocate hardware resources in time?

@mxh1999
Collaborator

mxh1999 commented Sep 10, 2024

I apologize for the RAM usage problem. We are working on fixing it.
For GPU memory, the default setting of the EmbodiedScan detection task, e.g. mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py, requires ~20 GB of GPU memory. It can be reduced further by decreasing the batch size.

PS: The default setting uses ~600 GB of RAM in total. I'm sorry for the previous incorrect response.
