How to mount GPU devices correctly in nsjail? #237

Open
radkris-git opened this issue Aug 12, 2024 · 2 comments

Comments

@radkris-git

Hi, I'm trying to run a simple PyTorch tensor add on the GPU under nsjail on a GCP nvidia-tesla-t4 node, and I'm getting the error shown below.

nsjail_pytorch.cfg

mount {
  src: "/home/current_user_ldap/pytorch_env"
  dst: "/home/current_user_ldap/pytorch_env"
  is_bind: true
}
mount {
  src: "/dev/nvidia0"
  dst: "/dev/nvidia0"
  is_bind: true
  rw: true
}
mount {
  src: "/dev/nvidiactl"
  dst: "/dev/nvidiactl"
  is_bind: true
  rw: true
}
mount {
  src: "/dev/nvidia-uvm"
  dst: "/dev/nvidia-uvm"
  is_bind: true
  rw: true
}
mount {
  src: "/usr"
  dst: "/usr"
  is_bind: true
  rw: true
}
# for libs
mount {
  src: "/lib64"
  dst: "/lib64"
  is_bind: true
}
mount {
  src: "/lib"
  dst: "/lib"
  is_bind: true
  rw: true
}
cwd: "/home/current_user_ldap/pytorch_env/"

Running simple PyTorch Tensor Add on CPU works.

nsjail -Mo --chroot /   --rlimit_nproc 6553   --rlimit_fsize inf --rlimit_as inf   -- /usr/bin/python3 -c "import torch; a = torch.tensor([1.0, 2.0], device='cpu') + torch.tensor([3.0, 4.0], device='cpu'); print(a)" 

This prints the expected tensor output, [4, 6].

Running the same check on the GPU fails: torch.cuda.is_available() returns False.

nsjail -Mo --config nsjail_pytorch.cfg  --chroot /  --rlimit_nproc 6553   --rlimit_fsize inf --rlimit_as inf    -- /usr/bin/python3 -c "import torch; print(torch.cuda.is_available());"
[I][2024-08-10T02:03:04+0000] Mode: STANDALONE_ONCE
[I][2024-08-10T02:03:04+0000] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/usr/bin/python3', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:600, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2024-08-10T02:03:04+0000] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/home/current_user_ldap/pytorch_env' -> '/home/current_user_ldap/pytorch_env' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/dev/nvidia0' -> '/dev/nvidia0' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:false
[I][2024-08-10T02:03:04+0000] Mount: '/dev/nvidiactl' -> '/dev/nvidiactl' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:false
[I][2024-08-10T02:03:04+0000] Mount: '/dev/nvidia-uvm' -> '/dev/nvidia-uvm' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:false
[I][2024-08-10T02:03:04+0000] Mount: '/usr' -> '/usr' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/lib64' -> '/lib64' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/lib' -> '/lib' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Uid map: inside_uid:1002 outside_uid:1002 count:1 newuidmap:false
[I][2024-08-10T02:03:04+0000] Gid map: inside_gid:1003 outside_gid:1003 count:1 newgidmap:false
[I][2024-08-10T02:03:06+0000] Executing '/usr/bin/python3' for '[STANDALONE MODE]'
/home/current_user_ldap/.local/lib/python3.9/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
[I][2024-08-10T02:03:08+0000] pid=28434 ([STANDALONE MODE]) exited with status: 0, (PIDs left: 0)
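
Error 304 in the warning above is the CUDA "operating system" error code (the message text matches it), which suggests a failing system call rather than a PyTorch problem. To take PyTorch out of the picture, the driver can be probed directly. The following is a minimal sketch, assuming the driver library is installed under the usual name `libcuda.so.1`; the helper name `probe_cuda_driver` is made up for this example.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit(0) directly and return the raw CUresult.

    Returns None if the driver library cannot be loaded at all.
    `lib_name` is an assumption about how the driver is installed.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None
    # CUresult cuInit(unsigned int flags)
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)


if __name__ == "__main__":
    result = probe_cuda_driver()
    if result is None:
        print("libcuda.so.1 could not be loaded")
    elif result == 0:
        print("cuInit succeeded")
    else:
        print(f"cuInit failed with CUresult {result}")
```

If this also reports 304 inside the jail but 0 outside it, the failure is in driver initialization under nsjail, not in PyTorch.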

nvidia-smi runs fine under nsjail:

nsjail -Mo --config nsjail_pytorch.cfg  --chroot /  --rlimit_nproc 6553 --rlimit_as inf   -- /bin/nvidia-smi

The above prints the actual nvidia-smi output successfully.

Notes

  • PyTorch on CPU works fine under nsjail (no issues)
  • nvidia-smi works under nsjail
  • Running PyTorch without nsjail on GPU succeeds.

This doesn't look like a PyTorch or host issue, given that PyTorch works on the GPU without nsjail. Any help is appreciated.
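
One thing worth comparing is which NVIDIA device nodes are visible and accessible inside the jail versus on the host, since the config only binds three of them explicitly. A minimal sketch of such a check follows; the function name `check_device_nodes` and the `/dev/nvidia*` glob pattern are assumptions for illustration, not part of nsjail or the NVIDIA tooling.

```python
import glob
import os
import stat


def check_device_nodes(pattern="/dev/nvidia*"):
    """List device nodes matching `pattern` with basic access information."""
    report = []
    for path in sorted(glob.glob(pattern)):
        entry = {"path": path, "is_char_device": False, "read_write": False}
        try:
            mode = os.stat(path).st_mode
        except OSError:
            # The node is listed but cannot be stat'ed from this process.
            report.append(entry)
            continue
        entry["is_char_device"] = stat.S_ISCHR(mode)
        # os.access only checks permissions; a later open() can still fail.
        entry["read_write"] = os.access(path, os.R_OK | os.W_OK)
        report.append(entry)
    return report


if __name__ == "__main__":
    for entry in check_device_nodes():
        print(entry)
```

Running the script once on the host and once through the same nsjail invocation, then diffing the two outputs, shows whether any node is missing or not readable and writable inside the jail.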

@etai-shuchatowitz

Hi!

I was wondering if you ever figured this out? Running into this issue myself.

@etai-shuchatowitz

I can't vouch for whether this works with PyTorch, since I'm using TensorFlow myself, but I was able to get things working by adding

clone_newnet: false
clone_newuser: false
clone_newns: false
clone_newpid: false
clone_newipc: false
clone_newuts: false
clone_newcgroup: false

to my nsjail.cfg file.

Source: #232 (comment)
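
Each of the clone_new* options above controls one Linux namespace, so turning them all off shares every one of those namespaces with the host. To find out which one actually matters for CUDA, it may help to re-enable them one at a time and confirm from inside the jail which namespaces are still shared. The sketch below is one way to report that, assuming `/proc` is readable inside the jail; the helper name `namespace_ids` is hypothetical.

```python
import os

# Namespace names as they appear under /proc/<pid>/ns, in the same order
# as the clone_new* options above (clone_newns is the mount namespace).
NAMESPACES = ("net", "user", "mnt", "pid", "ipc", "uts", "cgroup")


def namespace_ids(pid="self"):
    """Return a dict mapping namespace name to its identifier string.

    A value of None means the link could not be read, for example
    because /proc is not available or the pid does not exist.
    """
    ids = {}
    for name in NAMESPACES:
        try:
            ids[name] = os.readlink(f"/proc/{pid}/ns/{name}")
        except OSError:
            ids[name] = None
    return ids


if __name__ == "__main__":
    for name, ident in namespace_ids().items():
        print(f"{name}: {ident}")
```

An identifier that is the same on the host and inside the jail means that namespace is shared; a different identifier means nsjail created a new one.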
