The Neuronx Distributed Llama2 7B PyTorch Lightning example hits a fatal error when trying to save a checkpoint after 100 global steps.
Logs are included below:
...
Epoch 0: 41%|████▏ | 6144/14876 [1:06:58<1:35:10, 1.53it/s, v_num=0, loss=6.000, lr=0.000285, input_ids=5.21e+7, throughput=24.60, global_step_step=95.00]step 96 loss is 6.002302169799805, lr is 0.00028799999999999995, throughput 24.631530041674928 seq/s, input_ids 35873595, norm tensor([3.2969], device='xla:0'), global rank 0
Epoch 0: 42%|████▏ | 6208/14876 [1:07:39<1:34:28, 1.53it/s, v_num=0, loss=6.000, lr=0.000288, input_ids=3.59e+7, throughput=24.60, global_step_step=96.00]step 97 loss is 5.956272125244141, lr is 0.00029099999999999997, throughput 24.63089836600094 seq/s, input_ids 37864961, norm tensor([3.1562], device='xla:0'), global rank 0
Epoch 0: 42%|████▏ | 6272/14876 [1:08:21<1:33:46, 1.53it/s, v_num=0, loss=5.960, lr=0.000291, input_ids=3.79e+7, throughput=24.60, global_step_step=97.00]step 98 loss is 5.935997486114502, lr is 0.000294, throughput 24.63518937768065 seq/s, input_ids 54373382, norm tensor([2.7344], device='xla:0'), global rank 0
Epoch 0: 43%|████▎ | 6336/14876 [1:09:02<1:33:03, 1.53it/s, v_num=0, loss=5.940, lr=0.000294, input_ids=5.44e+7, throughput=24.60, global_step_step=98.00]step 99 loss is 5.935565948486328, lr is 0.00029699999999999996, throughput 24.63476866876432 seq/s, input_ids 39750648, norm tensor([3.1250], device='xla:0'), global rank 0
Epoch 0: 43%|████▎ | 6400/14876 [1:09:44<1:32:21, 1.53it/s, v_num=0, loss=5.940, lr=0.000297, input_ids=3.98e+7, throughput=24.60, global_step_step=99.00][2024-04-10 16:02:47.461: I neuronx_distributed/parallel_layers/checkpointing.py:75] saving checkpoint to /efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v1.ckpt
Three workers failed with the same traceback while writing their tensor shards (only the shard path in the final FileNotFoundError differs):

Traceback (most recent call last):
  File "/tmp/tmp/pytorchjob-nxd-llama2-7b-ptl-master-0/examples/training/llama2/lightning/run_llama_nxd_ptl.py", line 373, in <module>
    _mp_fn(0, args)
  File "/tmp/tmp/pytorchjob-nxd-llama2-7b-ptl-master-0/examples/training/llama2/lightning/run_llama_nxd_ptl.py", line 224, in _mp_fn
    train_llama(args)
  File "/tmp/tmp/pytorchjob-nxd-llama2-7b-ptl-master-0/examples/training/llama2/lightning/run_llama_nxd_ptl.py", line 218, in train_llama
    trainer.fit(model=model, datamodule=dm)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/neuronx_distributed/lightning/launcher.py", line 71, in launch
    results = function(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1036, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 259, in advance
    call._call_callback_hooks(trainer, "on_train_batch_end", batch_output, batch, batch_idx)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 303, in on_train_batch_end
    self._save_topk_checkpoint(trainer, monitor_candidates)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 368, in _save_topk_checkpoint
    self._save_monitor_checkpoint(trainer, monitor_candidates)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 681, in _save_monitor_checkpoint
    self._update_best_and_save(current, trainer, monitor_candidates)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 733, in _update_best_and_save
    self._save_checkpoint(trainer, filepath)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 373, in _save_checkpoint
    trainer.save_checkpoint(filepath, self.save_weights_only)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1384, in save_checkpoint
    self.strategy.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
  File "/usr/local/lib/python3.10/site-packages/neuronx_distributed/lightning/strategy.py", line 195, in save_checkpoint
    self.checkpoint_io.save_checkpoint(
  File "/usr/local/lib/python3.10/site-packages/neuronx_distributed/lightning/checkpoint_io.py", line 67, in save_checkpoint
    save(
  File "/usr/local/lib/python3.10/site-packages/neuronx_distributed/parallel_layers/checkpointing.py", line 100, in save
    xser.save(checkpoint, chkpt_path, (not master_only), global_master=True)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/utils/serialization.py", line 74, in save
    ref_data = _rewrite_data(_get_tensors_folder(path), data, should_write_data)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/utils/serialization.py", line 42, in _rewrite_data
    os.mkdir(path)
FileNotFoundError: [Errno 2] No such file or directory: '/efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt/tp_rank_01_pp_rank_00_dp_rank_00.tensors'

The other two workers raised the same FileNotFoundError for
'/efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt/tp_rank_03_pp_rank_00_dp_rank_00.tensors' and
'/efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt/tp_rank_04_pp_rank_00_dp_rank_00.tensors'.
[2024-04-10 16:02:56,192] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 252 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 254 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 257 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 258 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 259 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 260 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 261 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 262 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 263 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 264 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 265 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 266 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 267 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 268 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 269 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 270 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 271 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 272 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 273 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 274 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 275 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 276 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 277 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 278 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 279 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 280 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 281 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 282 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 283 closing signal SIGTERM
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 8] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 30] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 25] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 31] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 28] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 14] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 27] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 9] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 26] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 19] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 7] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 22] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 13] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 16] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 29] Received SIGTERM: 15
[2024-04-10 16:03:26,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 252 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 10] Received SIGTERM: 15
[2024-04-10 16:03:40,240] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 254 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 24] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 23] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 17] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 20] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 6] Received SIGTERM: 15
[2024-04-10 16:03:50,705] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 257 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:03:51,563] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 258 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:03:53,691] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 259 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 11] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 18] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 12] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 15] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 21] Received SIGTERM: 15
[2024-04-10 16:04:01,468] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 260 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:04:02,309] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 261 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:04:10,826] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 262 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:04:21,412] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 263 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:04:34,970] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 264 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:04:36,068] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 265 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:00,717] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 266 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:01,508] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 267 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:03,732] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 268 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:06,853] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 269 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:23,834] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 270 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:44,821] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 271 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:04,373] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 272 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:05,174] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 273 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:05,954] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 274 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:07,961] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 275 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:08,895] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 276 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:09,811] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 277 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:12,247] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 278 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:20,465] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 279 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:21,228] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 280 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:22,496] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 281 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:23,269] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 282 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:24,205] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 283 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:27,085] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 253) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_llama_nxd_ptl.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-04-10_16:02:56
host : pytorchjob-nxd-llama2-7b-ptl-master-0
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 255)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-04-10_16:02:56
host : pytorchjob-nxd-llama2-7b-ptl-master-0
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 256)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-10_16:02:56
host : pytorchjob-nxd-llama2-7b-ptl-master-0
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 253)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@attach-pvc:/efs/home/nxd-llama2-7b-ptl/logs/0# ls -al /efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt/tp_rank_01_pp_rank_00_dp_rank_00.tensors
ls: cannot access '/efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt/tp_rank_01_pp_rank_00_dp_rank_00.tensors': No such file or directory
root@attach-pvc:/efs/home/nxd-llama2-7b-ptl/logs/0# ls -al /efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt
ls: cannot access '/efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt': No such file or directory
root@attach-pvc:/efs/home/nxd-llama2-7b-ptl/logs/0# ls -al /efs/home/nxd-llama2-7b-ptl/checkpoints/
total 24
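
For context on the failure: torch_xla's serialization (_rewrite_data at serialization.py:42 in the traceback) creates the per-rank *.tensors folder with a plain os.mkdir, which raises exactly this [Errno 2] when the parent epoch=0-step=100-v2.ckpt directory does not exist at the time of the call, and the ls output above shows that directory is indeed absent. Note also that rank 0 logged saving to .../epoch=0-step=100-v1.ckpt while the failing ranks tried to write under ...-v2.ckpt, so the workers appear to disagree on the versioned filename resolved from the shared EFS directory. A minimal illustration of the os.mkdir behaviour (hypothetical temp paths, unrelated to the actual run):

import os
import tempfile

# Hypothetical paths for illustration only -- not the paths from the run above.
with tempfile.TemporaryDirectory() as checkpoints_root:
    ckpt_dir = os.path.join(checkpoints_root, "epoch=0-step=100-v2.ckpt")
    shard_tensors = os.path.join(ckpt_dir, "tp_rank_01_pp_rank_00_dp_rank_00.tensors")

    # ckpt_dir is deliberately not created, mirroring the state seen on EFS.
    try:
        os.mkdir(shard_tensors)  # same call pattern as _rewrite_data() in serialization.py
    except FileNotFoundError as err:
        print(err)  # [Errno 2] No such file or directory: '....tensors'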
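
A defensive workaround I am experimenting with, shared in case it helps triage: an untested sketch under the assumption that the missing parent directory is the immediate cause (not a confirmed fix; the subclass name below is made up for the sketch). It has every rank create the checkpoint directory itself before Lightning hands the path to the NxD strategy and xser.save.

import os
from pytorch_lightning.callbacks import ModelCheckpoint


class DirSafeModelCheckpoint(ModelCheckpoint):
    """Sketch: pre-create the checkpoint directory on every rank.

    neuronx_distributed treats the *.ckpt path as a directory and writes one
    tp/pp/dp-rank shard inside it, so creating it with exist_ok=True should be
    harmless even if another rank created it first.
    """

    def _save_checkpoint(self, trainer, filepath: str) -> None:
        os.makedirs(filepath, exist_ok=True)
        super()._save_checkpoint(trainer, filepath)

Even with that in place, the -v1/-v2 filename disagreement between ranks would remain, so the underlying race (each rank resolving its own version suffix against the shared EFS directory) probably still needs to be addressed in the example or in neuronx_distributed.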