You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
frustrated after training about 1654/ba it corrupted, failed to save the checkpoint, tried two times.
Error as follows:
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=39739, OpType=ALLREDUCE, Timeout(ms)=300000) ran for 302714 milliseconds before timing out.
train 4%|▉ /home/anaconda3/envs/control/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
----------End global rank 3 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 35121) has still not exited; return exit code 1.
The text was updated successfully, but these errors were encountered:
frustrated after training about 1654/ba it corrupted, failed to save the checkpoint, tried two times.
Error as follows:
The text was updated successfully, but these errors were encountered: