Error 1:
npciusr@ch2npciaih100g01:/npcipfs/data/rkalani/v6_Mistral_24B_LoRA/v6_Mistral_24B_LoRA_1755697821/lora_training$ cat log-sample-sample.lora_training_200_0.out
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: nvidia-container-cli: detection error: nvml error: unknown error
slurmstepd: error: pyxis: [ERROR] /etc/enroot/hooks.d/[Link] exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: * STEP 200.1 ON ch2npciaih100g09 CANCELLED AT 2025-08-20T[Link] *
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Error 2:
Epoch 19: | | 10/? [00:02<00:00, 3.62it/s, train_step_timing in s=0.181, reduced_train_loss=6.720, tps=2.2e+5, lr=2e-5][Link] stopped: max_steps=20 reached.
Epoch 19: | | 10/? [00:02<00:00, 3.62it/s, train_step_timing in s=0.181, reduced_train_loss=6.720, tps=2.2e+5, lr=2e-5]
[rank47]:[W820 [Link].878765964 [Link]] [c10d] recvValue failed on SocketImpl(fd=98, addr=[[Link]]:34720, remote=[[Link]]:15201): failed to recv, got 0 bytes
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/[Link] (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x88 (0x1554dfa73568 in /usr/local/lib/python3.12/dist-packages/torch/lib/[Link])
frame #1: <unknown function> + 0x599d02e (0x15553b86702e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x599f260 (0x15553b869260 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x599fb6a (0x15553b869b6a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x2a9 (0x15553b863629 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x371 (0x1554e0838961 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xecdb4 (0x1554c59cddb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x9caa4 (0x155555260aa4 in /usr/lib/x86_64-linux-gnu/[Link].6)
frame #8: <unknown function> + 0x129c3c (0x1555552edc3c in /usr/lib/x86_64-linux-gnu/[Link].6)
[rank47]:[W820 [Link].883910143 [Link]] [PG ID 0 PG GUID 0(default_pg) Rank 47] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
[rank33]:[W820 [Link].967345972 [Link]] [c10d] recvValue failed on SocketImpl(fd=98, addr=[[Link]]:56360, remote=[[Link]]:15201): failed to recv, got 0 bytes
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/[Link] (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x88 (0x1554dfa73568 in /usr/local/lib/python3.12/dist-packages/torch/lib/[Link])
frame #1: <unknown function> + 0x599d02e (0x15553b86702e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x599f260 (0x15553b869260 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x599fb6a (0x15553b869b6a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x2a9 (0x15553b863629 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x371 (0x1554e0838961 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xecdb4 (0x1554c59cddb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x9caa4 (0x155555260aa4 in /usr/lib/x86_64-linux-gnu/[Link].6)
frame #8: <unknown function> + 0x129c3c (0x1555552edc3c in /usr/lib/x86_64-linux-gnu/[Link].6)
[rank33]:[W820 [Link].973613615 [Link]] [PG ID 0 PG GUID 0(default_pg) Rank 33] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
[rank44]:[W820 [Link].074096526 [Link]] [c10d] recvValue failed on SocketImpl(fd=98, addr=[[Link]]:34742, remote=[[Link]]:15201): failed to recv, got 0 bytes
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/[Link] (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x88 (0x1554dfa73568 in /usr/local/lib/python3.12/dist-packages/torch/lib/[Link])
frame #1: <unknown function> + 0x599d02e (0x15553b86702e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x599f260 (0x15553b869260 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x599fb6a (0x15553b869b6a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x2a9 (0x15553b863629 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x371 (0x1554e0838961 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xecdb4 (0x1554c59cddb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x9caa4 (0x155555260aa4 in /usr/lib/x86_64-linux-gnu/[Link].6)
frame #8: <unknown function> + 0x129c3c (0x1555552edc3c in /usr/lib/x86_64-linux-gnu/[Link].6)
[rank44]:[W820 [Link].080377997 [Link]] [PG ID 0 PG GUID 0(default_pg) Rank 44] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received