Training Error

The document contains error logs from a training session on a Slurm cluster. Error 1 shows the Pyxis container failing to start after nvidia-container-cli reports an NVML detection error, followed by the job step being cancelled. Error 2 shows training stopping once max_steps=20 is reached, followed by TCPStore recv failures reported by the NCCL heartbeat monitor on ranks 33, 44, and 47 and by failed MESSAGE_TASK_EXIT transmissions.

Uploaded by kukrejanitin17
Copyright © All Rights Reserved

Error 1:

npciusr@ch2npciaih100g01:/npcipfs/data/rkalani/v6_Mistral_24B_LoRA/v6_Mistral_24B_LoRA_1755697821/lora_training$ cat log-sample-sample.lora_training_200_0.out
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: nvidia-container-cli: detection error: nvml error: unknown error
slurmstepd: error: pyxis: [ERROR] /etc/enroot/hooks.d/[Link] exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: * STEP 200.1 ON ch2npciaih100g09 CANCELLED AT 2025-08-20T[Link] *
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
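
The log above repeats the same three slurmstepd messages once per task, which makes the first failure (the NVML detection error) easy to miss. As a rough triage aid, the following Python sketch counts distinct slurmstepd error messages in first-seen order. The function name `summarize_slurm_errors` and the shortened sample are illustrative assumptions, not part of the original job, and the helper assumes wrapped log lines have already been joined onto one line each.

```python
import re
from collections import Counter

# Matches lines such as "slurmstepd: error: pyxis: couldn't start container".
SLURM_ERROR = re.compile(r"^slurmstepd: error: (?P<msg>.+)$")


def summarize_slurm_errors(log_text: str) -> Counter:
    """Count distinct slurmstepd error messages, keeping first-seen order."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = SLURM_ERROR.match(line.strip())
        if match:
            counts[match.group("msg")] += 1
    return counts


# Shortened excerpt of the Error 1 log (hypothetical subset, lines unwrapped).
sample = """\
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: nvidia-container-cli: detection error: nvml error: unknown error
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: Failed to invoke spank plugin stack
"""

summary = summarize_slurm_errors(sample)
for message, count in summary.items():
    # Messages seen only once near the top are usually the ones worth reading first.
    print(f"{count:3d}  {message}")
```

Because `Counter` preserves insertion order, the one-off messages at the top of the summary point at where the failure chain begins, while the high-count entries are the per-task repeats.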

Error 2:

Epoch 19: | | 10/? [00:02<00:00, 3.62it/s, train_step_timing in s=0.181, reduced_train_loss=6.720, tps=2.2e+5, lr=2e-5][Link] stopped: max_steps=20 reached.
Epoch 19: | | 10/? [00:02<00:00, 3.62it/s, train_step_timing in s=0.181, reduced_train_loss=6.720, tps=2.2e+5, lr=2e-5]
[rank47]:[W820 [Link].878765964 [Link]] [c10d] recvValue failed on SocketImpl(fd=98, addr=[[Link]]:34720, remote=[[Link]]:15201): failed to recv, got 0 bytes
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/[Link] (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x88 (0x1554dfa73568 in /usr/local/lib/python3.12/dist-packages/torch/lib/[Link])
frame #1: <unknown function> + 0x599d02e (0x15553b86702e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x599f260 (0x15553b869260 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x599fb6a (0x15553b869b6a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x2a9 (0x15553b863629 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x371 (0x1554e0838961 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xecdb4 (0x1554c59cddb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x9caa4 (0x155555260aa4 in /usr/lib/x86_64-linux-gnu/[Link].6)
frame #8: <unknown function> + 0x129c3c (0x1555552edc3c in /usr/lib/x86_64-linux-gnu/[Link].6)

[rank47]:[W820 [Link].883910143 [Link]] [PG ID 0 PG GUID 0(default_pg) Rank 47] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
[rank33]:[W820 [Link].967345972 [Link]] [c10d] recvValue failed on SocketImpl(fd=98, addr=[[Link]]:56360, remote=[[Link]]:15201): failed to recv, got 0 bytes
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/[Link] (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x88 (0x1554dfa73568 in /usr/local/lib/python3.12/dist-packages/torch/lib/[Link])
frame #1: <unknown function> + 0x599d02e (0x15553b86702e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x599f260 (0x15553b869260 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x599fb6a (0x15553b869b6a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x2a9 (0x15553b863629 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x371 (0x1554e0838961 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xecdb4 (0x1554c59cddb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x9caa4 (0x155555260aa4 in /usr/lib/x86_64-linux-gnu/[Link].6)
frame #8: <unknown function> + 0x129c3c (0x1555552edc3c in /usr/lib/x86_64-linux-gnu/[Link].6)

[rank33]:[W820 [Link].973613615 [Link]] [PG ID 0 PG GUID 0(default_pg) Rank 33] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
[rank44]:[W820 [Link].074096526 [Link]] [c10d] recvValue failed on SocketImpl(fd=98, addr=[[Link]]:34742, remote=[[Link]]:15201): failed to recv, got 0 bytes
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/[Link] (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x88 (0x1554dfa73568 in /usr/local/lib/python3.12/dist-packages/torch/lib/[Link])
frame #1: <unknown function> + 0x599d02e (0x15553b86702e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x599f260 (0x15553b869260 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x599fb6a (0x15553b869b6a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x2a9 (0x15553b863629 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x371 (0x1554e0838961 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xecdb4 (0x1554c59cddb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x9caa4 (0x155555260aa4 in /usr/lib/x86_64-linux-gnu/[Link].6)
frame #8: <unknown function> + 0x129c3c (0x1555552edc3c in /usr/lib/x86_64-linux-gnu/[Link].6)

[rank44]:[W820 [Link].080377997 [Link]] [PG ID 0 PG GUID 0(default_pg) Rank 44] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Zero Bytes were transmitted or received
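
In the Error 2 log, the TCPStore warnings appear after the "stopped: max_steps=20 reached." message, and the warning text itself suggests the TCPStore server may have shut down too early, so they may be shutdown noise rather than a training failure. To check which ranks reported the warning, a small Python sketch like the one below can extract them. The function name `ranks_with_tcpstore_warning` is a hypothetical helper, and the sketch assumes each warning sits on a single unwrapped line in the format shown in the log.

```python
import re

# Matches the per-rank warning emitted when the heartbeat monitor cannot reach the TCPStore.
RANK_WARNING = re.compile(
    r'^\[rank(?P<rank>\d+)\]:\[W.*Failed to check the "should dump" flag on TCPStore'
)


def ranks_with_tcpstore_warning(log_text: str) -> list[int]:
    """Return the sorted, de-duplicated ranks that logged the TCPStore warning."""
    ranks = set()
    for line in log_text.splitlines():
        match = RANK_WARNING.match(line)
        if match:
            ranks.add(int(match.group("rank")))
    return sorted(ranks)


# Excerpt of the Error 2 log with wrapped lines joined (placeholders kept as in the source).
sample = """\
[rank47]:[W820 [Link].878765964 [Link]] [c10d] recvValue failed on SocketImpl(fd=98, addr=[[Link]]:34720, remote=[[Link]]:15201): failed to recv, got 0 bytes
[rank47]:[W820 [Link].883910143 [Link]] [PG ID 0 PG GUID 0(default_pg) Rank 47] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
[rank33]:[W820 [Link].973613615 [Link]] [PG ID 0 PG GUID 0(default_pg) Rank 33] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
[rank44]:[W820 [Link].080377997 [Link]] [PG ID 0 PG GUID 0(default_pg) Rank 44] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
"""

affected = ranks_with_tcpstore_warning(sample)
print("Ranks reporting the TCPStore warning:", affected)
```

If every affected rank connects to the same remote port (15201 in this log), that is consistent with all of them losing the same store server at about the same time, though the log alone does not confirm the cause.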
