worker process died #25

Open
ypwang61 opened this issue Feb 3, 2025 · 7 comments

ypwang61 commented Feb 3, 2025

Thanks for your cool work!!

When trying to run experiments on 5 nodes with 40 A100 GPUs, I hit this error after the first training epoch. What could be the potential reason for it?

Traceback (most recent call last):
  File "/scratch/amlt_code/train/openrlhf/cli/train_ppo_ray_box.py", line 395, in <module>
    train(args)
  File "/scratch/amlt_code/train/openrlhf/cli/train_ppo_ray_box.py", line 175, in train
    ray.get(refs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 863, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
	class_name: ActorModelRayActorBOX
	actor_id: f20da4dcf474d8bb05e952bb03000000
	pid: 5226
	namespace: aa07eaa0-19a3-4a8b-9095-8872e5d243ba
	ip: 10.7.34.138
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
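
The exit detail above points at three possible causes (OOM kill, ray stop --force, or an unexpected crash). A first triage step, sketched below under the assumption that Ray is using its default /tmp/ray session directory, is to log into the node named in the error (ip 10.7.34.138) and check both the kernel log for OOM kills and the dead worker's own log files:

    # check whether the kernel OOM killer terminated the worker process
    dmesg -T | grep -iE "out of memory|killed process"

    # Ray writes per-worker logs under the session directory; the worker PID (5226 here)
    # appears in the log file names
    ls /tmp/ray/session_latest/logs/ | grep 5226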
Zeng-WH (Collaborator) commented Feb 3, 2025

Could you please specify where exactly the error occurred? For example, was it during the rollout, experience-making, gradient-descent, or broadcasting phase?

ypwang61 (Author) commented Feb 3, 2025

It happens during making experience, and it's pretty random: sometimes it happens after the 1st step, sometimes after 10 global steps (in that run I used vllm_num_engines=8 rather than 16).

make_experience:  24%|██▍       | 124/512 [03:11<11:33,  1.79s/it]
(ActorModelRayActorBOX pid=4809, ip=10.7.35.246)
make_experience:  24%|██▍       | 125/512 [03:13<11:52,  1.84s/it]
(ActorModelRayActorBOX pid=4809, ip=10.7.35.246)
make_experience:  25%|██▍       | 126/512 [03:15<11:52,  1.85s/it]
(ActorModelRayActorBOX pid=4951, ip=10.7.35.246) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... [repeated 46x across cluster]
(ActorModelRayActorBOX pid=4951, ip=10.7.35.246) To disable this warning, you can either: [repeated 46x across cluster]
(ActorModelRayActorBOX pid=4951, ip=10.7.35.246)     - Avoid using `tokenizers` before the fork if possible [repeated 46x across cluster]
(ActorModelRayActorBOX pid=4951, ip=10.7.35.246)     - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) [repeated 46x across cluster]
(ActorModelRayActorBOX pid=4809, ip=10.7.35.246)
make_experience:  25%|██▍       | 127/512 [03:17<12:01,  1.87s/it]
(ActorModelRayActorBOX pid=4809, ip=10.7.35.246)
make_experience:  25%|██▌       | 128/512 [03:19<12:44,  1.99s/it]
(ActorModelRayActorBOX pid=4956, ip=10.7.35.246) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... [repeated 47x across cluster]
(ActorModelRayActorBOX pid=4956, ip=10.7.35.246) To disable this warning, you can either: [repeated 47x across cluster]
(ActorModelRayActorBOX pid=4956, ip=10.7.35.246)     - Avoid using `tokenizers` before the fork if possible [repeated 47x across cluster]
(ActorModelRayActorBOX pid=4956, ip=10.7.35.246)     - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) [repeated 47x across cluster]
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) [rank5]:[E203 10:34:03.416910779 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1495581, OpType=_ALLGATHER_BASE, NumelIn=68124672, NumelOut=544997376, Timeout(ms)=1800000) ran for 1800036 milliseconds before timing out.
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) [rank5]:[E203 10:34:03.417077239 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 5] Exception (either an error or timeout) detected by watchdog at work: 1495581, last enqueued NCCL work: 1495583, last completed NCCL work: 1495580.
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) [rank5]:[E203 10:34:03.417091806 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 5] Timeout at NCCL work: 1495581, last enqueued NCCL work: 1495583, last completed NCCL work: 1495580.
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) [rank5]:[E203 10:34:03.417110852 ProcessGroupNCCL.cpp:630] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) [rank5]:[E203 10:34:03.417116102 ProcessGroupNCCL.cpp:636] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) [rank5]:[E203 10:34:03.418413023 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1495581, OpType=_ALLGATHER_BASE, NumelIn=68124672, NumelOut=544997376, Timeout(ms)=1800000) ran for 1800036 milliseconds before timing out.
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fa84b4b9446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f76ba7aa772 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f76ba7b1bb3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f76ba7b361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #4: <unknown function> + 0xdc253 (0x7fa85d8aa253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #5: <unknown function> + 0x94ac3 (0x7fa85fb1fac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #6: <unknown function> + 0x126850 (0x7fa85fbb1850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) [2025-02-03 10:34:03,841 E 4954 6881] logging.cc:101: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG ID 0 PG GUID 0(default_pg) Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1495581, OpType=_ALLGATHER_BASE, NumelIn=68124672, NumelOut=544997376, Timeout(ms)=1800000) ran for 1800036 milliseconds before timing out.
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fa84b4b9446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f76ba7aa772 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f76ba7b1bb3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f76ba7b361d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #4: <unknown function> + 0xdc253 (0x7fa85d8aa253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #5: <unknown function> + 0x94ac3 (0x7fa85fb1fac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #6: <unknown function> + 0x126850 (0x7fa85fbb1850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fa84b4b9446 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #1: <unknown function> + 0xe4271b (0x7f76ba42071b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #2: <unknown function> + 0xdc253 (0x7fa85d8aa253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #3: <unknown function> + 0x94ac3 (0x7fa85fb1fac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) frame #4: <unknown function> + 0x126850 (0x7fa85fbb1850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) [2025-02-03 10:34:03,861 E 4954 6881] logging.cc:108: Stack trace:
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246)  /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xfe573a) [0x7fa85e9e473a] ray::operator<<()
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xfe81f8) [0x7fa85e9e71f8] ray::TerminateHandler()
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7fa85d87c20c]
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7fa85d87c277]
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7fa85d87c1fe]
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe427c9) [0x7f76ba4207c9] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fa85d8aa253]
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fa85fb1fac3]
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fa85fbb1850]
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) *** SIGABRT received at time=1738578843 on cpu 55 ***
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) PC: @     0x7fa85fb219fc  (unknown)  pthread_kill
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246)     @     0x7fa85facd520  (unknown)  (unknown)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) [2025-02-03 10:34:03,861 E 4954 6881] logging.cc:365: *** SIGABRT received at time=1738578843 on cpu 55 ***
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) [2025-02-03 10:34:03,861 E 4954 6881] logging.cc:365: PC: @     0x7fa85fb219fc  (unknown)  pthread_kill
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) [2025-02-03 10:34:03,862 E 4954 6881] logging.cc:365:     @     0x7fa85facd520  (unknown)  (unknown)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246) Fatal Python error: Aborted
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246)
(ActorModelRayActorBOX pid=4954, ip=10.7.35.246)
�[36m(ActorModelRayActorBOX pid=4954, ip=10.7.35.246)�[0m Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, cython.cimports.libc.math, Cython.Utils, Cython.Plex.Actions, Cython.Plex.Transitions, Cython.Plex.Machines, Cython.Plex.DFA, Cython.Plex.Scanners, Cython.Compiler.Scanning, Cython.StringIOTree, Cython.Compiler.Code, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, sklearn.__check_build._check_build, scipy.special._ufuncs_cxx, scipy.special._cdflib, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, 
scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats.beta_ufunc, scipy.stats._boost.beta_ufunc, scipy.stats.binom_ufunc, scipy.stats._boost.binom_ufunc, scipy.stats.nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, scipy.stats.hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, scipy.stats.ncf_ufunc, scipy.stats._boost.ncf_ufunc, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._ansari_swilk_statistics, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.stats._unuran.unuran_wrapper, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, sklearn.utils._random, markupsafe._speedups, PIL._imaging, sentencepiece._sentencepiece, PIL._imagingft, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, pyarrow._acero, pyarrow._csv, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, regex._regex, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils, msgspec._core (total: 246)
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff8acf305c671e4eeab0abcfe303000000 Worker ID: 9ca9358bac209c6139cefb95c790eb48c2b1d3b12fe50c8e14bf3a49 Node ID: 96355bfe3937f7bdc789e8cdb07c03014afe80b3b5c232091e6e7c81 Worker IP address: 10.7.35.246 Worker port: 10005 Worker PID: 4950 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(ActorModelRayActorBOX pid=4952, ip=10.7.35.246) [2025-02-03 09:59:14,059] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step10 is ready now! [repeated 8x across cluster]
Traceback (most recent call last):
  File "/scratch/amlt_code/train/openrlhf/cli/train_ppo_ray_box.py", line 395, in <module>
    train(args)
  File "/scratch/amlt_code/train/openrlhf/cli/train_ppo_ray_box.py", line 175, in train
    ray.get(refs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 863, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
	class_name: ActorModelRayActorBOX
	actor_id: 8acf305c671e4eeab0abcfe303000000
	pid: 4950
	namespace: 9879d1d0-f0b6-41fa-b242-91c30df9a23c
	ip: 10.7.35.246
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
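
As an aside, the repeated huggingface/tokenizers fork warnings in the log above are unrelated to the crash; per the message itself, they can be silenced by exporting the variable before the launch command, e.g.:

    # optional: silence the tokenizers fork warning (not part of the original script)
    export TOKENIZERS_PARALLELISM=false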

I ran it on 32 A100 GPUs and didn't change much in the script:

HDFS_HOME=...
RUN_NAME=Qwen2.5-Math-7B_ppo_from_base_math_lv35

python3 openrlhf/cli/train_ppo_ray_box.py \
    --ref_num_nodes 1 \
    --ref_num_gpus_per_node 8 \
    --reward_num_nodes 0 \
    --reward_num_gpus_per_node 0 \
    --critic_num_nodes 1 \
    --critic_num_gpus_per_node 8 \
    --actor_num_nodes 1 \
    --actor_num_gpus_per_node 8 \
    --vllm_num_engines 16 \
    --vllm_sync_backend=gloo \
    --vllm_tensor_parallel_size 1 \
    --colocate_actor_ref \
    --pretrain Qwen/Qwen2.5-Math-7B \
    --save_path $HDFS_HOME/simplerl_checkpoints/$RUN_NAME \
    --micro_train_batch_size 2 \
    --train_batch_size 128 \
    --micro_rollout_batch_size 2 \
    --rollout_batch_size 1024 \
    --temperature 0.6 \
    --n_samples_per_prompt 8 \
    --max_samples 100000 \
    --max_epochs 1 \
    --num_episodes 20 \
    --prompt_max_len 1024 \
    --generate_max_len 3000 \
    --zero_stage 3 \
    --bf16 \
    --actor_learning_rate 5e-7 \
    --critic_learning_rate 9e-6 \
    --init_kl_coef 0.01 \
    --prompt_data  data/math_level3to5_data_processed_with_qwen_prompt.json \
    --input_key input \
    --normalize_reward \
    --flash_attn \
    --gradient_checkpointing \
    --save_steps 1 \
    --load_checkpoint \
    --use_wandb "..." \
    --wandb_run_name $RUN_NAME \
    --ckpt_path $HDFS_HOME/checkpoints/$RUN_NAME  \
    --max_ckpt_num 20000
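
The watchdog lines above show a single collective (_ALLGATHER_BASE) hitting the default 30-minute limit (Timeout(ms)=1800000), which usually means one rank is stuck or never entered the collective rather than the collective itself being slow. To see which rank stalls, these standard NCCL environment variables (not options of this training script) could be exported before the python3 launch command; the process-group timeout itself is set in code via torch.distributed/DeepSpeed initialization, not via an environment variable:

    # hypothetical debugging additions, exported before launching train_ppo_ray_box.py
    export NCCL_DEBUG=INFO              # per-rank NCCL initialization and collective logging
    export NCCL_DEBUG_SUBSYS=INIT,COLL  # limit the extra output to init and collective calls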

levishen commented Feb 6, 2025

> It happens during making experience, and it's pretty random: sometimes it happens after the 1st step, sometimes after 10 global steps (in that run I used vllm_num_engines=8 rather than 16). [...]

Me too.

ypwang61 (Author) commented Feb 6, 2025

I just reran the experiment; it restarts from the latest checkpoint.
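
For reference, the resume presumably comes from these two flags that were already in the launch script above; the actual checkpoint selection is handled by OpenRLHF:

    --load_checkpoint \
    --ckpt_path $HDFS_HOME/checkpoints/$RUN_NAME \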

levishen commented Feb 6, 2025

make_experience: 2%|▏ | 50/2048 [02:49<1:32:39, 2.78s/it]
(ActorModelRayActorBOX pid=12753)
(ActorModelRayActorBOX pid=12753)
make_experience: 2%|▏ | 51/2048 [02:50<1:21:38, 2.45s/it]
(ActorModelRayActorBOX pid=12903) [rank1]:[E206 16:45:19.236012277 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5919, OpType=_ALLGATHER_BASE, NumelIn=272498688, NumelOut=544997376, Timeout(ms)=1800000) ran for 1800098 milliseconds before timing out.
(ActorModelRayActorBOX pid=12903) [rank1]:[E206 16:45:19.244433255 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 5919, last enqueued NCCL work: 5921, last completed NCCL work: 5918.
(ActorModelRayActorBOX pid=12903) [rank1]:[E206 16:45:19.244459862 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 5919, last enqueued NCCL work: 5921, last completed NCCL work: 5918.
(ActorModelRayActorBOX pid=12903) [rank1]:[E206 16:45:19.244471319 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(ActorModelRayActorBOX pid=12903) [rank1]:[E206 16:45:19.244477656 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
(ActorModelRayActorBOX pid=12903) [rank1]:[E206 16:45:19.249997119 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5919, OpType=_ALLGATHER_BASE, NumelIn=272498688, NumelOut=544997376, Timeout(ms)=1800000) ran for 1800098 milliseconds before timing out.
(ActorModelRayActorBOX pid=12903) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
(ActorModelRayActorBOX pid=12903) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f0324c4f446 in /nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/torch/lib/libc10.so)
(ActorModelRayActorBOX pid=12903) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7ed23214fa92 in /nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(ActorModelRayActorBOX pid=12903) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ed232156ed3 in /nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(ActorModelRayActorBOX pid=12903) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ed23215893d in /nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(ActorModelRayActorBOX pid=12903) frame #4: <unknown function> + 0xd6df4 (0x7f0333b91df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(ActorModelRayActorBOX pid=12903) frame #5: <unknown function> + 0x8609 (0x7f0335f22609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
(ActorModelRayActorBOX pid=12903) frame #6: clone + 0x43 (0x7f0335ced353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(ActorModelRayActorBOX pid=12903)
(ActorModelRayActorBOX pid=12903) [2025-02-06 16:45:19,053 E 12903 13435] logging.cc:101: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5919, OpType=_ALLGATHER_BASE, NumelIn=272498688, NumelOut=544997376, Timeout(ms)=1800000) ran for 1800098 milliseconds before timing out.
(ActorModelRayActorBOX pid=12903) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
(ActorModelRayActorBOX pid=12903) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f0324c4f446 in /nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/torch/lib/libc10.so)
(ActorModelRayActorBOX pid=12903) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7ed23214fa92 in /nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(ActorModelRayActorBOX pid=12903) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ed232156ed3 in /nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(ActorModelRayActorBOX pid=12903) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ed23215893d in /nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(ActorModelRayActorBOX pid=12903) frame #4: <unknown function> + 0xd6df4 (0x7f0333b91df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(ActorModelRayActorBOX pid=12903) frame #5: <unknown function> + 0x8609 (0x7f0335f22609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
(ActorModelRayActorBOX pid=12903) frame #6: clone + 0x43 (0x7f0335ced353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(ActorModelRayActorBOX pid=12903)
(ActorModelRayActorBOX pid=12903) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
(ActorModelRayActorBOX pid=12903) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f0324c4f446 in /nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/torch/lib/libc10.so)
(ActorModelRayActorBOX pid=12903) frame #1: <unknown function> + 0xe7eb1b (0x7ed231dcdb1b in /nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
(ActorModelRayActorBOX pid=12903) frame #2: <unknown function> + 0xd6df4 (0x7f0333b91df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(ActorModelRayActorBOX pid=12903) frame #3: <unknown function> + 0x8609 (0x7f0335f22609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
(ActorModelRayActorBOX pid=12903) frame #4: clone + 0x43 (0x7f0335ced353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(ActorModelRayActorBOX pid=12903)
(ActorModelRayActorBOX pid=12903) [2025-02-06 16:45:19,245 E 12903 13435] logging.cc:108: Stack trace:
(ActorModelRayActorBOX pid=12903) /nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/ray/_raylet.so(+0xfe573a) [0x7f0334c8f73a] ray::operator<<()
(ActorModelRayActorBOX pid=12903) /nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/ray/_raylet.so(+0xfe81f8) [0x7f0334c921f8] ray::TerminateHandler()
(ActorModelRayActorBOX pid=12903) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f0333b6537c]
(ActorModelRayActorBOX pid=12903) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f0333b653e7]
(ActorModelRayActorBOX pid=12903) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7f0333b6536f]
(ActorModelRayActorBOX pid=12903) /nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so(+0xe7ebc9) [0x7ed231dcdbc9] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(ActorModelRayActorBOX pid=12903) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f0333b91df4]
(ActorModelRayActorBOX pid=12903) /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f0335f22609] start_thread
(ActorModelRayActorBOX pid=12903) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f0335ced353] __clone
(ActorModelRayActorBOX pid=12903)
(ActorModelRayActorBOX pid=12903) *** SIGABRT received at time=1738831519 on cpu 9 ***
(ActorModelRayActorBOX pid=12903) PC: @ 0x7f0335c1100b (unknown) raise
(ActorModelRayActorBOX pid=12903) @ 0x7f0335f2e420 4048 (unknown)
(ActorModelRayActorBOX pid=12903) @ 0x7f0333b6537c (unknown) (unknown)
(ActorModelRayActorBOX pid=12903) @ 0x7f0333b65090 (unknown) (unknown)
(ActorModelRayActorBOX pid=12903) [2025-02-06 16:45:19,247 E 12903 13435] logging.cc:365: *** SIGABRT received at time=1738831519 on cpu 9 ***
(ActorModelRayActorBOX pid=12903) [2025-02-06 16:45:19,247 E 12903 13435] logging.cc:365: PC: @ 0x7f0335c1100b (unknown) raise
(ActorModelRayActorBOX pid=12903) [2025-02-06 16:45:19,247 E 12903 13435] logging.cc:365: @ 0x7f0335f2e420 4048 (unknown)
(ActorModelRayActorBOX pid=12903) [2025-02-06 16:45:19,247 E 12903 13435] logging.cc:365: @ 0x7f0333b6537c (unknown) (unknown)
(ActorModelRayActorBOX pid=12903) [2025-02-06 16:45:19,247 E 12903 13435] logging.cc:365: @ 0x7f0333b65090 (unknown) (unknown)
(ActorModelRayActorBOX pid=12903) Fatal Python error: Aborted
(ActorModelRayActorBOX pid=12903)
(ActorModelRayActorBOX pid=12903)
�[36m(ActorModelRayActorBOX pid=12903)�[0m Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, zstandard.backend_c, charset_normalizer.md, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, markupsafe._speedups, PIL._imaging, sklearn.__check_build._check_build, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats.beta_ufunc, scipy.stats._boost.beta_ufunc, scipy.stats.binom_ufunc, scipy.stats._boost.binom_ufunc, scipy.stats.nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, scipy.stats.hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, scipy.stats.ncf_ufunc, scipy.stats._boost.ncf_ufunc, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, 
scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._statlib, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, sklearn.utils._isfinite, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.utils.sparsefuncs_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, PIL._imagingft, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pandas._libs.tslib, pandas._libs.ops, numexpr.interpreter, pyarrow._compute, pandas._libs.arrays, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, regex._regex, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils, msgspec._core, sentencepiece._sentencepiece, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils (total: 233)
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffc2deec0a1a0d9238b661375502000000 Worker ID: 05f5d057dff9907bb342191ffa7158ac18d5d1ebce7e7ecde8bb8b16 Node ID: 34fc6ceb89bf93846faa55035dbde5ecf710b65a8777233a85441e49 Worker IP address: 0.0.0.0 Worker port: 10136 Worker PID: 12903 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(LLMRayActor pid=13858) INFO 02-06 15:59:51 model_runner.py:1563] Graph capturing finished in 18 secs, took 0.78 GiB
(LLMRayActor pid=13858) INFO 02-06 15:59:51 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 23.02 seconds
(LLMRayActor pid=13858) init_process_group: master_address=0.0.0.0, master_port=41339, rank=2, world_size=3, group_name=openrlhf
Traceback (most recent call last):
  File "/nfs/ofs-llm-ssd/user/shenlibin/simpleRL/simpleRL-reason-main/train/openrlhf/cli/train_ppo_ray_box.py", line 395, in <module>
    train(args)
  File "/nfs/ofs-llm-ssd/user/shenlibin/simpleRL/simpleRL-reason-main/train/openrlhf/cli/train_ppo_ray_box.py", line 175, in train
    ray.get(refs)
  File "/nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/ray/_private/worker.py", line 2623, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/ray/_private/worker.py", line 863, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
    class_name: ActorModelRayActorBOX
    actor_id: c2deec0a1a0d9238b661375502000000
    pid: 12903
    namespace: e56aa1f7-b19b-4947-9fd5-a72beb088786
    ip: 0.0.0.0
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2025-02-06 16:45:36,354 ERR cli.py:68 -- ---------------------------------------
2025-02-06 16:45:36,354 ERR cli.py:69 -- Job 'raysubmit_TxJ1Zkh4WQdA664e' failed
2025-02-06 16:45:36,354 ERR cli.py:70 -- ---------------------------------------
2025-02-06 16:45:36,354 INFO cli.py:83 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/nfs/ofs-llm-ssd/user/shenlibin/env/openrlhf/lib/python3.10/site-packages/ray/_private/worker.py", line 863, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
    class_name: ActorModelRayActorBOX
    actor_id: c2deec0a1a0d9238b661375502000000
    pid: 12903
    namespace: e56aa1f7-b19b-4947-9fd5-a72beb088786
    ip: 0.0.0.0
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
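
Since the exit detail lists the kernel OOM killer as root cause (1), it is worth confirming that directly on the node that hosted the dead worker (PID 12903 above). The snippet below is only a diagnostic sketch, not part of the training code; it assumes you can read the kernel ring buffer on that node (dmesg may require elevated privileges depending on the host configuration), and the file name is made up:

# oom_check.py - hypothetical diagnostic helper, not part of this repo.
# Looks for kernel OOM-killer entries on the node whose worker died.
import subprocess

def find_oom_kills():
    # "dmesg -T" prints human-readable timestamps; it may need root or
    # kernel.dmesg_restrict=0 depending on the host.
    out = subprocess.run(["dmesg", "-T"], capture_output=True, text=True)
    keywords = ("out of memory", "oom-kill", "killed process")
    return [line for line in out.stdout.splitlines()
            if any(k in line.lower() for k in keywords)]

if __name__ == "__main__":
    hits = find_oom_kills()
    if hits:
        print("\n".join(hits))
    else:
        print("No OOM-killer entries found; the crash may be a SIGSEGV or something else.")

If an entry naming the dead worker's PID shows up around the time of the failure, it is host RAM (not GPU memory) being exhausted during make_experience.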

@Ziwei-Zheng
Copy link

Same problem here. I run it on 4 nodes with 32 GPUs.

@to1a
Copy link

to1a commented Feb 11, 2025

Pretty much the same issue. When I train Qwen-math-base on the MATH 3-5 dataset with 5 nodes of H20s it works well, but when I change the checkpoint (base+sft) or the dataset (Omni) it just randomly dies (sometimes at step 1, sometimes at step 10).
I tried

export RAY_memory_usage_threshold=0.95 # or 0.8
export RAY_memory_monitor_refresh_ms=0

And this did not work.
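
One thing to keep in mind: RAY_memory_monitor_refresh_ms=0 disables Ray's own memory monitor entirely, so Ray will no longer warn about or pre-emptively kill memory-hungry tasks, and the kernel OOM killer ignores RAY_memory_usage_threshold either way. To see whether host RAM actually spikes during make_experience, I would run a tiny watcher on each node alongside training. This is only a sketch under the assumption that psutil is available in the environment (it ships as a Ray dependency); the file name and interval are made up:

# mem_watch.py - hypothetical helper, not part of this repo.
# Appends host memory usage to a log file at a fixed interval.
import time
import psutil

def watch(interval_s: float = 5.0, logfile: str = "mem_watch.log"):
    with open(logfile, "a") as f:
        while True:
            vm = psutil.virtual_memory()  # host RAM, not GPU memory
            f.write(f"{time.strftime('%H:%M:%S')} "
                    f"used={vm.used / 1e9:.1f}GB "
                    f"available={vm.available / 1e9:.1f}GB "
                    f"({vm.percent:.1f}% used)\n")
            f.flush()
            time.sleep(interval_s)

if __name__ == "__main__":
    watch()

If available memory collapses right before the actor dies, reducing the rollout/micro batch sizes or the number of vLLM engines per node is probably more effective than raising the Ray thresholds.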
