While trying to run REST2 via hrex, I'm having the following issue.
At a random point into the simulation (I have observed anywhere between 300 ps and 3 ns), gromacs stalls while writing a set of checkpoint files. The main task is still running but drops to very low CPU usage (and output via the -v flag freezes); some replicas have already written the new checkpoint file while others haven't. Notably, this happens after several checkpoint updates have already succeeded.
I've tried troubleshooting with:
Testing different gromacs versions (occurs for 2020.4 and 2021.3)
Making sure there is enough disk space (there is)
Running on different architectures (occurs for my local machine - 2xGTX 3060, CUDA and AVX_512, as well as a cluster - Tesla V100, CUDA and IBM_VSX)
Different openmpi versions, occurs for 4.0.5 and 4.1.1
Different plumed versions (I can only use 2.7.2, as hrex fails at the first exchange attempt with an MPI error in versions 2.7.1 and 2.6)
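For anyone trying to confirm the same symptom (some replicas have written the new checkpoint, others have not), a rough sketch of a check is below. It assumes the replica directories match run* and that the checkpoints are named topol.cpt (from -deffnm topol), and it relies on GNU stat. The function name lagging_checkpoints and the 60 s default are made up for illustration.

```shell
#!/bin/sh
# Print replica checkpoints that are older than the newest one by more than
# a threshold. Assumptions (not from GROMACS itself): directories match run*,
# checkpoints are named topol.cpt, and GNU stat (stat -c %Y) is available.
lagging_checkpoints() (
    max_lag=${1:-60}    # allowed lag in seconds (hypothetical default)
    newest=0

    # First pass: find the most recent modification time.
    for f in run*/topol.cpt; do
        [ -e "$f" ] || continue
        t=$(stat -c %Y "$f")
        if [ "$t" -gt "$newest" ]; then
            newest=$t
        fi
    done

    # Second pass: report every checkpoint that lags too far behind.
    for f in run*/topol.cpt; do
        [ -e "$f" ] || continue
        t=$(stat -c %Y "$f")
        lag=$((newest - t))
        if [ "$lag" -gt "$max_lag" ]; then
            echo "$f $lag"
        fi
    done
)
```

Run from the directory that contains the replica folders, e.g. `lagging_checkpoints 120`. Each output line is a checkpoint path followed by how many seconds it trails the newest one. No output means all replicas are within the threshold.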
Hi, I am having the same problem (gromacs 2021.3 plumed 2.7.2)
If I force gromacs to never generate checkpoint files (by using -cpt -1 in mdrun), the jobs run to the end without problems. (But of course this "solution" is not feasible for MD runs that take longer than the maximum wall time of one's HPC cluster.)
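As a sketch of that workaround, here is the mdrun invocation from the original report with -cpt -1 appended. The wrapper name run_without_checkpoints and the DRY_RUN switch are hypothetical and only there so the command can be inspected before it is launched. The -np 4 and run* values are copied from the report and need adapting to the actual number of replicas.

```shell
#!/bin/sh
# Launch (or, with DRY_RUN=1, just print) the hrex run with periodic
# checkpointing disabled via "-cpt -1".
# run_without_checkpoints and DRY_RUN are illustrative names, not GROMACS options.
run_without_checkpoints() (
    set -- mpirun -np 4 gmx_mpi mdrun -v -deffnm topol -multidir run* \
        -replex 100 -hrex -plumed plumed.dat -cpt -1
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"    # show the fully expanded command only
    else
        "$@"         # actually start the simulation
    fi
)
```

Calling `DRY_RUN=1 run_without_checkpoints` from the directory holding the replica folders prints the expanded command, which is a quick way to check that run* picked up the intended directories.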
Hi,
I have the same problem. (version 2021.4-plumed-2.7.3)
CUDA 11.2
mpich 3.4.3
GPU: Tesla V100
It hangs after writing the checkpoint file.
Once it happens, the generated cpt file has a name like xxxx_step154480.cpt, whereas the name given to -cpi is xxxx.cpt. I am unable to restart from the problematic cpt file. I'm not sure whether this information will help you debug.
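To see which replicas were left with such a step-numbered file after a hang, something like the following could be used. It is only a sketch: it assumes the replica directories match run*, and stale_step_checkpoints is a made-up name.

```shell
#!/bin/sh
# List leftover step-numbered checkpoint files (names like
# xxxx_step154480.cpt) in every replica directory.
# Assumption: replica directories match run*.
stale_step_checkpoints() {
    for f in run*/*_step[0-9]*.cpt; do
        [ -e "$f" ] || continue    # the glob matched nothing
        echo "$f"
    done
}
```

Comparing this list against the replicas that still have a plain xxxx.cpt shows which ones were caught in the middle of writing the checkpoint.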
Dear plumed dev team,
I'm invoking gromacs via
mpirun -np 4 gmx_mpi mdrun -v -deffnm topol -multidir run* -replex 100 -hrex -plumed plumed.dat
(with an empty plumed.dat file). Is this an issue you are aware of? Do you have any idea what causes it?
Many thanks
Simon