
HREX with plumed 2.7.2 + gromacs 2020/2021, stalls at random checkpoint writing #742

Open
simonlichtinger opened this issue Oct 4, 2021 · 3 comments

Comments

@simonlichtinger

Dear plumed dev team,

While trying to run REST2 via hrex, I'm having the following issue.

At a random point in the simulation (anywhere between 300 ps and 3 ns in my observations), gromacs stalls while writing a set of checkpoint files. The main task keeps running but drops to very low CPU usage (and the output via the -v flag freezes); some replicas have already written the new checkpoint file while others haven't. Notably, this happens after several checkpoint updates have already succeeded.

I've tried troubleshooting with:

  • Different gromacs versions (occurs for 2020.4 and 2021.3)
  • Making sure there is enough disk space (there is)
  • Different architectures (occurs on my local machine - 2x GTX 3060, CUDA and AVX_512 - as well as on a cluster - Tesla V100, CUDA and IBM_VSX)
  • Different openmpi versions (occurs for 4.0.5 and 4.1.1)
  • Different plumed versions (can only use 2.7.2, as hrex fails at the first exchange attempt with an MPI error in versions 2.7.1 or 2.6)

I'm invoking gromacs via mpirun -np 4 gmx_mpi mdrun -v -deffnm topol -multidir run* -replex 100 -hrex -plumed plumed.dat (with an empty plumed.dat file).
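Schematically, and assuming four replica directories run0-run3 each containing the empty plumed.dat (the directory names are only illustrative, to make the invocation concrete):

    # assumed layout: four replica directories, each with an empty plumed.dat
    for d in run0 run1 run2 run3; do
        touch "$d/plumed.dat"
    done

    # HREX across the four replicas, attempting an exchange every 100 steps
    mpirun -np 4 gmx_mpi mdrun -v -deffnm topol -multidir run* -replex 100 -hrex -plumed plumed.dat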

Is this an issue you are aware of? Might you have any idea what causes it?

Many thanks
Simon

@MauriceKarrenbrock

Hi, I am having the same problem (gromacs 2021.3, plumed 2.7.2).

If I force gromacs to never generate checkpoint files (by using -cpt -1 in mdrun), the jobs run to completion without problems. (Of course, this "solution" is not feasible for MD runs that take longer than the maximum wall time of one's HPC cluster.)
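Concretely, the workaround looks something like this (the other flags are simply copied from the command earlier in this thread and may of course differ in your setup):

    # -cpt -1 disables periodic checkpoint writing, so the hang on checkpoint
    # output never occurs; the downside is that a job hitting the wall-time
    # limit cannot be restarted from a checkpoint
    mpirun -np 4 gmx_mpi mdrun -v -deffnm topol -multidir run* -replex 100 -hrex -plumed plumed.dat -cpt -1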

@insukjoung

Hi,
I have the same problem (version 2021.4-plumed-2.7.3).
CUDA 11.2
mpich 3.4.3
GPU: Tesla V100
After writing the checkpoint file, it hangs.
Once that happens, the generated cpt file has a name like xxxx_step154480.cpt, whereas the name expected by -cpi is xxxx.cpt, so I am unable to restart from the problematic cpt file. I'm not sure whether this information will help you debug.

@GiovanniBussi
Member

GiovanniBussi commented Sep 16, 2022

Closed by #831

(still working on the 2020 patch)
