
HREX with plumed 2.7.2 + gromacs 2020/2021, stalls at random checkpoint writing #742

Open
simonlichtinger opened this issue Oct 4, 2021 · 3 comments

Comments

@simonlichtinger

Dear plumed dev team,

While trying to run REST2 via hrex, I'm having the following issue.

At a random point in the simulation (anywhere between 300 ps and 3 ns in my observations), gromacs stalls while writing a set of checkpoint files. The main task keeps running but drops to very low CPU usage (and the output via the -v flag freezes); some replicas have already written the new checkpoint file while others haven't. Notably, this happens after several checkpoint updates have already succeeded.

I've tried troubleshooting with:

  • Different gromacs versions (occurs for 2020.4 and 2021.3)
  • Making sure there is enough disk space (there is)
  • Different architectures (occurs on my local machine - 2x GTX 3060, CUDA and AVX_512 - as well as on a cluster - Tesla V100, CUDA and IBM_VSX)
  • Different openmpi versions (occurs for 4.0.5 and 4.1.1)
  • Different plumed versions (can only use 2.7.2, as hrex fails at the first exchange attempt with an MPI error in versions 2.7.1 or 2.6)

I'm invoking gromacs via mpirun -np 4 gmx_mpi mdrun -v -deffnm topol -multidir run* -replex 100 -hrex -plumed plumed.dat (with an empty plumed.dat file).
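Schematically, and assuming four replica directories run0-run3 each containing the empty plumed.dat (the directory names are only illustrative, to make the invocation concrete):

    # assumed layout: four replica directories, each with an empty plumed.dat
    for d in run0 run1 run2 run3; do
        touch "$d/plumed.dat"
    done

    # HREX across the four replicas, attempting an exchange every 100 steps
    mpirun -np 4 gmx_mpi mdrun -v -deffnm topol -multidir run* -replex 100 -hrex -plumed plumed.dat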

Is this an issue you are aware of? Might you have any idea what causes it?

Many thanks
Simon

@MauriceKarrenbrock

Hi, I am having the same problem (gromacs 2021.3, plumed 2.7.2).

If I force gromacs to never generate checkpoint files (by using -cpt -1 in mdrun), the jobs run to completion without problems. (Of course, this "solution" is not feasible for MD runs that take longer than the maximum wall time of one's HPC cluster.)
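Concretely, the workaround looks something like this (the other flags are simply copied from the command earlier in this thread and may of course differ in your setup):

    # -cpt -1 disables periodic checkpoint writing, so the hang on checkpoint
    # output never occurs; the downside is that a job hitting the wall-time
    # limit cannot be restarted from a checkpoint
    mpirun -np 4 gmx_mpi mdrun -v -deffnm topol -multidir run* -replex 100 -hrex -plumed plumed.dat -cpt -1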

@insukjoung

Hi,
I have the same problem (version 2021.4-plumed-2.7.3).
CUDA 11.2
mpich 3.4.3
GPU: Tesla V100
After writing the checkpoint file, it hangs.
Once that happens, the generated cpt file has a name like xxxx_step154480.cpt, whereas the name expected by -cpi is xxxx.cpt, so I am unable to restart from the problematic cpt file. I'm not sure whether this information will help you debug.

@GiovanniBussi
Member

GiovanniBussi commented Sep 16, 2022

Closed by #831

(still working on the 2020 patch)
