[petsc-users] TSAdjoint multilevel checkpointing running out of memory

Zhang, Hong hongzhang at anl.gov
Tue Dec 8 17:47:50 CST 2020


Anton,

TSAdjoint should manage checkpointing automatically, and the number of checkpoints in RAM and on disk should not exceed the user-specified values. Can you send us the output of -ts_trajectory_monitor in your case?

Hong (Mr.)

On Dec 8, 2020, at 3:37 PM, Anton Glazkov <anton.glazkov at chch.ox.ac.uk> wrote:

Good evening,

I’m attempting to run a multi-level checkpointing code on a cluster (i.e., RAM+disk storage, with --download-revolve as a configure option) with the options “-ts_trajectory_type memory -ts_trajectory_max_cps_ram 5 -ts_trajectory_max_cps_disk 5000”, for example. My question is: if I have, say, 100,000 time points that need to be evaluated during the forward and adjoint runs, does TSAdjoint automatically optimize the checkpointing so that the number of checkpoints in RAM and on disk does not exceed these values, or is one of the options ignored? I ask because I have a case that runs correctly with -ts_trajectory_type basic but runs out of memory while filling the RAM checkpoints during the adjoint run (I have verified that 5 checkpoints do fit into the available memory). This makes me think that the -ts_trajectory_max_cps_ram 5 option may be ignored.

Best wishes,
Anton
