
Issue with DiskCheckpointing via SingleDiskStorageSchedule #200

@drhodrid

Description

Following discussion on the G-ADOPT GitHub page, I am porting this issue across to pyadjoint on the advice of @Ig-dolci. For background, see the G-ADOPT discussion here -- g-adopt/g-adopt#190.

I have tested a simpler reproducer of our typical mantle convection adjoint cases in 2-D, where we invert for an unknown initial condition. Good news and bad news:

  1. If I run a standard Taylor test for our typical style of case with all default checkpointing options (i.e. in memory -- nothing written to disk), I get convergence rates of 2.0 as expected. Good!
  2. If I take this case and add SingleMemoryStorageSchedule, convergence remains at 2.0, but there is a substantial speedup. Again good! (A minimal sketch of this setup follows the list.)
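
For reference, the structure of these runs is roughly the sketch below. This is a minimal stand-in, not the actual G-ADOPT case: the mesh, forward problem, functional, and perturbation are placeholder choices, and I am assuming the usual Firedrake/pyadjoint incantations (continue_annotation, tape.timestepper, taylor_test):

```python
from firedrake import *
from firedrake.adjoint import *
from checkpoint_schedules import SingleMemoryStorageSchedule

continue_annotation()
tape = get_working_tape()
# Point 2 above: store the forward data needed by the adjoint in memory.
# Comment this line out to recover point 1 (default annotation, no schedule).
tape.enable_checkpointing(SingleMemoryStorageSchedule())

mesh = UnitSquareMesh(16, 16)
V = FunctionSpace(mesh, "CG", 1)
x, y = SpatialCoordinate(mesh)

T0 = Function(V, name="T_ic").interpolate(sin(pi * x) * sin(pi * y))
T = Function(V, name="T").assign(T0)
T_new = Function(V)
v = TestFunction(V)
dt = Constant(0.01)
F = ((T_new - T) / dt * v + inner(grad(T_new), grad(v))) * dx

# The time loop is wrapped in tape.timestepper so the schedule is applied.
for step in tape.timestepper(iter(range(10))):
    solve(F == 0, T_new)
    T.assign(T_new)

J = assemble(T * T * dx)
pause_annotation()

Jhat = ReducedFunctional(J, Control(T0))
h = Function(V).interpolate(Constant(0.01) * x * y)  # arbitrary perturbation
print(taylor_test(Jhat, T0, h))  # expect rates of ~2.0
```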

So the SingleMemoryStorageSchedule seems to behave as expected. The issues arise with disk checkpointing:

  1. If I run a standard Taylor test, now with only "enable_disk_checkpointing" triggered, convergence remains at 2.0.
  2. If I modify this case to also use SingleDiskStorageSchedule, as advised, my convergence rate drops. In the reproducer I have, it drops to ~1.9, but I have seen other cases go below 1.3. (A sketch of this configuration follows the list.)
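
As far as I understand it, the two disk configurations differ from the memory sketch above only in the setup lines. A fragment showing just that delta (again with a placeholder mesh, and function names taken from Firedrake/checkpoint_schedules):

```python
from firedrake import *
from firedrake.adjoint import *
from checkpoint_schedules import SingleDiskStorageSchedule

continue_annotation()
# Point 1 above: write checkpoint data to disk rather than holding it in memory.
enable_disk_checkpointing()
tape = get_working_tape()
# Point 2 above: additionally schedule the adjoint's forward data onto disk.
# This is the line whose addition degrades my Taylor-test rates.
tape.enable_checkpointing(SingleDiskStorageSchedule())

# With disk checkpointing enabled, the mesh must be made checkpointable.
mesh = checkpointable_mesh(UnitSquareMesh(16, 16))
# ... forward model, functional and taylor_test exactly as in the sketch above.
```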

This issue with SingleDiskStorageSchedule appears in both serial and parallel runs, but it is difficult to capture in a minimal reproducer -- I have tried all day with no luck. With that in mind, I share my full setup here:

Serial -- Disk Checkpointing Enabled with SingleDiskStorageSchedule: https://www.dropbox.com/scl/fi/6gmurfrlkj2hairiscijf/Taylor_Serial_SingleDiskCheckpointingSchedule.tar.gz?rlkey=lcs4slwgc12fshbw5s24xsch9

Serial -- Disk Checkpointing Enabled with no Checkpoint Schedule specified: https://www.dropbox.com/scl/fi/z7e1d1xym6kkuzi8iwayx/Taylor_Serial_EnableDiskCheckpointing.tar.gz?rlkey=myd05wssurg808uurfrozp5pd

Serial -- SingleMemoryStorageSchedule: https://www.dropbox.com/scl/fi/e7puz36orfbwn2oq1p6mj/Taylor_Serial_SingleMemoryCheckpointingSchedule.tar.gz?rlkey=gfby5hum7qmyewkhih1fv5up2

Note that convergence rates are printed at the end of the output.dat file in each folder. Aside from the changes to memory/disk checkpointing, the cases are otherwise identical.

To run these cases, it's as simple as python inverse.py &> output.dat -- they pick up from a checkpoint, which is also included in each folder.

For me, each case takes roughly an hour to run. I know this isn't ideal, but attempts to reproduce the problem in a simpler setup have failed (which I find difficult to understand!).
