
Issue with DiskCheckpointing via SingleDiskStorageSchedule #200

@drhodrid

Description

Following discussion on the G-ADOPT GitHub page, I am porting this issue across to pyadjoint on the advice of @Ig-dolci. For background, see the G-ADOPT discussion here -- g-adopt/g-adopt#190.

I have tested a simpler reproducer of our typical mantle convection adjoint cases in 2-D, where we invert for an unknown initial condition. Good news and bad news:

  1. If I run a standard Taylor test for our typical style of case with all default checkpointing options (i.e. in memory -- nothing written to disk), I get convergence rates of 2.0 as expected. Good!
  2. If I take this case and add SingleMemoryStorageSchedule, convergence remains at 2.0, but there is a substantial speedup. Again good! (A minimal sketch of this setup follows the list.)
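
For reference, the structure of these runs is roughly the sketch below. This is a minimal stand-in, not the actual G-ADOPT case: the mesh, forward problem, functional, and perturbation are placeholder choices, and I am assuming the usual Firedrake/pyadjoint incantations (continue_annotation, tape.timestepper, taylor_test):

```python
from firedrake import *
from firedrake.adjoint import *
from checkpoint_schedules import SingleMemoryStorageSchedule

continue_annotation()
tape = get_working_tape()
# Point 2 above: store the forward data needed by the adjoint in memory.
# Comment this line out to recover point 1 (default annotation, no schedule).
tape.enable_checkpointing(SingleMemoryStorageSchedule())

mesh = UnitSquareMesh(16, 16)
V = FunctionSpace(mesh, "CG", 1)
x, y = SpatialCoordinate(mesh)

T0 = Function(V, name="T_ic").interpolate(sin(pi * x) * sin(pi * y))
T = Function(V, name="T").assign(T0)
T_new = Function(V)
v = TestFunction(V)
dt = Constant(0.01)
F = ((T_new - T) / dt * v + inner(grad(T_new), grad(v))) * dx

# The time loop is wrapped in tape.timestepper so the schedule is applied.
for step in tape.timestepper(iter(range(10))):
    solve(F == 0, T_new)
    T.assign(T_new)

J = assemble(T * T * dx)
pause_annotation()

Jhat = ReducedFunctional(J, Control(T0))
h = Function(V).interpolate(Constant(0.01) * x * y)  # arbitrary perturbation
print(taylor_test(Jhat, T0, h))  # expect rates of ~2.0
```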

So the SingleMemoryStorageSchedule seems to behave as expected. The issues arise with disk checkpointing:

  1. If I run a standard Taylor test, now with only "enable_disk_checkpointing" triggered, convergence remains at 2.0.
  2. If I modify this case to also use SingleDiskStorageSchedule, as advised, my convergence rate drops. In the reproducer I have, it drops to ~1.9, but I have seen other cases go below 1.3. (A sketch of this configuration follows the list.)
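
As far as I understand it, the two disk configurations differ from the memory sketch above only in the setup lines. A fragment showing just that delta (again with a placeholder mesh, and function names taken from Firedrake/checkpoint_schedules):

```python
from firedrake import *
from firedrake.adjoint import *
from checkpoint_schedules import SingleDiskStorageSchedule

continue_annotation()
# Point 1 above: write checkpoint data to disk rather than holding it in memory.
enable_disk_checkpointing()
tape = get_working_tape()
# Point 2 above: additionally schedule the adjoint's forward data onto disk.
# This is the line whose addition degrades my Taylor-test rates.
tape.enable_checkpointing(SingleDiskStorageSchedule())

# With disk checkpointing enabled, the mesh must be made checkpointable.
mesh = checkpointable_mesh(UnitSquareMesh(16, 16))
# ... forward model, functional and taylor_test exactly as in the sketch above.
```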

This issue with SingleDiskStorageSchedule appears in both serial and parallel runs, but it is difficult to capture in a minimal reproducer -- I have tried all day with no luck. With that in mind, I share my full setup here:

Serial -- Disk Checkpointing Enabled with SingleDiskStorageSchedule: https://www.dropbox.com/scl/fi/6gmurfrlkj2hairiscijf/Taylor_Serial_SingleDiskCheckpointingSchedule.tar.gz?rlkey=lcs4slwgc12fshbw5s24xsch9

Serial -- Disk Checkpointing Enabled with no Checkpoint Schedule specified: https://www.dropbox.com/scl/fi/z7e1d1xym6kkuzi8iwayx/Taylor_Serial_EnableDiskCheckpointing.tar.gz?rlkey=myd05wssurg808uurfrozp5pd

Serial -- SingleMemoryStorageSchedule: https://www.dropbox.com/scl/fi/e7puz36orfbwn2oq1p6mj/Taylor_Serial_SingleMemoryCheckpointingSchedule.tar.gz?rlkey=gfby5hum7qmyewkhih1fv5up2

Note that convergence rates are printed at the end of the output.dat file in each folder. Aside from the changes to memory/disk checkpointing, the cases are otherwise identical.

To run these cases, it's as simple as python inverse.py &> output.dat -- they pick up from a checkpoint, which is also included in each folder.

For me, each case takes roughly an hour to run. I know this isn't ideal, but attempts to reproduce the problem in a simpler setup have failed (which I find difficult to understand!).
