Description
Following discussion on the G-ADOPT GitHub page, I am porting this issue across to pyadjoint on the advice of @Ig-dolci. For background, see the G-ADOPT discussion here -- g-adopt/g-adopt#190.
I have tested a simpler reproducer of our typical mantle convection adjoint cases in 2-D, where we invert for an unknown initial condition. Good news and bad news:
- If I run a standard Taylor test for our typical style of case with all default checkpointing options (i.e. in memory -- nothing written to disk), I get convergence rates of 2.0 as expected. Good!
- If I take this case and add SingleMemoryStorageSchedule, convergence remains at 2.0 and there is a substantial speedup (see the sketch below for how the schedule is hooked up). Again good!
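For reference, this is roughly the pattern I mean by "adding SingleMemoryStorageSchedule" -- a minimal sketch assuming the usual tape.enable_checkpointing / tape.timestepper pattern, with a placeholder mesh and a trivial stand-in for the forward timestep rather than the actual G-ADOPT setup:

```python
from firedrake import *
from firedrake.adjoint import *
from checkpoint_schedules import SingleMemoryStorageSchedule

continue_annotation()

mesh = UnitSquareMesh(32, 32)  # placeholder mesh
V = FunctionSpace(mesh, "CG", 1)
u = Function(V)

tape = get_working_tape()
# Keep a single in-memory copy of the forward state rather than
# retaining every intermediate state on the tape.
tape.enable_checkpointing(SingleMemoryStorageSchedule())

total_steps = 10  # placeholder number of timesteps
for step in tape.timestepper(iter(range(total_steps))):
    u.assign(u + 1.0)  # placeholder for the real forward timestep
```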
So the SingleMemoryStorageSchedule seems to behave as expected. The issues arise with disk checkpointing:
- If I run a standard Taylor test, now with only "enable_disk_checkpointing" triggered, convergence remains at 2.0.
- If I modify this case to also use SingleDiskStorageSchedule, as advised, my convergence rate drops (see the sketch after this list). In the reproducer I have, it drops to ~1.9, but I have seen other cases go below 1.3.
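For completeness, the disk-checkpointing variant is set up along these lines -- again a minimal sketch with a placeholder mesh, assuming the enable_disk_checkpointing / checkpointable_mesh pattern from the Firedrake adjoint documentation:

```python
from firedrake import *
from firedrake.adjoint import *
from checkpoint_schedules import SingleDiskStorageSchedule

continue_annotation()

# Store checkpointed function data on disk rather than in memory.
enable_disk_checkpointing()
mesh = checkpointable_mesh(UnitSquareMesh(32, 32))  # placeholder mesh

tape = get_working_tape()
# This is the combination that degrades the Taylor-test rate for me;
# with disk checkpointing alone (no schedule) I still get 2.0.
tape.enable_checkpointing(SingleDiskStorageSchedule())
```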
This issue with SingleDiskStorageSchedule appears in both serial and parallel runs, but it has proven difficult to capture in a minimal reproducer. I have tried all day with no luck. With that in mind, I share my setup here:
- Serial -- Disk Checkpointing Enabled with SingleDiskCheckpointingSchedule: https://www.dropbox.com/scl/fi/6gmurfrlkj2hairiscijf/Taylor_Serial_SingleDiskCheckpointingSchedule.tar.gz?rlkey=lcs4slwgc12fshbw5s24xsch9
- Serial -- Disk Checkpointing Enabled with no Checkpoint Schedule specified: https://www.dropbox.com/scl/fi/z7e1d1xym6kkuzi8iwayx/Taylor_Serial_EnableDiskCheckpointing.tar.gz?rlkey=myd05wssurg808uurfrozp5pd
- Serial -- SingleMemoryCheckpointingSchedule: https://www.dropbox.com/scl/fi/e7puz36orfbwn2oq1p6mj/Taylor_Serial_SingleMemoryCheckpointingSchedule.tar.gz?rlkey=gfby5hum7qmyewkhih1fv5up2
Note convergence rates are printed at the end of the output.dat files in each folder. Aside from the changes to memory/checkpointing, cases are otherwise identical.
To run these cases it's as simple as: python inverse.py &> output.dat -- they pick up from a checkpoint file, which is also included in each folder.
For me, the cases take roughly an hour to run. I know this isn't ideal, but attempts to reproduce the problem in a simpler setup have failed (which I find difficult to understand!).
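For anyone skimming, the convergence rates at the end of output.dat come from pyadjoint's taylor_test. A minimal, self-contained sketch of that final check on a toy functional -- the mesh, functional and control here are placeholders, not the ones in inverse.py:

```python
from firedrake import *
from firedrake.adjoint import *
import numpy as np

continue_annotation()

# Toy stand-in for the real inverse problem: the control is an initial
# field m0 and the functional is a simple quadratic in the state.
mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, "CG", 1)
m0 = Function(V).interpolate(SpatialCoordinate(mesh)[0])
u = Function(V).assign(m0)
J = assemble(u * u * dx)

Jhat = ReducedFunctional(J, Control(m0))

# Random perturbation direction for the Taylor test.
h = Function(V)
h.dat.data[:] = np.random.random(h.dat.data.shape)

# The reported convergence rate should approach 2.0 when the
# adjoint-derived gradient is consistent with the forward model.
minconv = taylor_test(Jhat, m0, h)
print(minconv)
```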