Memory Growth and Unexpected Behaviour in Firedrake Adjoint #4014

Open
sghelichkhani opened this issue Feb 7, 2025 · 4 comments · May be fixed by #4020

@sghelichkhani
Contributor

We are debugging a large-scale Stokes optimisation problem that eventually runs out of memory (g-adopt/g-adopt#160). Since we are dealing with millions of degrees of freedom, we rely on checkpointing to disk to manage memory. While testing smaller reproducer cases, we see unexpected memory growth even after the tape is generated and throughout forward and backward passes.

What We Expected vs. What’s Happening
Expected: Once the tape is populated, and after the first calls to ReducedFunctional.__call__ and ReducedFunctional.derivative, memory usage should stay constant.
Actual: Memory increases steadily with every forward and derivative call.
• In our minimal reproducer, checkpointing to disk actually increases memory usage.

Minimal Reproducer

Using memory_profiler, I profile repeated calls to ReducedFunctional.__call__ and ReducedFunctional.derivative. Simply run mprof run ...
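As a rough complement to the mprof plot, the peak resident set size can also be logged after each call. This is only a minimal sketch using the Unix-only standard-library module resource; the helper names peak_rss_mib and log_peak_rss are hypothetical and not part of Firedrake or memory_profiler.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_peak_rss(step, repeats=5):
    """Call `step` repeatedly and return the peak RSS (MiB) after each call."""
    readings = []
    for i in range(repeats):
        step()
        readings.append(peak_rss_mib())
        print(f"call {i}: peak RSS {readings[-1]:.1f} MiB")
    return readings
```

With the reproducer from this issue, step could be something like lambda: (rf(T_c), rf.derivative()). Because this is a peak value it never decreases, so a plateau suggests stable memory use while a value that keeps rising with every call suggests growth.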

Code for Reproduction

from firedrake import *
from firedrake.adjoint import *
import gc

def test():
    T_c, rf = rf_generator()

    # Replay the tape and compute the gradient several times;
    # memory usage is expected to stay flat across iterations.
    for _ in range(5):
        gc.collect()
        rf(T_c)  # forward pass (ReducedFunctional.__call__)
        gc.collect()
        rf.derivative()  # backward (adjoint) pass

def rf_generator():
    # Start from an empty tape and record the forward model.
    tape = get_working_tape()
    tape.clear_tape()
    continue_annotation()
    enable_disk_checkpointing()  # comment out to run without checkpointing to disk

    mesh = RectangleMesh(100, 100, 1.0, 1.0)
    mesh = checkpointable_mesh(mesh)

    V = VectorFunctionSpace(mesh, "CG", 2)
    Q = FunctionSpace(mesh, "CG", 1)

    X = SpatialCoordinate(mesh)
    w = Function(V, name="rotation").interpolate(as_vector([-X[1] - 0.5, X[0] - 0.5]))
    T_c = Function(Q, name="control")
    T = Function(Q, name="Temperature")

    # Gaussian initial condition used as the control.
    T_c.interpolate(0.1 * exp(-0.5 * ((X - as_vector((0.75, 0.5))) / Constant(0.1)) ** 2))
    control = Control(T_c)
    T.assign(T_c)

    # 20 explicit update steps, each recorded on the tape.
    for _ in range(20):
        T.interpolate(T + inner(grad(T), w) * Constant(0.0001))

    objective = assemble(T**2 * dx)

    pause_annotation()
    return T_c, ReducedFunctional(objective, control)

if __name__ == "__main__":
    test()

Without checkpointing to disk:
Image

With checkpointing to disk:
Image

Am I missing something here, or is there an actual leak, especially when checkpointing to disk?
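One way to narrow down where such growth comes from is to measure how much Python-level memory each call retains. The following is a minimal sketch using the standard-library tracemalloc module; the helper name traced_growth is hypothetical. An important caveat: tracemalloc only sees allocations made through Python's allocators, so memory allocated by native libraries such as PETSc is generally not visible here. Flat numbers from this helper combined with a growing process RSS would therefore point towards native allocations.

```python
import gc
import tracemalloc


def traced_growth(step, repeats=5):
    """Call `step` repeatedly; return the traced-memory change per call in bytes."""
    tracemalloc.start()
    try:
        step()  # warm-up call so one-off allocations (e.g. caches) are not counted
        gc.collect()
        previous, _ = tracemalloc.get_traced_memory()
        deltas = []
        for _ in range(repeats):
            step()
            gc.collect()  # drop collectable garbage before measuring
            current, _ = tracemalloc.get_traced_memory()
            deltas.append(current - previous)
            previous = current
        return deltas
    finally:
        tracemalloc.stop()
```

With the reproducer above, step could be lambda: (rf(T_c), rf.derivative()). Deltas that stay close to zero would suggest that Python objects are not accumulating between calls.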

@connorjward
Contributor

@Ig-dolci, you have done a lot of work investigating this sort of thing. Do you have any suggestions?

@colinjcotter
Contributor

Does it still happen if you replace interpolate with project (which uses a solver)?

@Ig-dolci Ig-dolci self-assigned this Feb 7, 2025
@Ig-dolci
Contributor

Ig-dolci commented Feb 7, 2025

I will check that.

@sghelichkhani
Contributor Author

Thanks @Ig-dolci for looking into this. @colinjcotter, I see the same behaviour with project. The reproducer is here: https://github.com/g-adopt/g-adopt/blob/adjoint-memory/demos/mantle_convection/test/tester.py

Without checkpointing to disk:
Image

With checkpointing to disk:
Image

The g-adopt problem in which we are seeing this is https://github.com/g-adopt/g-adopt/blob/adjoint-memory/demos/adjoint_spherical/adjoint.py. It essentially time-steps through a Stokes problem, so it consists almost entirely of solves, with a few projections.
