Replies: 2 comments 1 reply
-
vjp seems to be the closest thing to what you are asking. It returns your loss value as well as a function that can calculate vector-Jacobian products (VJPs). As noted in the docs, the gradient is a special case of a VJP, so you can use the returned function to compute the gradient. The problem for your use case is that vjp does not directly expose the intermediates. I am not sure whether you can save the returned function directly (e.g. with pickle or another tool), but this can be sidestepped: the function returned by vjp appears to be a partial function, with some of its arguments also being partial functions themselves. You can rewrite the vjp so that it returns a stateless function and its fixed arguments separately; then you can save the fixed arguments and later restore them to call the stateless function.
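A minimal sketch of that idea, assuming the callable returned by `jax.vjp` keeps behaving like a `functools.partial` (an internal detail that may change between JAX versions): split it into its underlying function and its captured arguments, save the arguments, and reassemble later.

```python
import jax
import jax.numpy as jnp

def loss_fn(x):
    return jnp.sum(jnp.sin(x) ** 2)

x = jnp.arange(3.0)
loss, vjp_fn = jax.vjp(loss_fn, x)

# The returned callable currently behaves like a functools.partial:
# `.func` is the "stateless" part, `.args`/`.keywords` hold the captured state
# (including the intermediates, wrapped in further partials).
stateless_fn = vjp_fn.func
saved_args = vjp_fn.args          # these are what you would try to save/restore
saved_kwargs = vjp_fn.keywords

# Later: rebuild the callable from the stateless function and the saved state,
# then use it like the original vjp function (gradient = VJP with cotangent 1).
restored_vjp = lambda ct: stateless_fn(*saved_args, ct, **saved_kwargs)
(grad_x,) = restored_vjp(jnp.ones_like(loss))
```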
-
Building on the answer by @lamflokas, you are essentially looking for something like the following:

```python
import jax
import jax.numpy as jnp
import numpy as np

@jax.jit
def f(x, y):
    return jnp.cos(x) * (1 + jnp.sin(y))

@jax.jit
def f_fwd(x, y):
    z, vjp = jax.vjp(f, x, y)
    # extract the constants from the bound VJP
    intermediate_values = vjp.args[0].args[0]
    return z, intermediate_values

# somehow gain access to the `unbound_vjp`
_, abstract_vjp = jax.vjp(f, 0., 0.)

@jax.jit
def f_bwd(intermediate_values, ct):
    # re-assemble constants/intermediate values into a bound VJP
    vjp = type(abstract_vjp)(
        abstract_vjp.func,
        type(abstract_vjp.args[0])(
            abstract_vjp.args[0].func,
            intermediate_values,
            *abstract_vjp.args[0].args[1:],
        ),
        *abstract_vjp.args[1:],
    )
    return vjp(ct)

# demonstration
for x in np.linspace(0, np.pi, 31):
    for y in np.linspace(0, np.pi, 31):
        ctgt = 1.
        # forward pass in normal `jax.vjp`; you get access to the backward
        # pass function only *after* you pass in the input
        ve, vjp = jax.vjp(f, x, y)
        # backward pass
        gxe, gye = vjp(ctgt)
        # compiled forward pass returning the "intermediate values" for later
        # re-use in a backward pass
        va, ivals = f_fwd(x, y)
        # pre-compiled backward pass using those intermediate values
        gxa, gya = f_bwd(ivals, ctgt)
        # verify we get the same results
        expected = np.r_[ve, gxe, gye]
        actual = np.r_[va, gxa, gya]
        assert np.allclose(actual, expected), \
            f"{x=:.3f} {y=:.3f}: {np.round(expected, 3)} <=> {np.round(actual, 3)} ({ivals})"
```

The challenge here is that the structure of the bound VJP (the `vjp.args[0].args[0]` access) is an internal implementation detail of JAX, so this layout is not guaranteed to stay the same across versions.
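One way to avoid depending on that layout, sketched here under the assumption that `jax.closure_convert` hoists the arrays captured by the VJP closure, is to let JAX pull the residuals out as an explicit list instead of reaching into the nested partials by hand:

```python
import jax
import jax.numpy as jnp
import numpy as np

def f(x, y):
    return jnp.cos(x) * (1 + jnp.sin(y))

def f_fwd(x, y):
    z, vjp = jax.vjp(f, x, y)
    # closure_convert hoists the JAX arrays closed over by `vjp` into
    # `residuals`; `pure_vjp` is a function of (cotangent, *residuals) only.
    pure_vjp, residuals = jax.closure_convert(vjp, z)
    return z, residuals, pure_vjp   # residuals are plain arrays -> easy to save

def f_bwd(pure_vjp, residuals, ct):
    return pure_vjp(ct, *residuals)

# check against a plain jax.vjp round trip
x, y = jnp.float32(0.3), jnp.float32(0.7)
z, residuals, pure_vjp = f_fwd(x, y)
ct = jnp.ones_like(z)
gxa, gya = f_bwd(pure_vjp, residuals, ct)
_, vjp_ref = jax.vjp(f, x, y)
gxe, gye = vjp_ref(ct)
assert np.allclose([gxa, gya], [gxe, gye])
```

The hoisted residuals are ordinary arrays, so they are straightforward to store or move between devices or processes.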
-
Say I have a model forward function `loss_fn`, which takes the input data and returns the computed loss. Currently JAX is able to perform the following:

- `loss_fn(input)` computes the forward function
- `grad(loss_fn)(input)` computes the gradient at the input, and also evaluates `loss_fn`
- `value_and_grad(loss_fn)(input)` returns both the loss and the gradients, and also evaluates `loss_fn`

But now I would like to separate the execution of the forward and backward computation: evaluate the function first, and then use the intermediate variables as well as the loss to compute the backward pass. Is it possible to save all the activations of the function, and avoid re-evaluating the original function, by passing/storing the computed activations to perform backward propagation?
Something like the following would be helpful, especially when training models in a pipeline parallel fashion:
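Roughly, a split along these lines, sketched here with `jax.vjp` as the closest existing mechanism (the `pullback` closure is what holds the activations, but it is not obvious how to store it or ship it between processes, which is exactly the problem):

```python
import jax
import jax.numpy as jnp

def forward(loss_fn, *inputs):
    # run the forward pass once; `pullback` carries the saved activations
    loss, pullback = jax.vjp(loss_fn, *inputs)
    return loss, pullback

def backward(pullback, cotangent):
    # possibly much later / on another pipeline stage: reuse the activations
    return pullback(cotangent)

def loss_fn(x):
    return jnp.sum(x ** 2)

loss, pullback = forward(loss_fn, jnp.arange(3.0))
(grad_x,) = backward(pullback, jnp.ones_like(loss))
```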