Skip to content

Kubernetes Probes #142

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 47 commits into
base: main
Choose a base branch
from
Open

Conversation

0xForerunner
Copy link
Collaborator

This PR is dependent on #141
You can view the diff from that PR here

Health, readiness, liveness checks layer added.

  • healthz determines wether everything is 100% up and running. If the builder fails to produce paylaods the healthz endpoint will return an error. op-conductor should eventually be able to use this signal to switch to a different sequencer in an HA sequencer setup.
  • livez currently just exists to make sure we're serving any requests at all. If this ever returns an error at any point, kubernetes can use this signal to restart the pod.
  • readyz returns true as long as we're effectively building payloads from the l2 client. This means that we still build blocks with this instance of rollup-boost.

Copy link

vercel bot commented Mar 19, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
rollup-boost ⬜️ Ignored (Inspect) Visit Preview Apr 26, 2025 0:26am

@avalonche
Copy link
Collaborator

Can you add docs on how this works with op-conductor? from the description it seems like each sequencer in the HA setup has 1 builder. how would op-conductor behave if every builder was down and not healthy?

@0xForerunner
Copy link
Collaborator Author

from the description it seems like each sequencer in the HA setup has 1 builder

Yeah this seems to be the typical setup. A 1 to many builder-sequencer relationship is undefined behaviour according to OP. So you need 1:1 sequencer builder.

from the description it seems like each sequencer in the HA setup has 1 builder

Currently op conductor doesn't have any logic for this. Once we have the appropriate probes here we can try to get op-conductor support.

how would op-conductor behave if every builder was down and not healthy

If all 3 are not healthy then we're in a pretty bad state. I suppose a random instance would be chosen to be the primary sequencer in that case.

@0xForerunner
Copy link
Collaborator Author

@avalonche @0xOsiris I've added some docs.

I'm second guessing my use of the /readyz probe here. I'm sort of imparting some extra meaning to it that isn't typically expected. I think perhaps a better way to accomplish this would be to change the response status codes from /healthz.

/healthz responses:

  • 200 OK - healthy
  • 206 Partial Content - l2 creating blocks, builder is down
  • 503 Service Unavailable - We're not building any blocks

What do you think?

Copy link

@angel-ding-cb angel-ding-cb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a different design for rollup boost integration into sequencer HA, which doesn't require this change.

The full TDD is here, feel free to check it out.

I have a longer response in the op conductor discord channel (tagged you already). The above TDD is only for rollup boost integration. It's not for builder HA

Copy link
Collaborator

@avalonche avalonche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, only thing is that I would add in the docs /healthz returns 206 and 503 after the builder / l2 fails to produce a block only once and that endpoint will still return 200 if the builder is up but the local l2 is not

@teddyknox
Copy link

FYI the op-conductor change ethereum-optimism/optimism#15316 depends on this PR now.

@teddyknox
Copy link

lgtm, only thing is that I would add in the docs /healthz returns 206 and 503 after the builder / l2 fails to produce a block only once and that endpoint will still return 200 if the builder is up but the local l2 is not

This should probably be debounced locally if this is expected to happen frequently.

@zhwrd
Copy link

zhwrd commented Apr 24, 2025

Have been integration testing this with our conductor rollup-boost monitoring PR

It seems the /healthz response is sticky which I dont think will work great for conductor, for ex:

  1. sequencer -> rollup-boost -> rbuilder are all healthy, actively sequencing
  2. stop rbuilder (simulating deployment or failure)
  3. rollup-boost reports partial health, triggers conductor healthcheck failure and leadership transfer
  4. start rbuilder (expecting rollup-boost would pick this up and start reporting 200)
  5. rollup-boost still reports partial health, conductor stuck reporting the sequencer as unhealthy

Similarly, you can kill r-builder on a non-active sequencer and conductor will report is as healthy.

Is there any way we can move the health probe to background so we get async rbuilder health updates?

@0xOsiris
Copy link
Collaborator

Have been integration testing this with our conductor rollup-boost monitoring PR

It seems the /healthz response is sticky which I dont think will work great for conductor, for ex:

  1. sequencer -> rollup-boost -> rbuilder are all healthy, actively sequencing
  2. stop rbuilder (simulating deployment or failure)
  3. rollup-boost reports partial health, triggers conductor healthcheck failure and leadership transfer
  4. start rbuilder (expecting rollup-boost would pick this up and start reporting 200)
  5. rollup-boost still reports partial health, conductor stuck reporting the sequencer as unhealthy

Similarly, you can kill r-builder on a non-active sequencer and conductor will report is as healthy.

Is there any way we can move the health probe to background so we get async rbuilder health updates?

Yeah this makes sense - this is because the health probe is only updated during a get_payload call which would only be instantiated when op-node is in a sequencing state. So - you are right after a failover happens partial health will never turn back to healthy. Will see how we can fix this!

@0xOsiris
Copy link
Collaborator

0xOsiris commented Apr 27, 2025

@zhwrd @teddyknox Thanks for the callout on the sticky health status. I've added an additional background health check to the rollup-boost server that continuously monitors unsafe head progression of the builder which should functionally work the same as the health check op-conductor is performing to ensure the unsafe head is progressing within CONDUCTOR_HEALTH_CHECK_UNSAFE_INTERVAL.

This should resolve the issue of the health status not being updated on non-sequencing EL's. In the sequencing case we have now have 2 health checks running in parallel.

  • The builder is returning valid payloads
  • The builder is synced

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants