Kubernetes Probes #142

0xForerunner · 2025-03-19T02:37:09Z

This PR is dependent on #141
You can view the diff from that PR here

Adds a layer for health, readiness, and liveness checks.

healthz determines whether everything is 100% up and running. If the builder fails to produce payloads, the healthz endpoint returns an error. op-conductor should eventually be able to use this signal to switch to a different sequencer in an HA sequencer setup.
livez currently just exists to make sure we're serving any requests at all. If it ever returns an error, Kubernetes can use this signal to restart the pod.
readyz returns true as long as we're effectively building payloads from the L2 client. This means that we still build blocks with this instance of rollup-boost.

avalonche · 2025-03-24T23:44:28Z

Can you add docs on how this works with op-conductor? From the description, it seems like each sequencer in the HA setup has 1 builder. How would op-conductor behave if every builder was down and not healthy?

0xForerunner · 2025-03-25T06:25:17Z

> from the description it seems like each sequencer in the HA setup has 1 builder

Yeah, this seems to be the typical setup. A one-to-many builder-sequencer relationship is undefined behaviour according to OP, so you need a 1:1 sequencer-to-builder mapping.

> from the description it seems like each sequencer in the HA setup has 1 builder

Currently op-conductor doesn't have any logic for this. Once we have the appropriate probes here, we can try to get op-conductor support.

> how would op-conductor behave if every builder was down and not healthy

If all 3 are unhealthy then we're in a pretty bad state. I suppose a random instance would be chosen as the primary sequencer in that case.

0xForerunner · 2025-03-28T21:29:39Z

@avalonche @0xOsiris I've added some docs.

I'm second-guessing my use of the /readyz probe here. I'm imparting some extra meaning to it that isn't typically expected. I think a better way to accomplish this would be to change the response status codes from /healthz.

/healthz responses:

200 OK - healthy
206 Partial Content - L2 is creating blocks, builder is down
503 Service Unavailable - we're not building any blocks

What do you think?
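
For illustration, the three proposed /healthz responses could be modelled roughly as below. This is a minimal sketch, not the code in this PR: the `Health` enum, its variant names, and the `classify` helper are hypothetical, and treating "builder up, L2 down" as healthy is an assumption.

```rust
/// Hypothetical health states, one per proposed /healthz response.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    /// Everything is up and producing payloads.
    Healthy,
    /// The L2 client is creating blocks, but the builder is down.
    PartialContent,
    /// No blocks are being built at all.
    ServiceUnavailable,
}

impl Health {
    /// HTTP status code that /healthz would return for this state.
    pub fn status_code(self) -> u16 {
        match self {
            Health::Healthy => 200,
            Health::PartialContent => 206,
            Health::ServiceUnavailable => 503,
        }
    }
}

/// Derive the state from two observations (illustrative parameter names).
/// Assumption: a working builder alone is enough to report healthy.
pub fn classify(builder_ok: bool, l2_ok: bool) -> Health {
    match (builder_ok, l2_ok) {
        (true, _) => Health::Healthy,
        (false, true) => Health::PartialContent,
        (false, false) => Health::ServiceUnavailable,
    }
}
```

A consumer such as a load balancer or op-conductor would then only need to look at the status code, for example `classify(false, true).status_code()` for the "builder is down" case.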

angel-ding-cb

We have a different design for rollup-boost integration into sequencer HA, which doesn't require this change.

The full TDD is here, feel free to check it out.

I have a longer response in the op-conductor Discord channel (tagged you already). The above TDD is only for rollup-boost integration. It's not for builder HA.

avalonche

LGTM. The only thing I would add to the docs is that /healthz returns 206 or 503 after the builder or L2 fails to produce a block just once, and that the endpoint will still return 200 if the builder is up but the local L2 is not.

teddyknox · 2025-04-24T16:56:28Z

FYI the op-conductor change ethereum-optimism/optimism#15316 depends on this PR now.

teddyknox · 2025-04-24T16:59:31Z

> lgtm, only thing is that I would add in the docs /healthz returns 206 and 503 after the builder / l2 fails to produce a block only once and that endpoint will still return 200 if the builder is up but the local l2 is not

This should probably be debounced locally if this is expected to happen frequently.
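
One way to read "debounced" here is to report unhealthy only after several consecutive failures. The sketch below is an assumption about what that could look like, not something implemented in this PR; the `Debouncer` type and its `threshold` parameter are hypothetical.

```rust
/// Hypothetical debouncer: unhealthy is reported only after `threshold`
/// consecutive failed observations; a single success resets the count.
pub struct Debouncer {
    threshold: u32,
    consecutive_failures: u32,
}

impl Debouncer {
    /// `threshold` is clamped to at least 1 so one failure can still trip it.
    pub fn new(threshold: u32) -> Self {
        Self {
            threshold: threshold.max(1),
            consecutive_failures: 0,
        }
    }

    /// Record one observation and return whether we still count as healthy.
    pub fn observe(&mut self, ok: bool) -> bool {
        if ok {
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures = self.consecutive_failures.saturating_add(1);
        }
        self.consecutive_failures < self.threshold
    }
}
```

With a threshold of 3, two missed blocks in a row would still report healthy and the third would flip the status. The trade-off is slower detection of a real outage.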

zhwrd · 2025-04-24T17:29:08Z

Have been integration testing this with our conductor rollup-boost monitoring PR

It seems the /healthz response is sticky, which I don't think will work great for conductor. For example:

1. sequencer -> rollup-boost -> rbuilder are all healthy, actively sequencing
2. stop rbuilder (simulating deployment or failure)
3. rollup-boost reports partial health, triggers conductor healthcheck failure and leadership transfer
4. start rbuilder (expecting rollup-boost would pick this up and start reporting 200)
5. rollup-boost still reports partial health, conductor stuck reporting the sequencer as unhealthy

Similarly, you can kill rbuilder on a non-active sequencer and conductor will report it as healthy.

Is there any way we can move the health probe to the background so we get async rbuilder health updates?

0xOsiris · 2025-04-24T17:32:28Z

> Have been integration testing this with our conductor rollup-boost monitoring PR
>
> It seems the /healthz response is sticky, which I don't think will work great for conductor. For example:
>
> 1. sequencer -> rollup-boost -> rbuilder are all healthy, actively sequencing
> 2. stop rbuilder (simulating deployment or failure)
> 3. rollup-boost reports partial health, triggers conductor healthcheck failure and leadership transfer
> 4. start rbuilder (expecting rollup-boost would pick this up and start reporting 200)
> 5. rollup-boost still reports partial health, conductor stuck reporting the sequencer as unhealthy
>
> Similarly, you can kill rbuilder on a non-active sequencer and conductor will report it as healthy.
>
> Is there any way we can move the health probe to the background so we get async rbuilder health updates?

Yeah, this makes sense. The health probe is only updated during a get_payload call, which is only made when op-node is in a sequencing state. So you are right: after a failover happens, partial health will never turn back to healthy. Will see how we can fix this!

0xOsiris · 2025-04-27T21:01:24Z

@zhwrd @teddyknox Thanks for the callout on the sticky health status. I've added a background health check to the rollup-boost server that continuously monitors unsafe head progression of the builder. It should work functionally the same as the health check op-conductor performs to ensure the unsafe head is progressing within CONDUCTOR_HEALTH_CHECK_UNSAFE_INTERVAL.

This should resolve the issue of the health status not being updated on non-sequencing ELs. In the sequencing case we now have two health checks running in parallel, verifying that:

The builder is returning valid payloads
The builder is synced
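
As a rough sketch of the background check described above (not the code merged in this PR), the staleness test and the shared status might look like the following. `head_is_fresh`, `BuilderHealth`, `tick`, and `max_age_secs` are hypothetical names; fetching the builder's unsafe head and scheduling the loop are left out.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// True when the builder's unsafe head is at most `max_age_secs` old.
/// Timestamps are Unix seconds; a head ahead of `now` counts as fresh.
pub fn head_is_fresh(head_timestamp: u64, now: u64, max_age_secs: u64) -> bool {
    now.saturating_sub(head_timestamp) <= max_age_secs
}

/// Hypothetical shared status: written by a background task,
/// read by the /healthz handler.
#[derive(Clone)]
pub struct BuilderHealth {
    healthy: Arc<AtomicBool>,
    max_age_secs: u64,
}

impl BuilderHealth {
    /// Starts out healthy, mirroring the "default to healthy status" commit.
    pub fn new(max_age_secs: u64) -> Self {
        Self {
            healthy: Arc::new(AtomicBool::new(true)),
            max_age_secs,
        }
    }

    /// One iteration of the background loop: overwrite the status from the
    /// latest observed head, so recovery is picked up without get_payload.
    pub fn tick(&self, head_timestamp: u64, now: u64) {
        let fresh = head_is_fresh(head_timestamp, now, self.max_age_secs);
        self.healthy.store(fresh, Ordering::Relaxed);
    }

    /// Value the /healthz handler would consult.
    pub fn is_healthy(&self) -> bool {
        self.healthy.load(Ordering::Relaxed)
    }
}
```

Because `tick` overwrites the status on every iteration, a builder that comes back after a restart is reported healthy again on the next pass, which is the non-sticky behaviour asked for earlier in the thread.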

0xForerunner added 12 commits March 14, 2025 15:23

wip

8e7a51b

wip

39ef56e

wip

e49f054

wip

f77ff56

clean things up

6249337

fix for cloned service

0641965

cleanup process_response

79bab4a

eyre bail

8ceabac

remove unnecessary deps

bf55d6e

Add kubernetes probe layer

b9b242e

implement health/ready check logic

121bb2b

modify ready logic

2e1ce26

0xForerunner added 3 commits March 18, 2025 19:40

fix comment/feature

04678f7

delete old file

7a3e0a0

working

0a1d9eb

0xForerunner and others added 8 commits March 24, 2025 23:26

Update src/client/http.rs

adbf5f3

Co-authored-by: shana <[email protected]>

Merge branch 'main' into forerunner/proxy

636f7e9

parse response cod

f5e8f69

clippy fix

7497301

Merge branch 'main' into forerunner/probes

fd321ac

Merge branch 'forerunner/proxy' into forerunner/probes

de15894

Merge branch 'main' into forerunner/probes

de3d03e

Probe docs

ddd6d9b

0xForerunner added 3 commits March 28, 2025 15:18

Switch to returning health status only from /healthz using http status codes

9b60991

Update docs to describe health status codes

74d723e

remove stray comments

2732c38

0xOsiris requested a review from 0xKitsune April 8, 2025 23:18

angel-ding-cb reviewed Apr 9, 2025

teddyknox mentioned this pull request Apr 9, 2025

Add rollup-boost monitoring to op-conductor ethereum-optimism/optimism#15316

Merged

0x00101010 mentioned this pull request Apr 21, 2025

docs: rollup-boost HA design #185

Merged

0xOsiris added 2 commits April 22, 2025 15:50

fix: default to healthy status

3b06bd9

merge main

fbfdb20

avalonche reviewed Apr 23, 2025

.dockerignore (outdated, resolved)

chore: fix dockerignore

6d8b40a

avalonche approved these changes Apr 23, 2025

ferranbt approved these changes Apr 24, 2025

0xOsiris added 13 commits April 24, 2025 12:45

feat: add background process to query block height as a health check for non-sequencing el's

a6a94ef

fix: signatures

bb07cdf

chore: update comments

c6855a4

chore: clippy

eae3273

test: add tests

200b407

fix: stress tests

c4d0c76

fix: change health check to check unsafe head progression on builder

f15e7c4

chore: update doc comments

1cd09aa

fix: loop

0db4e36

chore: update comments

7cc6c08

Merge pull request #13 from flashbots/osiris/background-health-check

7d5a863

Osiris/background health check

merge main

505ad2a

Merge branch 'main' into forerunner/probes

6d4afc0

0xOsiris merged commit 8f29ecf into flashbots:main Apr 29, 2025
5 checks passed

0xOsiris deleted the forerunner/probes branch April 29, 2025 16:21
