helm, info, status: Check and report reachability for scan and sequencer #6073

Draft
giner wants to merge 4 commits into canton-network:main from giner:stas/helm_info_status_check_connectivity

@giner

@giner giner commented Jun 22, 2026

Contributor

Possible status values:

  • Scan: 0 (reachable and not lagging), 1 (lagging), 2 (unreachable).
  • Sequencer: 0 (reachable and not lagging), 1 (lagging), 2 (unreachable), 3 (unreachable and lagging).
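One way to read the values above is as two independent flags, where lagging contributes 1 and unreachable contributes 2. The sketch below assumes that reading; the function names and the 0/1 argument convention are hypothetical and are not taken from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: names and arguments are illustrative, not from the PR.
# Each argument is 1 for "yes" and 0 for "no".

# Sequencer: lagging contributes 1, unreachable contributes 2, both give 3.
sequencer_status() {
  local reachable=$1 lagging=$2
  local status=0
  if [ "$lagging" -eq 1 ]; then status=$((status + 1)); fi
  if [ "$reachable" -eq 0 ]; then status=$((status + 2)); fi
  echo "$status"
}

# Scan: the listed values stop at 2, so unreachable takes precedence here.
scan_status() {
  local reachable=$1 lagging=$2
  if [ "$reachable" -eq 0 ]; then
    echo 2
  elif [ "$lagging" -eq 1 ]; then
    echo 1
  else
    echo 0
  fi
}
```

For example, `sequencer_status 0 1` prints 3, while `scan_status 0 1` prints 2.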

@giner giner force-pushed the stas/helm_info_status_check_connectivity branch from b376a39 to b4302e1 June 22, 2026 10:21
@giner giner force-pushed the stas/helm_info_status_check_connectivity branch from 96d24ab to 69a89d9 June 22, 2026 10:43
@giner giner changed the title from "helm, info: Check and report reachability for scan and sequencer" to "helm, info, status: Check and report reachability for scan and sequencer" Jun 22, 2026
giner added 2 commits June 23, 2026 16:19
Possible status values:
- Scan: 0 (reachable and not lagging), 1 (lagging), 2 (unreachable).
- Sequencer: 0 (reachable and not lagging), 1 (lagging), 2
  (unreachable), 3 (unreachable and lagging).

Signed-off-by: Stanislav German-Evtushenko <ginermail@gmail.com>
Signed-off-by: Stanislav German-Evtushenko <ginermail@gmail.com>
@giner giner force-pushed the stas/helm_info_status_check_connectivity branch from 69a89d9 to 01a3ced June 23, 2026 07:20
@isegall-da

Contributor

/cluster_test

@github-actions

Deploy cluster test triggered for Commit 01a3ced14e736c8d92fc1b7b9fb3d072cb36585b in , please contact a Contributor to approve it in CircleCI: https://app.circleci.com/pipelines/github/DACH-NY/canton-network-internal/71127

@isegall-da

Contributor

/cluster_test

@github-actions

Deploy cluster test triggered for Commit 01a3ced14e736c8d92fc1b7b9fb3d072cb36585b in , please contact a Contributor to approve it in CircleCI: https://app.circleci.com/pipelines/github/DACH-NY/canton-network-internal/null

@isegall-da

Contributor

/cluster_test

@github-actions

Deploy cluster test triggered for Commit 01a3ced14e736c8d92fc1b7b9fb3d072cb36585b in , please contact a Contributor to approve it in CircleCI: https://app.circleci.com/pipelines/github/DACH-NY/canton-network-internal/71265

@martinflorian-da

Contributor

@martinflorian-da martinflorian-da added the static label (used to label PRs for which static tests suffice) Jun 24, 2026

@martinflorian-da martinflorian-da left a comment

Contributor

I agree with Itai that this is hard to review. The downside of getting something wrong here is limited though IMO, so happy to rush this a bit to make the release cut tomorrow, and we can always fix things later on...

@giner I'll fix your merge conflict and set automerge now, so we get a little bit more testing before the release cut.

GRPCURL_DIST=$(mktemp)
GRPCURL_TMPDIR=$(mktemp -d)

echo "Downloading grpcurl..." >&2

Contributor

Can it happen that we do this multiple times in parallel if we run grpcurl in parallel? Can't we move that installation to some init step? I get that we don't want to change the image just for that... but that last-moment install seems a bit fishy.

Contributor

(happy for that to be a follow-up improvement, if you think it makes sense)

Contributor Author

You are right: while it doesn't break anything here, we do download in parallel. Thank you for catching that, I'll fix it.

Contributor Author

Fixed in e18d260

…_check_connectivity

[static]

Signed-off-by: Martin Florian <martin.florian@digitalasset.com>

@martinflorian-da martinflorian-da left a comment

Contributor

I actually need to retract my approval following some input from a watchful colleague ™️. Downloading remote binaries in a bash script is fairly sketchy from a supply chain security standpoint; at the very least I'd like us to review this with less haste.

How robust is that sha check against future code changes to this script, for example?

Perhaps the clean way is to use a different image that has grpcurl baked in?
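To make the sha check question concrete, here is a minimal sketch of a pinned-checksum verification that fails closed. It is not the check from the PR: `verify_sha256` is a hypothetical helper, and it assumes the coreutils `sha256sum` tool is available in the image.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify_sha256 FILE EXPECTED_HEX
# Returns 0 only if FILE's SHA-256 digest equals EXPECTED_HEX.
verify_sha256() {
  local file=$1 expected=$2 actual
  # sha256sum prints "<digest>  <file>"; keep only the digest.
  actual=$(sha256sum "$file" 2>/dev/null | awk '{print $1}')
  # An empty digest (missing or unreadable file) must never count as a match.
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

A check like this only stays robust if the pinned digest is kept next to the pinned version and the download URL, so that a later edit cannot change one without the other showing up in the same diff.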

@giner

giner commented Jun 24, 2026

Contributor Author

> I agree with Itai that this is hard to review.

Let me try to refactor it a bit in the next iteration (remove some verbose bash 3 compatibility stuff, split it into smaller, easier-to-read, self-describing pieces). If it's still hard to read after that, I'll look into rewriting it in Python.

@giner

giner commented Jun 24, 2026

Contributor Author

> Perhaps the clean way is to use a different image that has grpcurl baked in?

I couldn't find a good image containing all of jq, curl, prom2json and grpcurl. If downloading at runtime is a no-go, we'll have to manage another image.

Make sure only one grpcurl downloading job runs at a time.

Signed-off-by: Stanislav German-Evtushenko <ginermail@gmail.com>
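For illustration, one portable way to let only one of several parallel jobs perform an install step is an atomic `mkdir` lock. This is a sketch under assumptions, not the code from the commit above: `run_once`, the lock directory layout and the 60-second wait limit are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the code from the PR.
# run_once LOCK_DIR CMD [ARGS...]
# The first caller to create LOCK_DIR runs CMD; later callers wait for it.
run_once() {
  local lock_dir=$1
  shift
  if mkdir "$lock_dir" 2>/dev/null; then
    # mkdir is atomic: exactly one concurrent caller succeeds.
    local rc=0
    "$@" || rc=$?
    echo "$rc" > "$lock_dir/exit_code"
    return "$rc"
  fi
  # Another job owns the lock; poll until it records its exit code.
  local waited=0
  while [ ! -s "$lock_dir/exit_code" ]; do
    if [ "$waited" -ge 600 ]; then return 1; fi  # give up after about 60 s
    sleep 0.1
    waited=$((waited + 1))
  done
  return "$(cat "$lock_dir/exit_code")"
}
```

Callers that lose the race return the winner's exit code, so a failed download is reported by every job rather than only by the one that ran it. This only serializes the download; it does not address the concerns about fetching binaries at runtime.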
@moritzkiefer-da

Contributor

There are two somewhat orthogonal issues for me:

  1. Downloading tools at runtime is a hard no for me here.
  2. grpcurl itself is in a bit of a dodgy state. If you look through the issue tracker, there are a bunch of issues for dependencies with known vulnerabilities and no release that fixes them, and it gets flagged by most Docker image scanners. That's why we just removed it from the base image of all the actual apps. I'm not necessarily saying that those vulnerabilities apply to the way we use it here, but I don't even want to have to reason about this and explain it every time an auditor runs a scan.

I suspect that in most SV setups the info pod has direct access to things like participant admin APIs, so it is highly security-critical and we should definitely err on the side of being overly cautious.

If we can build an image containing https://github.com/grpc-ecosystem/grpc-health-probe to get the same info, that seems like a reasonable option. I don't really know any other alternative that is well maintained and doesn't get flagged. We could write some custom python program or similar but at that point, I'd also start to somewhat question whether the complexity is worth the ROI.
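As a rough sketch of what a grpc-health-probe based reachability check could look like: the wrapper below assumes the binary is installed as `grpc_health_probe` on the PATH and that the target exposes the standard gRPC health checking service without TLS. `grpc_reachable` and `PROBE_CMD` are hypothetical names, and this covers only reachability, not lag.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper; PROBE_CMD and the function name are illustrative.
PROBE_CMD=${PROBE_CMD:-grpc_health_probe}

# grpc_reachable HOST:PORT
# Prints 1 if the probe exits 0 (service reported SERVING), otherwise 0.
grpc_reachable() {
  local addr=$1
  if "$PROBE_CMD" -addr="$addr" -connect-timeout=2s -rpc-timeout=2s >/dev/null 2>&1; then
    echo 1
  else
    echo 0
  fi
}
```

Keeping the probe command in a variable makes it easy to substitute a stub when testing the surrounding script without a live endpoint.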

@giner

giner commented Jun 25, 2026

Contributor Author

Thank you @moritzkiefer-da. Would it work if, for now, I just replace grpcurl with grpc-health-probe but keep downloading at runtime (we already do this for prom2json in the same script), and then over the next few iterations refactor the script for readability and add a Docker image?

@moritzkiefer-da

Contributor

Given that I don't see us being in a particular rush to land this change, I'd rather do it the other way around: switch to a base image containing the tools and make the script more readable first, instead of adding more complexity and potentially risky downloads.

@giner

giner commented Jun 25, 2026

Contributor Author

Alright, converting to draft for now.

@giner giner marked this pull request as draft June 25, 2026 07:00
@martinflorian-da

Contributor

@giner To be clear: It seems reasonable that we add a splice-info image of some sort for this. The base image we use for all splice images already contains grpc-health-probe which should also work here: https://github.com/canton-network/canton-base-images

About this PR: You could also shrink it to only scan monitoring, and handle the sequencer parts in a follow-up 🤷

@giner

giner commented Jun 25, 2026

Contributor Author

@martinflorian-da scan is already monitored; in this PR I was only splitting its status into two distinct states. That's quite minor from a monitoring point of view, while having the sequencer monitored is more substantial.

As for the base image, yes, it has grpc-health-probe but is missing curl, jq and prom2json. I'm guessing we need a separate image for these?

6 participants