2025-03-03 ZeroTier - health checking - alternative proposal #38
This PR follows on from the extensive discussion associated with zyclonite#37.

Never before have I even *contemplated* submitting a PR covering the same ground as an existing open PR. However, on this occasion I thought it might be useful to have a concrete proposal to compare and contrast with zyclonite#37.

I sincerely hope that laying this on the (virtual) table and then minimising further interaction *might* help us converge on a solution.

<hr>

Changes:

* `docker-compose.yml` and `docker-compose-router.yml`:
    - replaces deprecated `version` statement with `---`.
    - adds example environment variables.
* `Dockerfile`:
    - corrects case of "as" to "AS" (silences build warning).
    - adds and configures `healthcheck.sh` (as per zyclonite#37).
    - includes `tzdata` package (moved from `Dockerfile.router`) so messages have local timestamps.
* `Dockerfile.router`:
    - removes `tzdata` (moved to `Dockerfile`).
* `entrypoint-router.sh`:
    - code for first launch auto join of listed networks expanded to include additional help material.
* `entrypoint.sh`:
    - "first launch" auto join of listed networks (code copied from `entrypoint-router.sh`, as modified per above).
    - "self repair" of permissions in persistent store (code copied from `entrypoint-router.sh`).
    - adds launch-time message to make it clear that the client is launching (complements messages in `entrypoint-router.sh`).
    - abstracts some common strings to environment variables (opportunistic change).
* `README.md`:
    - updates examples.
    - describes new environment variables (including move of `ZEROTIER_ONE_NETWORK_IDS` from `README-router.md`).
    - documents health-checking.
* `README-router.md`:
    - updates examples.
    - explains relationship of router and client.

Added:

* `healthcheck.sh`, based on original proposal in zyclonite#37 and subsequent suggestions for modification by me.

I gave serious consideration to the code for synchronising networks in the entry point scripts. The idea is quite attractive. It is safe to automate joins in a "clean slate" situation. However, a *leave* followed by a *join* is not guaranteed to be idempotent. That's because the *leave* destroys the network-specific configuration options (`allowManaged`, `allowGlobal`, `allowDefault`, `allowDNS`).

On balance I think it's better left to users to send explicit *leave* commands via the CLI and take responsibility for restoring lost configuration options on any subsequent *join*.

I will post the results of testing this PR separately.

Signed-off-by: Phill Kelley <[email protected]>
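The "clean slate" rule described above can be illustrated with a small sketch. This is not the code from `entrypoint.sh` in this PR. The function name `networks_to_auto_join` is hypothetical, and the sketch assumes that the persistent store holds per-network state in a `networks.d` sub-directory, so the absence of that directory is treated as "first launch".

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# Usage: networks_to_auto_join STORE_DIR [NETWORK_ID ...]
# Prints the network IDs that may safely be joined automatically.
networks_to_auto_join() {
    store="$1"
    shift
    # Assumption: an existing networks.d directory means the container has
    # run before, so nothing is joined (and nothing is ever left)
    # automatically. Existing per-network options are never touched.
    if [ -d "$store/networks.d" ]; then
        return 0
    fi
    # Clean slate: every listed network is a candidate for an automatic join.
    for id in "$@"; do
        echo "$id"
    done
}
```

Because the helper never proposes a *leave*, it cannot destroy `allowManaged`, `allowGlobal`, `allowDefault` or `allowDNS` settings on a store that is already in use.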
Testing

This post includes a series of tests I have run against containers built using the changes proposed in this PR.

reference service definitions

client

zerotier:
  container_name: zerotier
  image: "zyclonite/zerotier:local"
  restart: unless-stopped
  network_mode: host
  environment:
    - TZ=${TZ:-Etc/UTC}
    # - ZEROTIER_ONE_NETWORK_IDS=9999888877776666
    # - ZEROTIER_ONE_CHK_SPECIFIC_NETWORKS=9999888877776666
    # - ZEROTIER_ONE_CHK_MIN_ROUTES_FOR_HEALTH=1
  volumes:
    - ./volumes/zerotier-one:/var/lib/zerotier-one
  devices:
    - "/dev/net/tun:/dev/net/tun"
  cap_add:
    - NET_ADMIN
    - SYS_ADMIN

router

zerotier-router:
  container_name: zerotier
  image: "zyclonite/zerotier-router:local"
  restart: unless-stopped
  network_mode: host
  environment:
    - TZ=${TZ:-Etc/UTC}
    - ZEROTIER_ONE_LOCAL_PHYS=ens18
    - ZEROTIER_ONE_USE_IPTABLES_NFT=true
    - ZEROTIER_ONE_GATEWAY_MODE=both
    # - ZEROTIER_ONE_NETWORK_IDS=9999888877776666
    # - ZEROTIER_ONE_CHK_SPECIFIC_NETWORKS=9999888877776666
    # - ZEROTIER_ONE_CHK_MIN_ROUTES_FOR_HEALTH=1
  volumes:
    - ./volumes/zerotier-one:/var/lib/zerotier-one
  devices:
    - "/dev/net/tun:/dev/net/tun"
  cap_add:
    - NET_ADMIN
    - SYS_ADMIN
    - NET_RAW

client testing

clean slate - null configuration

configuration
setup
Interpretation:
test
Interpretation:
minimum route checking

configuration
setup
Interpretation:
test
Interpretation:
specific network checking

configuration
setup
Interpretation:
test
Interpretation:
router testing

minimum route checking

configuration
setup
Interpretation:
test
Interpretation:
specific network checking

configuration
setup
Interpretation:
test
Interpretation:
Hmmm. That's a bit odd:

Seems to be moaning about the `--start-interval` flag but that's definitely defined in the Dockerfile documentation. I don't get this error when I do a local build. Any ideas?

Edit 1 - maybe the second sentence in the doco explains it:

Edit 2 - it seems to be a
My humble question for a small use case scenario:
Testing

Use either a specific networks test or a minimum route count test.

reference service definitions

zerotier:
  container_name: zerotier
  image: "zyclonite/zerotier:local"
  restart: unless-stopped
  network_mode: host
  environment:
    - TZ=${TZ:-Etc/UTC}
    # - ZEROTIER_ONE_NETWORK_IDS=9999888877776666 9999888877775555
    # - ZEROTIER_ONE_CHK_SPECIFIC_NETWORKS=9999888877776666 9999888877775555
    # - ZEROTIER_ONE_CHK_MIN_ROUTES_FOR_HEALTH=2
  volumes:
    - ./volumes/zerotier-one:/var/lib/zerotier-one
  devices:
    - "/dev/net/tun:/dev/net/tun"
  cap_add:
    - NET_ADMIN
    - SYS_ADMIN

client testing

The router has the same health check, so it will behave the same way.

check using specific networks

configuration
setup
Interpretation:
test
Interpretation:
check using minimum route count

configuration
setup
Interpretation:
test
Interpretation:
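
The either/or rule stated at the top of this comment (a specific networks test, otherwise a minimum route count test) can be sketched as a pure shell function. This is an illustration, not the `healthcheck.sh` from this PR. The function names are hypothetical. The sketch assumes `zerotier-cli listnetworks` style lines of the form `200 listnetworks <id> <name> <mac> <status> ...`, assumes that ZeroTier interface names start with `zt`, and assumes a default minimum of one route.

```shell
#!/bin/sh
# Sketch only. All function names here are hypothetical.

# Count routing-table lines (read from stdin) that use a ZeroTier interface.
# Assumption: ZeroTier interface names begin with "zt".
count_zt_routes() {
    grep -c ' dev zt'
}

# Succeed if the listnetworks-style text on stdin shows network $1 as OK.
network_is_ok() {
    grep -q "^200 listnetworks $1 .* OK "
}

# check_health SPECIFIC_IDS MIN_ROUTES NETWORKS_TEXT ROUTES_TEXT
# If SPECIFIC_IDS is non-empty, every listed network must be OK.
# Otherwise at least MIN_ROUTES (assumed default: 1) ZeroTier routes must exist.
check_health() {
    specific="$1"
    min_routes="$2"
    networks_txt="$3"
    routes_txt="$4"
    if [ -n "$specific" ]; then
        for id in $specific; do
            printf '%s\n' "$networks_txt" | network_is_ok "$id" || return 1
        done
        return 0
    fi
    n=$(printf '%s\n' "$routes_txt" | count_zt_routes)
    [ "$n" -ge "${min_routes:-1}" ]
}
```

In a real health check the two text arguments would presumably come from `zerotier-cli listnetworks` and `ip route`. Passing them in as arguments keeps the decision logic testable without a running ZeroTier daemon.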
@zyclonite - Lukas does this comment make any sense to you in the ZeroTier build context?

I have no experience with GitHub Actions so I don't understand anything about how these things run. Do you get to specify the version of buildah, or is this just a service you're using that's maintained by someone else (eg GitHub provides this as a standard service)?

According to buildah install.md, for Debian I need Bookworm plus a standard

For Ubuntu the prerequisite is 20.10 and newer, also with a standard

Both of these versions are less than the v1.37.0 cited as where the fix went into production. The latest and greatest appears to be 1.39.1.

I don't get why the Bookworm and Noble Numbat

I started to read Building from scratch but my eyes quickly glazed over.