Add kubernetes election module #3721

Merged
merged 9 commits into from
Feb 10, 2025

Contributor

@osmman osmman commented Jan 27, 2025

This PR adds a new leader election implementation using the Kubernetes Lease API.

By leveraging Kubernetes Lease resources, this implementation provides an alternative backend for leader election. It simplifies Kubernetes deployments by removing the need to run a secondary etcd cluster alongside the workload.

Fixes #3431

Checklist


google-cla bot commented Jan 27, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@osmman osmman force-pushed the k8s-election branch 3 times, most recently from 922ebcf to 68897bf, January 27, 2025 16:29
Contributor

@mhutchinson mhutchinson left a comment

Thanks for this PR. Are you happy to be an owner of the new k8s election code module? The core team doesn't have the time or expertise to take on ownership of this. On the whole, this role mostly involves reviewing code changes from other users and handling any dependency updates that require code changes.

I've left some comments on the old code that was refactored to make space for this but I haven't reviewed the new code yet. I'm at a summit this week but this is on my radar for when I have time.

cmd/trillian_log_signer/main.go (outdated, resolved)
util/election2/etcd/provider.go (outdated, resolved)
util/election2/etcd/provider.go (outdated, resolved)
util/election2/etcd/provider.go (resolved)
util/election2/etcd/provider.go (outdated, resolved)
util/election2/provider.go (outdated, resolved)
Introduce a new election module using the Kubernetes Lease API. Modify
the existing mechanism to allow selection of election mechanisms by
introducing the `--election_system` parameter, following the same
pattern used for storage and quota system selection.
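
A rough sketch of what name-based selection of an election backend could look like is given below. Only the `--election_system` flag name comes from the commit message; the `Factory` interface, the `register` and `newFactory` helpers, the backend names "etcd" and "k8s", and the flag default are all hypothetical and are not taken from the Trillian code.

```go
package main

import (
	"flag"
	"fmt"
	"sort"
)

// Factory is a hypothetical stand-in for an election factory.
type Factory interface {
	Name() string
}

type namedFactory string

func (n namedFactory) Name() string { return string(n) }

// providers maps an election system name to its constructor.
var providers = map[string]func() (Factory, error){}

// register adds a named provider. A backend package would typically
// call this from its own init function.
func register(name string, fn func() (Factory, error)) {
	if _, dup := providers[name]; dup {
		panic("election provider registered twice: " + name)
	}
	providers[name] = fn
}

// names returns the registered system names in sorted order.
func names() []string {
	out := make([]string, 0, len(providers))
	for n := range providers {
		out = append(out, n)
	}
	sort.Strings(out)
	return out
}

// newFactory builds the provider selected by name.
func newFactory(name string) (Factory, error) {
	fn, ok := providers[name]
	if !ok {
		return nil, fmt.Errorf("unknown election system %q, have %v", name, names())
	}
	return fn()
}

func init() {
	// Backend names are assumptions made for this sketch.
	register("etcd", func() (Factory, error) { return namedFactory("etcd"), nil })
	register("k8s", func() (Factory, error) { return namedFactory("k8s"), nil })
}

func main() {
	// The default value here is an assumption.
	system := flag.String("election_system", "etcd", "Election system to use")
	flag.Parse()

	f, err := newFactory(*system)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("using election system:", f.Name())
}
```

With a registry like this, adding a backend does not require touching the binary's main function beyond importing the package that registers it.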
@osmman
Contributor Author

osmman commented Jan 31, 2025

> Thanks for this PR. Are you happy to be an owner of the new k8s election code module? The core team doesn't have the time or expertise to take on ownership of this. On the whole, this role mostly involves reviewing code changes from other users and handling any dependency updates that require code changes.

Yes, I can take ownership

Contributor

@mhutchinson mhutchinson left a comment

Hi Tomas, thanks for confirming you can take ownership of this submodule. That's great.

I've put a couple of comments on the existing files. My primary concern is that we don't break behaviour for existing users, and that we avoid entangling the existing code too much with this new module. Once I'm happy with the changes around the old codebase I'll do a deeper review of the new module implementation.

BTW, one thing you can do that will make reviews far easier for me is to keep each round of changes in its own commit in the chain. We'll squash them together when merging at the end. This makes it much easier for me to step through the new code changes and be sure that no changes have been made to code I've already reviewed. Thanks!

cmd/internal/provider/default_systems.go (outdated, resolved)
util/election2/etcd/provider.go (outdated, resolved)
@mhutchinson mhutchinson marked this pull request as ready for review February 5, 2025 12:03
@mhutchinson mhutchinson requested a review from a team as a code owner February 5, 2025 12:03
@mhutchinson mhutchinson requested a review from roger2hk February 5, 2025 12:03
@osmman osmman requested a review from mhutchinson February 7, 2025 13:34
Contributor

@mhutchinson mhutchinson left a comment

This looks good, thanks! See remaining comments around copyright dates etc, but I'm happy with this now.

The final remaining files to touch will put a bow on this and celebrate this feature and your contribution:

  • CONTRIBUTORS.md - add yourself here, if you like
  • /docs/Feature_Implementation_Matrix.md - please mention this feature and list yourself as a maintainer
  • CHANGELOG.md - note that there is now support for this so everyone can find out on the next release

@mhutchinson
Contributor

/gcbrun

@osmman osmman requested a review from mhutchinson February 10, 2025 12:29
CODEOWNERS (outdated, resolved)
@mhutchinson
Contributor

Once the change to CODEOWNERS is fixed, I'll kick off the CI workflows and when they pass, we can get this merged!

@mhutchinson
Contributor

I'll also take this opportunity to encourage you to join the Slack community if that's an option for you. There's an invite link in the main README.

@osmman osmman requested a review from mhutchinson February 10, 2025 13:02
@mhutchinson
Contributor

/gcbrun

@mhutchinson
Contributor

Looks like there could be a race condition in the new code:

Step #6 - "presubmit_batched": ==================
Step #6 - "presubmit_batched": WARNING: DATA RACE
Step #6 - "presubmit_batched": Write at 0x00c000d989d0 by goroutine 75:
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).setObservedRecord()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:284 +0xdc
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).watchLease.func1()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:235 +0x385
Step #6 - "presubmit_batched": 
Step #6 - "presubmit_batched": Previous read at 0x00c000d989d0 by goroutine 74:
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).Resign()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:256 +0x2c6
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/testonly.(*Decorator).Resign()
Step #6 - "presubmit_batched":       /workspace/util/election2/testonly/decorator.go:98 +0x141
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/testonly.runElectionResign.func1()
Step #6 - "presubmit_batched":       /workspace/util/election2/testonly/tests.go:185 +0x519
Step #6 - "presubmit_batched":   testing.tRunner()
Step #6 - "presubmit_batched":       /usr/local/go/src/testing/testing.go:1690 +0x226
Step #6 - "presubmit_batched":   testing.(*T).Run.gowrap1()
Step #6 - "presubmit_batched":       /usr/local/go/src/testing/testing.go:1743 +0x44
Step #6 - "presubmit_batched": 
Step #6 - "presubmit_batched": Goroutine 75 (running) created at:
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).watchLease()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:214 +0x4dc
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Factory).NewElection()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:51 +0x478
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/testonly.runElectionResign.func1()
Step #6 - "presubmit_batched":       /workspace/util/election2/testonly/tests.go:169 +0x99
Step #6 - "presubmit_batched":   testing.tRunner()
Step #6 - "presubmit_batched":       /usr/local/go/src/testing/testing.go:1690 +0x226
Step #6 - "presubmit_batched":   testing.(*T).Run.gowrap1()
Step #6 - "presubmit_batched":       /usr/local/go/src/testing/testing.go:1743 +0x44
Step #6 - "presubmit_batched": 
Step #6 - "presubmit_batched": Goroutine 74 (running) created at:
Step #6 - "presubmit_batched":   testing.(*T).Run()
Step #6 - "presubmit_batched":       /usr/local/go/src/testing/testing.go:1743 +0x825
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/testonly.runElectionResign()
Step #6 - "presubmit_batched":       /workspace/util/election2/testonly/tests.go:167 +0x2c6
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.TestElection.func1()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election_test.go:224 +0x4c
Step #6 - "presubmit_batched":   testing.tRunner()
Step #6 - "presubmit_batched":       /usr/local/go/src/testing/testing.go:1690 +0x226
Step #6 - "presubmit_batched":   testing.(*T).Run.gowrap1()
Step #6 - "presubmit_batched":       /usr/local/go/src/testing/testing.go:1743 +0x44
Step #6 - "presubmit_batched": ==================
Step #6 - "presubmit_batched": ==================
Step #6 - "presubmit_batched": WARNING: DATA RACE
Step #6 - "presubmit_batched": Write at 0x00c0002d8650 by goroutine 106:
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).setObservedRecord()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:285 +0x186
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).watchLease.func1()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:235 +0x385
Step #6 - "presubmit_batched": 
Step #6 - "presubmit_batched": Previous read at 0x00c0002d8650 by goroutine 113:
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).isLeaseValid()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:275 +0x9a
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).WithMastership.func1()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:187 +0x271
Step #6 - "presubmit_batched": 
Step #6 - "presubmit_batched": Goroutine 106 (running) created at:
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).watchLease()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:214 +0x4dc
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Factory).NewElection()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:51 +0x478
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/testonly.runElectionLoop()
Step #6 - "presubmit_batched":       /workspace/util/election2/testonly/tests.go:257 +0x77
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.TestElection.func1()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election_test.go:224 +0x4c
Step #6 - "presubmit_batched":   testing.tRunner()
Step #6 - "presubmit_batched":       /usr/local/go/src/testing/testing.go:1690 +0x226
Step #6 - "presubmit_batched":   testing.(*T).Run.gowrap1()
Step #6 - "presubmit_batched":       /usr/local/go/src/testing/testing.go:1743 +0x44
Step #6 - "presubmit_batched": 
Step #6 - "presubmit_batched": Goroutine 113 (running) created at:
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).WithMastership()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:181 +0x3c9
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/testonly.(*Decorator).WithMastership()
Step #6 - "presubmit_batched":       /workspace/util/election2/testonly/decorator.go:88 +0x18c
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/testonly.runElectionLoop()
Step #6 - "presubmit_batched":       /workspace/util/election2/testonly/tests.go:274 +0x494
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.TestElection.func1()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election_test.go:224 +0x4c
Step #6 - "presubmit_batched":   testing.tRunner()
Step #6 - "presubmit_batched":       /usr/local/go/src/testing/testing.go:1690 +0x226
Step #6 - "presubmit_batched":   testing.(*T).Run.gowrap1()
Step #6 - "presubmit_batched":       /usr/local/go/src/testing/testing.go:1743 +0x44
Step #6 - "presubmit_batched": ==================
Step #6 - "presubmit_batched": ==================
Step #6 - "presubmit_batched": WARNING: DATA RACE
Step #6 - "presubmit_batched": Write at 0x00c0005ca650 by goroutine 148:
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).setObservedRecord()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:285 +0x186
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).watchLease.func1()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:237 +0x327
Step #6 - "presubmit_batched": 
Step #6 - "presubmit_batched": Previous read at 0x00c0005ca650 by goroutine 150:
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).isLeaseValid()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:275 +0x9a
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).WithMastership.func1()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:187 +0x271
Step #6 - "presubmit_batched": 
Step #6 - "presubmit_batched": Goroutine 148 (running) created at:
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).watchLease()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:214 +0x4dc
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Factory).NewElection()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:51 +0x478
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.runLeaseEvents.func1()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election_test.go:57 +0x12d
Step #6 - "presubmit_batched":   testing.tRunner()
Step #6 - "presubmit_batched":       /usr/local/go/src/testing/testing.go:1690 +0x226
Step #6 - "presubmit_batched":   testing.(*T).Run.gowrap1()
Step #6 - "presubmit_batched":       /usr/local/go/src/testing/testing.go:1743 +0x44
Step #6 - "presubmit_batched": 
Step #6 - "presubmit_batched": Goroutine 150 (running) created at:
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.(*Election).WithMastership()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election.go:181 +0x3c9
Step #6 - "presubmit_batched":   github.com/google/trillian/util/election2/k8s.runLeaseEvents.func1()
Step #6 - "presubmit_batched":       /workspace/util/election2/k8s/election_test.go:64 +0x2c3
Step #6 - "presubmit_batched":   testing.tRunner()
Step #6 - "presubmit_batched":       /usr/local/go/src/testing/testing.go:1690 +0x226
Step #6 - "presubmit_batched":   testing.(*T).Run.gowrap1()
Step #6 - "presubmit_batched":       /usr/local/go/src/testing/testing.go:1743 +0x44
Step #6 - "presubmit_batched": ==================
Step #6 - "presubmit_batched": --- FAIL: TestElection (3.96s)
Step #6 - "presubmit_batched":     --- FAIL: TestElection/RunElectionResign (0.11s)
Step #6 - "presubmit_batched":         --- FAIL: TestElection/RunElectionResign/master (0.01s)
Step #6 - "presubmit_batched":             testing.go:1399: race detected during execution of test
Step #6 - "presubmit_batched":     --- FAIL: TestElection/RunElectionLoop (1.08s)
Step #6 - "presubmit_batched":         tests.go:270: Mastership iteration: 0
Step #6 - "presubmit_batched":         tests.go:270: Mastership iteration: 1
Step #6 - "presubmit_batched":         tests.go:270: Mastership iteration: 2
Step #6 - "presubmit_batched":         tests.go:270: Mastership iteration: 3
Step #6 - "presubmit_batched":         tests.go:270: Mastership iteration: 4
Step #6 - "presubmit_batched":         tests.go:270: Mastership iteration: 5
Step #6 - "presubmit_batched":         tests.go:270: Mastership iteration: 6
Step #6 - "presubmit_batched":         tests.go:270: Mastership iteration: 7
Step #6 - "presubmit_batched":         tests.go:270: Mastership iteration: 8
Step #6 - "presubmit_batched":         tests.go:270: Mastership iteration: 9
Step #6 - "presubmit_batched":         testing.go:1399: race detected during execution of test
Step #6 - "presubmit_batched":     --- FAIL: TestElection/RunLeaseEvents (0.41s)
Step #6 - "presubmit_batched":         --- FAIL: TestElection/RunLeaseEvents/deleted (0.10s)
Step #6 - "presubmit_batched":             testing.go:1399: race detected during execution of test
Step #6 - "presubmit_batched": FAIL
Step #6 - "presubmit_batched": FAIL	github.com/google/trillian/util/election2/k8s	4.104s
Step #6 - "presubmit_batched": ok  	github.com/google/trillian/util/election2/testonly	2.447s
Step #6 - "presubmit_batched": FAIL

@osmman
Contributor Author

osmman commented Feb 10, 2025

I modified the code to fix a possible race condition on Election.observedTime.

go test -race ./util/election2/...
?       github.com/google/trillian/util/election2       [no test files]
ok      github.com/google/trillian/util/election2/etcd  4.665s
ok      github.com/google/trillian/util/election2/k8s   4.440s
ok      github.com/google/trillian/util/election2/testonly      3.136s

@osmman osmman requested a review from mhutchinson February 10, 2025 15:00
@mhutchinson
Contributor

/gcbrun

@osmman
Contributor Author

osmman commented Feb 10, 2025

I added an additional fix targeting the Resign method. Unfortunately, I wasn't able to reproduce the issue locally, so I'm hoping this change addresses a new data race condition recently detected by CI.
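
The change itself is not shown in this thread. As an illustration only, a read in a resign path can be made safe by copying the shared record under the lock and then working on the copy, so the lock is not held during the update call. Every name in this sketch (`record`, `election`, `snapshot`, `resign`, and the `update` callback standing in for the API call) is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// record is a hypothetical view of the last observed lease state.
type record struct {
	Holder  string
	Version int
}

// election is a hypothetical stand-in for the real type.
type election struct {
	id string

	mu       sync.Mutex
	observed record // guarded by mu
}

// snapshot returns a copy of the observed record taken under the lock.
func (e *election) snapshot() record {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.observed
}

// resign releases the lease if this instance holds it.
// update stands in for the call that writes the lease to the server.
func (e *election) resign(update func(record) (record, error)) error {
	rec := e.snapshot() // private copy; the lock is not held during update
	if rec.Holder != e.id {
		return nil // not the holder, nothing to release
	}
	rec.Holder = ""
	updated, err := update(rec)
	if err != nil {
		return err
	}
	e.mu.Lock()
	e.observed = updated
	e.mu.Unlock()
	return nil
}

func main() {
	e := &election{id: "signer-a", observed: record{Holder: "signer-a", Version: 7}}
	err := e.resign(func(r record) (record, error) {
		r.Version++ // pretend the server accepted the write
		return r, nil
	})
	fmt.Println("err:", err, "observed:", e.snapshot())
}
```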

@mhutchinson
Contributor

Another data race:

WARNING: DATA RACE
Write at 0x00c000696ac0 by goroutine 90:
  github.com/google/trillian/util/election2/k8s.(*Election).setObservedRecord()
      /workspace/util/election2/k8s/election.go:289 +0xdc
  github.com/google/trillian/util/election2/k8s.(*Election).watchLease.func1()
      /workspace/util/election2/k8s/election.go:235 +0x385

Previous read at 0x00c000696ac0 by goroutine 89:
  github.com/google/trillian/util/election2/k8s.(*Election).Resign()
      /workspace/util/election2/k8s/election.go:256 +0x2c6
  github.com/google/trillian/util/election2/k8s.(*Election).Close()
      /workspace/util/election2/k8s/election.go:271 +0x3a
  github.com/google/trillian/util/election2/testonly.(*Decorator).Close()
      /workspace/util/election2/testonly/decorator.go:108 +0x141
  github.com/google/trillian/util/election2/testonly.runElectionClose.func1()
      /workspace/util/election2/testonly/tests.go:240 +0x593
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1743 +0x44

Goroutine 90 (running) created at:
  github.com/google/trillian/util/election2/k8s.(*Election).watchLease()
      /workspace/util/election2/k8s/election.go:214 +0x4dc
  github.com/google/trillian/util/election2/k8s.(*Factory).NewElection()
      /workspace/util/election2/k8s/election.go:51 +0x478
  github.com/google/trillian/util/election2/testonly.runElectionClose.func1()
      /workspace/util/election2/testonly/tests.go:218 +0x99
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1743 +0x44

Goroutine 89 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1743 +0x825
  github.com/google/trillian/util/election2/testonly.runElectionClose()
      /workspace/util/election2/testonly/tests.go:216 +0x2c6
  github.com/google/trillian/util/election2/k8s.TestElection.func1()
      /workspace/util/election2/k8s/election_test.go:224 +0x4c
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/local/go/src/testing/testing.go:1743 +0x44
==================
--- FAIL: TestElection (3.39s)
    --- FAIL: TestElection/RunElectionClose (0.11s)
        --- FAIL: TestElection/RunElectionClose/master (0.00s)
            testing.go:1399: race detected during execution of test
FAIL
FAIL	github.com/google/trillian/util/election2/k8s	3.495s
ok  	github.com/google/trillian/util/election2/testonly	2.441s
FAIL

@mhutchinson
Contributor

/gcbrun

@mhutchinson mhutchinson merged commit 738e4a5 into google:master Feb 10, 2025
12 checks passed
Development

Successfully merging this pull request may close these issues.

Leader election using Kubernetes leases
2 participants