-
Notifications
You must be signed in to change notification settings - Fork 867
feat: Heartbeat shard statistics #7431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
Theis-Mathiassen
wants to merge
58
commits into
cadence-workflow:master
Choose a base branch
from
AndreasHolt:heartbeat-shard-statistics
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+347
−29
Open
Changes from all commits
Commits
Show all changes
58 commits
Select commit
Hold shift + click to select a range
2de12d8
feat(shard distributor): add shard key helpers and metrics state
AndreasHolt 5d95067
feat(shard distributor): persist shard metrics in etcd store
AndreasHolt 6e57536
fix(shard distributor): update LastMoveTime in the case where a shard…
AndreasHolt 595d320
test(shard distributor): add tests for shard metrics
AndreasHolt d9ba54d
fix(shard distributor): modify comment
AndreasHolt 32d2ecd
fix(shard distributor): add atomic check to prevent metrics race
AndreasHolt b624a00
fix(shard distributor): apply shard metric updates in a second phase …
AndreasHolt aad7b2e
feat(shard distributor): move shard metric updates out of AssignShard…
AndreasHolt 6360f8a
fix(shard distributor): keep NamespaceState revisions tied to assignm…
AndreasHolt 1536d0a
refactor(shard distributor): use shard cache and clock for preparing …
AndreasHolt f316fbf
test(shard distributor): BuildShardPrefix, BuildShardKey, ParseShardKey
AndreasHolt 4524da9
feat(shard distributor): simplify shard metrics updates
AndreasHolt 126f725
refactor(shard distributor): ShardMetrics renamed to ShardStatistics.…
AndreasHolt cc53f68
test(shard distributor): small changes to shard key tests s.t. they l…
AndreasHolt 733bbcb
fix(shard distributor): no longer check for key type ShardStatisticsK…
AndreasHolt 6816b8e
refactor(shard distributor): found a place where I forgot to rename t…
AndreasHolt f97e0cf
fix(shard distributor): move non-exported helpers to end of file to f…
AndreasHolt 513e88c
feat(shard distributor): clean up the shard statistics
AndreasHolt 9833525
test(shard distributor): add test case for when shard stats are deleted
AndreasHolt 0332fe5
fix(shard distributor): add mapping (new metric)
AndreasHolt d5a13d9
feat(shard distributor): retain shard stats while shards are within h…
AndreasHolt 634bc02
feat: function to update shard statistics from heartbeat (currently n…
AndreasHolt 812e854
test(shard distributor): add tests to verify statistics are updated a…
AndreasHolt b9813e7
feat(shard distributor): calculate smoothed load (ewma) using the Sha…
AndreasHolt dfb7448
fix(shard distributor): log invalid shard load
AndreasHolt 36ec08f
chore: added logger warning and simplified ewma calculation
Theis-Mathiassen 38a6e81
Merge branch 'master' into heartbeat-shard-statistics
Theis-Mathiassen af733e6
fix: remove duplicate test introduced in merge
AndreasHolt a52e86f
chore: consistent error checking, and rename function
Theis-Mathiassen abfc80e
chore: added decompress to unmarshal
Theis-Mathiassen df0feaf
feat(shard distributor): persist shard metrics in etcd store
AndreasHolt 8546a26
feat(shard distributor): move shard metric updates out of AssignShard…
AndreasHolt dde87ef
fix(shard distributor): keep NamespaceState revisions tied to assignm…
AndreasHolt 415e80c
refactor(shard distributor): use shard cache and clock for preparing …
AndreasHolt c67d5c3
feat(shard distributor): simplify shard metrics updates
AndreasHolt 9ffcefb
refactor(shard distributor): ShardMetrics renamed to ShardStatistics.…
AndreasHolt cc769bf
refactor(shard distributor): found a place where I forgot to rename t…
AndreasHolt 8c22663
fix(shard distributor): move non-exported helpers to end of file to f…
AndreasHolt 5ac3c5d
test(shard distributor): add test case for when shard stats are deleted
AndreasHolt 3973b82
feat(shard distributor): retain shard stats while shards are within h…
AndreasHolt 3830d5e
feat: function to update shard statistics from heartbeat (currently n…
AndreasHolt 443c0b1
test(shard distributor): add tests to verify statistics are updated a…
AndreasHolt 9d159e7
feat(shard distributor): calculate smoothed load (ewma) using the Sha…
AndreasHolt 18e63b7
fix(shard distributor): log invalid shard load
AndreasHolt e08a286
chore: added logger warning and simplified ewma calculation
Theis-Mathiassen 08eb635
fix: remove duplicate test introduced in merge
AndreasHolt f63664a
chore: consistent error checking, and rename function
Theis-Mathiassen 10e2ffa
chore: added decompress to unmarshal
Theis-Mathiassen 8c6b0c8
chore: removed an old struct that appeared during rebase
Theis-Mathiassen 158e030
feat(shard distributor): throttle shard-stat writes
AndreasHolt dd45ff0
Merge branch 'heartbeat-shard-statistics' of github.com:AndreasHolt/c…
AndreasHolt 05e0d1d
fix(shard distributor): linter error
AndreasHolt e0779ec
feat(shard distributor): decouple shard stats write-throttling decisi…
AndreasHolt 9546f24
Merge branch 'master' into heartbeat-shard-statistics
AndreasHolt db70702
fix(shard-distributor): inverted condition in shard stats cleanup loop
AndreasHolt 481f9c6
chore(shard-distributor): did some formatting, and use current load i…
Theis-Mathiassen f754dd6
Merge branch 'master' into heartbeat-shard-statistics
AndreasHolt 3366828
fix(shard distributor): decouple shard assignment from stats writes
AndreasHolt File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -186,3 +186,4 @@ shardDistribution: | |
| process: | ||
| period: 1s | ||
| heartbeatTTL: 2s | ||
| shardStatsTTL: 60s | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -8,6 +8,7 @@ import ( | |
| "encoding/json" | ||
| "errors" | ||
| "fmt" | ||
| "math" | ||
| "time" | ||
|
|
||
| clientv3 "go.etcd.io/etcd/client/v3" | ||
|
|
@@ -32,14 +33,20 @@ var ( | |
| const deleteShardStatsBatchSize = 64 | ||
|
|
||
| type executorStoreImpl struct { | ||
| client *clientv3.Client | ||
| prefix string | ||
| logger log.Logger | ||
| shardCache *shardcache.ShardToExecutorCache | ||
| timeSource clock.TimeSource | ||
| recordWriter *common.RecordWriter | ||
| client *clientv3.Client | ||
| prefix string | ||
| logger log.Logger | ||
| shardCache *shardcache.ShardToExecutorCache | ||
| timeSource clock.TimeSource | ||
| recordWriter *common.RecordWriter | ||
| maxStatsPersistInterval time.Duration // Max interval before we force a shard-stat persist. | ||
| } | ||
|
|
||
| // Constants for gating shard statistics writes to reduce etcd load. | ||
| const ( | ||
| shardStatsEpsilon = 0.05 | ||
| ) | ||
|
|
||
| // shardStatisticsUpdate holds the staged statistics for a shard so we can write them | ||
| // to etcd after the main AssignShards transaction commits. | ||
| type shardStatisticsUpdate struct { | ||
|
|
@@ -97,13 +104,19 @@ func NewStore(p ExecutorStoreParams) (store.Store, error) { | |
| return nil, fmt.Errorf("create record writer: %w", err) | ||
| } | ||
|
|
||
| shardStatsTTL := p.Cfg.Process.ShardStatsTTL | ||
| if shardStatsTTL <= 0 { | ||
| shardStatsTTL = config.DefaultShardStatsTTL | ||
| } | ||
|
|
||
| store := &executorStoreImpl{ | ||
| client: etcdClient, | ||
| prefix: etcdCfg.Prefix, | ||
| logger: p.Logger, | ||
| shardCache: shardCache, | ||
| timeSource: timeSource, | ||
| recordWriter: recordWriter, | ||
| client: etcdClient, | ||
| prefix: etcdCfg.Prefix, | ||
| logger: p.Logger, | ||
| shardCache: shardCache, | ||
| timeSource: timeSource, | ||
| recordWriter: recordWriter, | ||
| maxStatsPersistInterval: deriveStatsPersistInterval(shardStatsTTL), | ||
| } | ||
|
|
||
| p.Lifecycle.Append(fx.StartStopHook(store.Start, store.Stop)) | ||
|
|
@@ -165,9 +178,124 @@ func (s *executorStoreImpl) RecordHeartbeat(ctx context.Context, namespace, exec | |
| if err != nil { | ||
| return fmt.Errorf("record heartbeat: %w", err) | ||
| } | ||
|
|
||
| s.recordShardStatistics(ctx, namespace, executorID, request.ReportedShards) | ||
|
|
||
| return nil | ||
| } | ||
|
|
||
| func deriveStatsPersistInterval(shardStatsTTL time.Duration) time.Duration { | ||
| if shardStatsTTL <= time.Second { | ||
| return time.Second | ||
| } | ||
| return shardStatsTTL - time.Second | ||
| } | ||
|
|
||
| func (s *executorStoreImpl) recordShardStatistics(ctx context.Context, namespace, executorID string, reported map[string]*types.ShardStatusReport) { | ||
| if len(reported) == 0 { | ||
| return | ||
| } | ||
|
|
||
| now := s.timeSource.Now().UTC() | ||
|
|
||
| for shardID, report := range reported { | ||
| if report == nil { | ||
| s.logger.Warn("empty report; skipping EWMA update", | ||
| tag.ShardNamespace(namespace), | ||
| tag.ShardExecutor(executorID), | ||
| tag.ShardKey(shardID), | ||
| ) | ||
| continue | ||
| } | ||
|
|
||
| load := report.ShardLoad | ||
| if math.IsNaN(load) || math.IsInf(load, 0) { | ||
| s.logger.Warn( | ||
| "invalid shard load reported; skipping EWMA update", | ||
| tag.ShardNamespace(namespace), | ||
| tag.ShardExecutor(executorID), | ||
| tag.ShardKey(shardID), | ||
| ) | ||
| continue | ||
| } | ||
|
|
||
| shardStatsKey := etcdkeys.BuildShardKey(s.prefix, namespace, shardID, etcdkeys.ShardStatisticsKey) | ||
|
|
||
| statsResp, err := s.client.Get(ctx, shardStatsKey) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we are performing a get operation for each shard for each heartbeat and I am not sure etcd is handling this |
||
| if err != nil { | ||
| s.logger.Warn( | ||
| "failed to read shard statistics for heartbeat update", | ||
| tag.ShardNamespace(namespace), | ||
| tag.ShardExecutor(executorID), | ||
| tag.ShardKey(shardID), | ||
| tag.Error(err), | ||
| ) | ||
| continue | ||
| } | ||
|
|
||
| var stats etcdtypes.ShardStatistics | ||
| if len(statsResp.Kvs) > 0 { | ||
| if err := common.DecompressAndUnmarshal(statsResp.Kvs[0].Value, &stats); err != nil { | ||
| s.logger.Warn( | ||
| "failed to unmarshal shard statistics for heartbeat update", | ||
| tag.ShardNamespace(namespace), | ||
| tag.ShardExecutor(executorID), | ||
| tag.ShardKey(shardID), | ||
| tag.Error(err), | ||
| ) | ||
| continue | ||
| } | ||
| } | ||
|
|
||
| // Update smoothed load via EWMA. | ||
| prevSmoothed := stats.SmoothedLoad | ||
| prevUpdate := stats.LastUpdateTime.ToTime() | ||
| newSmoothed := ewmaSmoothedLoad(prevSmoothed, load, prevUpdate, now) | ||
|
|
||
| // Decide whether to persist this update. We always persist if this is the | ||
| // first observation (prevUpdate == 0). Otherwise, if the change is small | ||
| // and the previous persist is recent, skip the write to reduce etcd load. | ||
| shouldPersist := true | ||
| if !prevUpdate.IsZero() { | ||
| age := now.Sub(prevUpdate) | ||
| delta := math.Abs(newSmoothed - prevSmoothed) | ||
| if delta < shardStatsEpsilon && age < s.maxStatsPersistInterval { | ||
| shouldPersist = false | ||
| } | ||
| } | ||
|
|
||
| if !shouldPersist { | ||
| // Skip persisting, proceed to next shard. | ||
| continue | ||
| } | ||
|
|
||
| stats.SmoothedLoad = newSmoothed | ||
| stats.LastUpdateTime = etcdtypes.Time(now) | ||
|
|
||
| payload, err := json.Marshal(stats) | ||
| if err != nil { | ||
| s.logger.Warn( | ||
| "failed to marshal shard statistics after heartbeat", | ||
| tag.ShardNamespace(namespace), | ||
| tag.ShardExecutor(executorID), | ||
| tag.ShardKey(shardID), | ||
| tag.Error(err), | ||
| ) | ||
| continue | ||
| } | ||
|
|
||
| if _, err := s.client.Put(ctx, shardStatsKey, string(payload)); err != nil { | ||
| s.logger.Warn( | ||
| "failed to persist shard statistics from heartbeat", | ||
| tag.ShardNamespace(namespace), | ||
| tag.ShardExecutor(executorID), | ||
| tag.ShardKey(shardID), | ||
| tag.Error(err), | ||
| ) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // GetHeartbeat retrieves the last known heartbeat state for a single executor. | ||
| func (s *executorStoreImpl) GetHeartbeat(ctx context.Context, namespace string, executorID string) (*store.HeartbeatState, *store.AssignedState, error) { | ||
| // The prefix for all keys related to a single executor. | ||
|
|
@@ -476,9 +604,7 @@ func (s *executorStoreImpl) AssignShard(ctx context.Context, namespace, shardID, | |
| } | ||
|
|
||
| now := s.timeSource.Now().UTC() | ||
| statsModRevision := int64(0) | ||
| if len(statsResp.Kvs) > 0 { | ||
| statsModRevision = statsResp.Kvs[0].ModRevision | ||
| if err := common.DecompressAndUnmarshal(statsResp.Kvs[0].Value, &shardStats); err != nil { | ||
| return fmt.Errorf("parse shard statistics: %w", err) | ||
| } | ||
|
|
@@ -543,7 +669,6 @@ func (s *executorStoreImpl) AssignShard(ctx context.Context, namespace, shardID, | |
| comparisons = append(comparisons, clientv3.Compare(clientv3.ModRevision(statusKey), "=", statusModRev)) | ||
| // b) Check that neither the assigned_state nor shard statistics were modified concurrently. | ||
| comparisons = append(comparisons, clientv3.Compare(clientv3.ModRevision(assignedState), "=", modRevision)) | ||
| comparisons = append(comparisons, clientv3.Compare(clientv3.ModRevision(shardStatsKey), "=", statsModRevision)) | ||
| // c) Check that the cache is up to date. | ||
| cmp, err := s.shardCache.GetExecutorModRevisionCmp(namespace) | ||
| if err != nil { | ||
|
|
@@ -764,3 +889,16 @@ func (s *executorStoreImpl) applyShardStatisticsUpdates(ctx context.Context, nam | |
| } | ||
| } | ||
| } | ||
|
|
||
| func ewmaSmoothedLoad(prev, current float64, lastUpdate, now time.Time) float64 { | ||
| const tau = 30 * time.Second // smaller = more responsive, larger = smoother | ||
| if lastUpdate.IsZero() || tau <= 0 { | ||
| return current | ||
| } | ||
| if now.Before(lastUpdate) { | ||
| return current | ||
| } | ||
| dt := now.Sub(lastUpdate) | ||
| alpha := 1 - math.Exp(-dt.Seconds()/tau.Seconds()) | ||
| return (1-alpha)*prev + alpha*current | ||
| } | ||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss if this should happen syncrhronously when we record heartbeat