enhance: add rw/ro streaming query node replica management #38677

Merged

@chyezh chyezh commented Dec 24, 2024

issue: #38399

  • Embed the query node into the streaming node so that the delegator is available on the streaming node.
  • The embedded query node carries a special server label, QUERYNODE_STREAMING-EMBEDDED.
  • Change the balance strategy so that channels are assigned to streaming nodes whenever possible.
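
As a rough illustration of the third point, the sketch below prefers streaming-embedded query nodes when picking a node for a channel and falls back to ordinary query nodes only if none exist. The `Node` type and `pickNodeForChannel` function are hypothetical names made up for this example; the actual balancer logic in querycoordv2 is not shown in this description and may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical view of a query node as seen by a balancer.
type Node struct {
	ID        int64
	Streaming bool // true if the query node is embedded in a streaming node
	Channels  int  // number of channels currently assigned to the node
}

// pickNodeForChannel returns the ID of the node that should receive the next
// channel. Streaming-embedded nodes are preferred; ordinary query nodes are
// used only when no streaming node is present. Within the chosen group the
// least-loaded node wins, and ties are broken by the lower ID.
func pickNodeForChannel(nodes []Node) (int64, bool) {
	if len(nodes) == 0 {
		return 0, false
	}
	candidates := make([]Node, 0, len(nodes))
	for _, n := range nodes {
		if n.Streaming {
			candidates = append(candidates, n)
		}
	}
	if len(candidates) == 0 {
		// Fallback: no streaming node available, consider every node.
		candidates = append(candidates, nodes...)
	}
	sort.Slice(candidates, func(i, j int) bool {
		if candidates[i].Channels != candidates[j].Channels {
			return candidates[i].Channels < candidates[j].Channels
		}
		return candidates[i].ID < candidates[j].ID
	})
	return candidates[0].ID, true
}

func main() {
	nodes := []Node{
		{ID: 1, Streaming: false, Channels: 0},
		{ID: 2, Streaming: true, Channels: 3},
		{ID: 3, Streaming: true, Channels: 1},
	}
	id, ok := pickNodeForChannel(nodes)
	fmt.Println(id, ok) // prints "3 true"
}
```

Node 1 is idle but is skipped because two streaming nodes are available; among those, node 3 carries fewer channels.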

@sre-ci-robot sre-ci-robot added area/internal-api size/XL labels Dec 24, 2024
@mergify mergify bot added dco-passed kind/enhancement labels Dec 24, 2024

mergify bot commented Dec 24, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@chyezh chyezh force-pushed the enhance_embed_querynode_in_streamingnode branch from c3ce6ad to 0c32ff6 Compare December 24, 2024 02:44

mergify bot commented Dec 24, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

codecov bot commented Dec 24, 2024

Codecov Report

Attention: Patch coverage is 75.53816% with 125 lines in your changes missing coverage. Please review.

Project coverage is 81.14%. Comparing base (e5eb115) to head (1f5b4ca).
Report is 117 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ernal/querycoordv2/balance/multi_target_balance.go | 0.00% | 34 Missing ⚠️ |
| ...al/querycoordv2/balance/rowcount_based_balancer.go | 70.58% | 10 Missing and 5 partials ⚠️ |
| internal/querycoordv2/ops_services.go | 23.52% | 9 Missing and 4 partials ⚠️ |
| ...erycoordv2/balance/channel_level_score_balancer.go | 18.18% | 8 Missing and 1 partial ⚠️ |
| ...ternal/querycoordv2/meta/replica_manager_helper.go | 82.35% | 8 Missing and 1 partial ⚠️ |
| internal/querycoordv2/server.go | 12.50% | 5 Missing and 2 partials ⚠️ |
| ...dv2/balance/streaming_query_node_channel_helper.go | 85.00% | 4 Missing and 2 partials ⚠️ |
| internal/querycoordv2/meta/replica_manager.go | 87.23% | 4 Missing and 2 partials ⚠️ |
| pkg/util/merr/utils.go | 0.00% | 6 Missing ⚠️ |
| ...nternal/querycoordv2/observers/replica_observer.go | 90.56% | 3 Missing and 2 partials ⚠️ |

... and 5 more
Additional details and impacted files

@@            Coverage Diff             @@
##           master   #38677      +/-   ##
==========================================
+ Coverage   81.11%   81.14%   +0.02%     
==========================================
  Files        1395     1397       +2     
  Lines      197298   197712     +414     
==========================================
+ Hits       160037   160427     +390     
- Misses      31623    31646      +23     
- Partials     5638     5639       +1     
| Components | Coverage Δ |
| --- | --- |
| Client | 79.53% <ø> (ø) |
| Core | 69.64% <ø> (ø) |
| Go | 83.06% <75.53%> (+0.02%) ⬆️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...ernal/querycoordv2/balance/score_based_balancer.go | 98.78% <100.00%> (+0.71%) ⬆️ |
| internal/querycoordv2/checkers/channel_checker.go | 87.35% <100.00%> (+0.22%) ⬆️ |
| internal/querycoordv2/meta/replica.go | 100.00% <100.00%> (ø) |
| internal/querycoordv2/session/node_manager.go | 100.00% <100.00%> (ø) |
| ...al/streamingcoord/server/balancer/balancer_impl.go | 77.99% <100.00%> (+0.76%) ⬆️ |
| internal/streamingcoord/server/server.go | 72.22% <100.00%> (+0.52%) ⬆️ |
| internal/util/sessionutil/session_util.go | 76.51% <ø> (ø) |
| ...ode/server/flusher/flusherimpl/channel_lifetime.go | 71.27% <0.00%> (ø) |
| ...al/coordinator/snmanager/streaming_node_manager.go | 94.73% <94.73%> (ø) |
| internal/querycoordv2/balance/balance.go | 91.20% <70.00%> (-2.91%) ⬇️ |

... and 12 more

... and 30 files with indirect coverage changes


mergify bot commented Dec 24, 2024

@chyezh go-sdk check failed, comment rerun go-sdk can trigger the job again.

@chyezh chyezh force-pushed the enhance_embed_querynode_in_streamingnode branch from 5d38958 to 781ff46 Compare December 26, 2024 13:36

mergify bot commented Dec 26, 2024

@chyezh go-sdk check failed, comment rerun go-sdk can trigger the job again.


mergify bot commented Dec 26, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@chyezh chyezh force-pushed the enhance_embed_querynode_in_streamingnode branch from 781ff46 to 3f613f6 Compare December 27, 2024 16:11

mergify bot commented Dec 27, 2024

@chyezh go-sdk check failed, comment rerun go-sdk can trigger the job again.

@chyezh chyezh force-pushed the enhance_embed_querynode_in_streamingnode branch 2 times, most recently from 9db9bcf to 3953e7c Compare December 28, 2024 11:47

mergify bot commented Dec 28, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@chyezh chyezh force-pushed the enhance_embed_querynode_in_streamingnode branch 2 times, most recently from 70a9f34 to 2544e08 Compare December 29, 2024 02:46

mergify bot commented Dec 29, 2024

@chyezh go-sdk check failed, comment rerun go-sdk can trigger the job again.

mergify bot commented Dec 29, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@sre-ci-robot sre-ci-robot added size/XXL and removed size/XL labels Dec 29, 2024

mergify bot commented Dec 29, 2024

@chyezh go-sdk check failed, comment rerun go-sdk can trigger the job again.


mergify bot commented Dec 29, 2024

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@chyezh chyezh force-pushed the enhance_embed_querynode_in_streamingnode branch from 413fd4b to 1546f5a Compare December 29, 2024 11:16
@chyezh chyezh force-pushed the enhance_embed_querynode_in_streamingnode branch from 99e8552 to e22ad87 Compare January 7, 2025 13:09
@mergify mergify bot added the ci-passed label Jan 7, 2025
@chyezh chyezh force-pushed the enhance_embed_querynode_in_streamingnode branch from e22ad87 to 789b8dc Compare January 10, 2025 02:58
@mergify mergify bot removed the ci-passed label Jan 10, 2025

mergify bot commented Jan 10, 2025

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@chyezh chyezh force-pushed the enhance_embed_querynode_in_streamingnode branch 2 times, most recently from 5dfd0d7 to 1f5b4ca Compare January 10, 2025 04:22

mergify bot commented Jan 10, 2025

@chyezh go-sdk check failed, comment rerun go-sdk can trigger the job again.


chyezh commented Jan 10, 2025

rerun go-sdk


mergify bot commented Jan 10, 2025

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.


chyezh commented Jan 10, 2025

/run-cpu-e2e


mergify bot commented Jan 10, 2025

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.


chyezh commented Jan 10, 2025

/run-cpu-e2e

mergify bot commented Jan 10, 2025

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.


chyezh commented Jan 10, 2025

/run-cpu-e2e


mergify bot commented Jan 10, 2025

@chyezh E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.


chyezh commented Jan 10, 2025

/run-cpu-e2e

@mergify mergify bot added the ci-passed label Jan 10, 2025
s.cond.L.Unlock()
return nil
}); err != nil {
return err
Contributor

Although the only error here is a context cancellation, I still suggest adding a log before the return.
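
The suggestion above can be sketched roughly as follows. This is only an illustration under stated assumptions: `waitReady`, `run` and `runCancelled` are hypothetical names, the blocking step is reduced to a channel wait, and the standard library `log` package stands in for whatever logger the project actually uses.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// waitReady is a hypothetical blocking step: it returns nil once ready is
// closed, or the context error if ctx is cancelled first.
func waitReady(ctx context.Context, ready <-chan struct{}) error {
	select {
	case <-ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// run logs the failure before propagating it, so that even a plain
// cancellation leaves a trace in the logs.
func run(ctx context.Context, ready <-chan struct{}) error {
	if err := waitReady(ctx, ready); err != nil {
		log.Printf("wait aborted: %v", err)
		return err
	}
	return nil
}

// runCancelled calls run with an already-cancelled context and reports
// whether the returned error is context.Canceled.
func runCancelled() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	err := run(ctx, make(chan struct{}))
	return errors.Is(err, context.Canceled)
}

func main() {
	fmt.Println(runCancelled())
}
```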


// StreamingNodeManager is a manager for manage the querynode that embedded into streaming node.
// StreamingNodeManager is exclusive with ResourceManager.
type StreamingNodeManager struct {
Contributor

Use StreamingNodeObserver instead of StreamingNodeManager

@chyezh chyezh added this to the 2.6.0 milestone Jan 23, 2025
Member

@liliu-z liliu-z left a comment

/lgtm
/approve

@@ -175,6 +177,10 @@ func GetMilvusRoles(args []string, flags *flag.FlagSet) *roles.MilvusRoles {
role.EnableIndexNode = enableIndexNode
role.EnableProxy = enableProxy
role.EnableStreamingNode = enableStreamingNode
Member

Why don't we check IsStreamingServiceEnabled here?

Contributor Author

If enableStreamingNode is set, IsStreamingServiceEnabled is always true for mixture.
Because the streaming service must be enabled in 2.6, we will remove all of these checks before the 2.6 release.
However, that requires modifying a large number of related unit tests, so the removal PR is deferred.

@@ -150,7 +150,9 @@ func GetMilvusRoles(args []string, flags *flag.FlagSet) *roles.MilvusRoles {
role.EnableIndexNode = true
case typeutil.StreamingNodeRole:
streamingutil.MustEnableStreamingService()
Member

Redundant check

"github.com/milvus-io/milvus/pkg/util/typeutil"
)

var StaticStreamingNodeManager = newStreamingNodeManager()
Member

Why not use sync.Once? Please correct me if I'm wrong, but this will be initialized no matter what kind of node it is.

Contributor Author

@chyezh chyezh Jan 25, 2025

It's a statically initialized variable: no heavy operation happens during init, and there is no concurrency issue here.
So sync.Once is not needed, but there is a redundant dead goroutine here.
It should only be available on the coordinator; I will fix it.
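
For readers following the trade-off in this thread: a package-level `var x = newX()` runs in every process that links the package, while `sync.Once` defers construction to the first caller. A minimal sketch of the lazy variant is below; `manager`, `getManager` and `initCount` are hypothetical names for illustration, not the code in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in for a component that may start
// background work when it is constructed.
type manager struct {
	started bool
}

var (
	once      sync.Once
	instance  *manager
	initCount int // counts how many times the constructor body ran
)

// getManager builds the singleton on first use only. Processes that never
// call it never pay for the construction or any goroutine it would start.
func getManager() *manager {
	once.Do(func() {
		initCount++
		instance = &manager{started: true}
	})
	return instance
}

func main() {
	a := getManager()
	b := getManager()
	fmt.Println(a == b, initCount) // prints "true 1"
}
```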

// EnableEmbededQueryNode set server labels for embedded query node.
func EnableEmbededQueryNode() {
MustEnableStreamingService()
os.Setenv(sessionutil.SupportedLabelPrefix+sessionutil.LabelStreamingNodeEmbeddedQueryNode, "1")
Member

Curious why we set an env variable instead of a global variable.

Contributor Author

Both an env variable and a global variable are OK here.
I will change it to use a global variable in another PR.
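
The two options discussed here can be compared side by side in a small sketch. Everything named below is an assumption for illustration: `labelEnvKey` is a made-up key (the real prefix and label constants live in sessionutil and their values are not shown in this thread), and `atomic.Bool` requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"os"
	"sync/atomic"
)

// labelEnvKey is a hypothetical environment key used only in this example.
const labelEnvKey = "EXAMPLE_SERVER_LABEL_STREAMING_EMBEDDED"

// Option A: publish the flag through the process environment. Any code that
// already reads labels from the environment picks it up without changes.
func enableViaEnv() error {
	return os.Setenv(labelEnvKey, "1")
}

func enabledViaEnv() bool {
	return os.Getenv(labelEnvKey) == "1"
}

// Option B: keep the flag in a package-level variable. It stays inside the
// process, and readers must call the accessor explicitly.
var embeddedEnabled atomic.Bool

func enableViaGlobal() {
	embeddedEnabled.Store(true)
}

func enabledViaGlobal() bool {
	return embeddedEnabled.Load()
}

func main() {
	if err := enableViaEnv(); err != nil {
		fmt.Println("setenv failed:", err)
		return
	}
	enableViaGlobal()
	fmt.Println(enabledViaEnv(), enabledViaGlobal())
}
```

One practical difference is that `os.Setenv` can fail and its error is worth checking, whereas the in-process flag cannot.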

@sre-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chyezh, liliu-z

@sre-ci-robot sre-ci-robot merged commit c84a074 into milvus-io:master Jan 24, 2025
19 of 20 checks passed
@chyezh chyezh deleted the enhance_embed_querynode_in_streamingnode branch January 25, 2025 04:13
Labels
approved, area/internal-api, area/test, ci-passed, dco-passed, kind/enhancement, lgtm, sig/testing, size/XXL