Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xdsclient: update watcher API as per gRFC A88 #7977

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

purnesh42H
Copy link
Contributor

@purnesh42H purnesh42H commented Jan 2, 2025

This is the first part of implementing gRFC A88 (grpc/proposal#466).

This introduces the new watcher API but does not change any of the existing behavior. This table summarizes the API changes and behavior for each case:

Case Old API New API Behavior
Resource timer fires OnResourceDoesNotExist() OnResourceChanged(NOT_FOUND) Fail data plane RPCs
LDS or CDS resource deletion OnResourceDoesNotExist() OnResourceChanged(NOT_FOUND) Drop resource and fail data plane RPCs
xDS channel reports TRANSIENT_FAILURE OnError() OnResourceChanged(status) if resource NOT already cached; OnAmbientError(status) otherwise Continue using cached resource, if any; otherwise, fail data plane RPCs
ADS stream terminates without receiving a response OnError() OnResourceChanged(status) if resource NOT already cached; OnAmbientError(status) otherwise Continue using cached resource, if any; otherwise, fail data plane RPCs
Invalid resource update (client NACK) OnError() OnResourceChanged(status) if resource NOT already cached; OnAmbientError(status) otherwise Continue using cached resource, if any; otherwise, fail data plane RPCs
Valid resource update OnUpdate(resource) OnResourceChanged(resource) use the new resource

RELEASE NOTES:

  • xds: TBD

@purnesh42H purnesh42H added Type: Feature New features or improvements in behavior Area: xDS Includes everything xDS related, including LB policies used with xDS. labels Jan 2, 2025
@purnesh42H purnesh42H added this to the 1.70 Release milestone Jan 2, 2025
@purnesh42H purnesh42H self-assigned this Jan 2, 2025
@purnesh42H purnesh42H force-pushed the a88-watcher-api branch 7 times, most recently from 8d198b7 to a9b45c0 Compare January 3, 2025 16:28
@purnesh42H purnesh42H requested a review from markdroth January 3, 2025 16:58
@purnesh42H purnesh42H assigned dfawley and markdroth and unassigned purnesh42H Jan 3, 2025
@purnesh42H purnesh42H requested a review from dfawley January 3, 2025 16:59
@purnesh42H purnesh42H force-pushed the a88-watcher-api branch 2 times, most recently from 4009e3e to 57dbf23 Compare January 6, 2025 12:03
Copy link

codecov bot commented Jan 6, 2025

Codecov Report

Attention: Patch coverage is 81.36646% with 30 lines in your changes missing coverage. Please review.

Project coverage is 82.45%. Comparing base (724f450) to head (b9d2a92).
Report is 44 commits behind head on master.

Files with missing lines Patch % Lines
xds/internal/testutils/resource_watcher.go 20.00% 11 Missing and 1 partial ⚠️
xds/internal/server/listener_wrapper.go 42.10% 9 Missing and 2 partials ⚠️
.../balancer/clusterresolver/resource_resolver_eds.go 78.57% 2 Missing and 1 partial ⚠️
xds/internal/resolver/xds_resolver.go 66.66% 2 Missing ⚠️
xds/internal/xdsclient/clientimpl_watchers.go 33.33% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7977      +/-   ##
==========================================
+ Coverage   82.28%   82.45%   +0.17%     
==========================================
  Files         381      388       +7     
  Lines       38539    39048     +509     
==========================================
+ Hits        31712    32198     +486     
+ Misses       5535     5530       -5     
- Partials     1292     1320      +28     
Files with missing lines Coverage Δ
xds/internal/balancer/cdsbalancer/cdsbalancer.go 81.69% <100.00%> (-0.07%) ⬇️
...s/internal/balancer/cdsbalancer/cluster_watcher.go 100.00% <100.00%> (ø)
xds/internal/resolver/watch_service.go 100.00% <100.00%> (+8.57%) ⬆️
xds/internal/server/rds_handler.go 90.69% <100.00%> (-0.61%) ⬇️
xds/internal/xdsclient/authority.go 77.21% <100.00%> (+0.34%) ⬆️
...nal/xdsclient/xdsresource/cluster_resource_type.go 78.37% <100.00%> (+1.23%) ⬆️
...l/xdsclient/xdsresource/endpoints_resource_type.go 76.47% <100.00%> (+1.47%) ⬆️
...al/xdsclient/xdsresource/listener_resource_type.go 86.44% <100.00%> (+0.47%) ⬆️
...ds/internal/xdsclient/xdsresource/resource_type.go 100.00% <ø> (ø)
...dsclient/xdsresource/route_config_resource_type.go 76.47% <100.00%> (+1.47%) ⬆️
... and 5 more

... and 58 files with indirect coverage changes

@purnesh42H purnesh42H force-pushed the a88-watcher-api branch 4 times, most recently from 43c9adb to 89f475a Compare January 6, 2025 12:26
@purnesh42H purnesh42H requested a review from easwars January 10, 2025 16:35
@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Jan 10, 2025
xds/internal/xdsclient/xdsresource/resource_type.go Outdated Show resolved Hide resolved
xds/internal/xdsclient/xdsresource/resource_type.go Outdated Show resolved Hide resolved
xds/internal/resolver/xds_resolver.go Outdated Show resolved Hide resolved
xds/internal/resolver/xds_resolver.go Outdated Show resolved Hide resolved
@easwars easwars assigned purnesh42H and unassigned easwars, markdroth and dfawley Jan 10, 2025
xds/internal/xdsclient/xdsresource/resource_type.go Outdated Show resolved Hide resolved
xds/internal/xdsclient/xdsresource/resource_type.go Outdated Show resolved Hide resolved
Comment on lines 107 to 111
// If resource is already cached, it is invoked under different error
// conditions including but not limited to the following:
// - resource validation error
// - ADS stream failure
// - connection failure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also depends on a number of things like the server feature being set etc. So, I think we can get rid of this block as well. Somewhere in ResourceWatcher documentation, we should link to A88 (probably at the top) and say that it contains an exhaustive list of what method is invoked under what conditions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the documentation of the individual watchers as well based on the two comments here. Thanks.

Copy link
Contributor Author

@purnesh42H purnesh42H Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed it. I have mentioned gRFC A88 at the top to refer for exhaustive list. I saw other places in code where we just mention the grfc and don't link. Also, the proposal is not merged yet.

One more i felt worth mentioning is about whether to continue reusing the existing resource or not. For OnAmbientError(), they may continue using it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The language from the gFRC is much stricter than yours:

In particular, the watcher should not stop using the previously seen resource, and the XdsClient will not remove the resource from its cache

So, please mention that the watcher should not stop using the previously seen resource and not the watcher may continue using the previously seen resource

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Updated.

@easwars easwars assigned purnesh42H and unassigned easwars Jan 30, 2025
@purnesh42H purnesh42H requested a review from easwars January 31, 2025 11:15
@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Jan 31, 2025
@purnesh42H
Copy link
Contributor Author

@easwars ptal

@@ -525,15 +525,16 @@ func (b *cdsBalancer) onClusterError(name string, err error) {
// TRANSIENT_FAILURE.
//
// Only executed in the context of a serializer callback.
func (b *cdsBalancer) onClusterResourceNotFound(name string) {
err := xdsresource.NewErrorf(xdsresource.ErrorTypeResourceNotFound, "resource name %q of type Cluster not found in received response", name)
func (b *cdsBalancer) onClusterResourceChangedError(name string, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment at the top of this method needs to be updated. This method now handles not only resource-not-found errors, but other types of errors as well that instruct the LB policy to stop using the previously received resource.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Also, updated the onClusterAmbient error to not stop using the previously seen resource

@purnesh42H purnesh42H requested a review from easwars February 5, 2025 08:50
@@ -194,7 +194,7 @@ func (a *authority) handleADSStreamFailure(serverConfig *bootstrap.ServerConfig,
for watcher := range state.watchers {
watcher := watcher
a.watcherCallbackSerializer.TrySchedule(func(context.Context) {
watcher.OnError(xdsresource.NewErrorf(xdsresource.ErrorTypeConnection, "xds: error received from xDS stream: %v", err), func() {})
watcher.OnAmbientError(xdsresource.NewErrorf(xdsresource.ErrorTypeConnection, "xds: error received from xDS stream: %v", err), func() {})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the PR description, for the following two cases:

  • xDS channel reports TRANSIENT_FAILURE |
  • ADS stream terminates without receiving a response
    the old behavior was to call OnError(), while the new behavior is to call OnResourceChanged(status) if resource NOT already cached, and OnAmbientError(status) otherwise.

But, we are only calling the latter here. Am I missing something?

@easwars
Copy link
Contributor

easwars commented Feb 6, 2025

This came up during some other discussion with @dfawley.

  1. Callback methods named OnXxx are more of a C++ style. In Go, we would probably just name them Xxx.
  2. Whether we should have two callback methods, one named ResourceChanged(ResourceData), and the other named ResourceError(error). This is to work around the fact that we don't have a StatusOr type in Go, and without that, having two APIs is better than one API (where we need to add a new struct to hold one of two possible values, with no guarantees that only one of the two values is set).

@easwars easwars assigned purnesh42H and unassigned easwars Feb 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants