Linkerd Multicluster Authentication and HTTP Balancer failures #7115
-
Good morning. My team is having an issue with Linkerd multicluster, and we aren't sure how to fix or even diagnose it. The Linkerd gateway pod for multicluster (namespace: linkerd-multicluster) keeps throwing this error repeatedly (using Linkerd 2.10):

[ 276.621867s] WARN ThreadId(01) inbound:accept{client.addr=:42340 target.addr=:4143}: linkerd_app_core::errors: Failed to proxy request: HTTP Balancer service in fail-fast

We re-created the trust anchor and issuing certificates and applied them to both clusters. Although we relinked the clusters, and linkerd check shows everything for multicluster as OK (indeed, the whole cluster shows as "OK"), we continue to see these errors and can observe connectivity issues where services on one cluster cannot query services on the other.

To us, this smells like an mTLS issue, since the logs also say "Direct connections must be mutually authenticated", but with completely regenerated certs this shouldn't be a problem, I think. What would we need to do, and what should we check to get more information?

Some discussion from Slack below (thanks William!):

william: I'm not a multicluster expert by any means, but as a first guess, is the service itself alive and responding to requests?

We've been working at this for days, and at the moment we're looking at a full reinstall of Linkerd. Any advice on how we can begin to diagnose these very general errors, given that linkerd check reports OK on everything?
Replies: 3 comments 1 reply
-
If this were a trust root issue, I'd expect a different error, most likely TLS validation errors. In this case, it sounds like the client isn't trying to establish TLS to the gateway. You could try increasing the log level with an annotation like

What versions of Linkerd are you running? Do
-
Using Linkerd 2.10 (2.11 had issues getting past init, which I believe are being worked on). All linkerd checks pass, including --proxy and multicluster. We do get a warning about the version, as expected. We did eventually find out about proxy-log-level, but we didn't include info, so thanks! Here's the info above an HTTP Balancer error:
Here's some output from our mutual authentication required and identity required issues:
-
Thanks olix0r for taking a look at this. Apparently all of these issues stem from our network policy not being applied to our linkerd-multicluster namespace.

After 5 days of debugging, we noticed that our network policy was selecting on a label that was missing from our namespace. It looks like this label was manually applied when the cluster was created ages ago. When we reinstalled linkerd-multicluster, the namespace was deleted, and the label was deleted with it. As a result, none of the pods in linkerd-multicluster were able to talk to the rest of our cluster, and vice versa.

This also explains why all the linkerd checks passed: linkerd-multicluster could reach the other cluster, but none of the other pods could get there because they couldn't communicate with anything in the linkerd-multicluster namespace.

Perhaps we want to add a check to linkerd multicluster check to make sure linkerd proxy containers can reach linkerd-gateway?
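
For anyone hitting something similar, here is a rough Python sketch of the two things that would have caught this sooner: whether the namespace's labels still satisfy the label selector the network policy relies on, and whether a plain TCP connection to the gateway succeeds from inside the cluster. This is not part of linkerd check. The label key and value and the function names are made up for illustration, and the matcher only handles simple equality-style label matching, not set-based expressions.

```python
import socket


def namespace_matches(namespace_labels, match_labels):
    """Return True if every key/value pair required by the selector
    is present on the namespace (equality-based matching only)."""
    return all(namespace_labels.get(k) == v for k, v in match_labels.items())


def can_reach(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # Hypothetical label that the network policy selects on.
    required = {"network-policy": "allow-mesh"}

    # Labels as they were when the label had been applied by hand.
    before_reinstall = {"network-policy": "allow-mesh"}
    # Labels after the namespace was deleted and re-created by the reinstall.
    after_reinstall = {}

    print("before reinstall:", namespace_matches(before_reinstall, required))
    print("after reinstall:", namespace_matches(after_reinstall, required))
```

To use the reachability half, call can_reach with the gateway's address and port from a pod outside the linkerd-multicluster namespace. A False result while linkerd check still passes would point at something like a network policy rather than at certificates.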