GRPC Demo #798
base: main
Conversation
@@ -40,14 +40,6 @@ jobs:
- name: Build image
  run: |
    make update
# TODO: Re-enable this when json schemas are ready to come back
Just removing some cruft, not relevant
@@ -1,4 +1,9 @@
FROM golang:1.20.0 as go-build
RUN mkdir go && cd go && export GOPATH=/go && \
A tool that helps set up an mTLS key chain.
With mTLS requiring more infrastructure and cert management, we should also consider a solution for bearer-token-based authentication with support for no-downtime token rotation.
But here's the thing -- token-based authentication will, I feel, ultimately be even more infrastructure and management.
Once we start having multiple service-to-service sites, and once we want things like ACLs (for instance, seer shouldn't have access to hybrid cloud endpoints), you need identity management, which tokens don't have baked in. We'd have to /build/ all that out, and we'd /still/ end up needing infra tooling that allows no-downtime token rotation in the face of that identity management, which we'd have to implement in each place.
With mTLS, we have one workflow, with tools that already exist, that is industry standard.
We can use JWTs to bake in identities and carry end-user context into downstream services. We could really get a lot of benefit with the added context in the multi-tenant environment and any time we cross some sort of domain trust boundary.
mTLS alone does not lend itself well to nuanced authorization decisions, at least in my opinion. A combination of mTLS for service-to-service authentication (PeerAuthentication) and JWTs to carry end-user context (RequestAuthentication) is more robust than just one or the other; it's a fairly common pattern from my research. AuthorizationPolicies can be used to introspect the JWT's claims and make authorization decisions at the service mesh level.
In terms of added infrastructure, we'd be looking at some sort of trusted Secure Token Service (STS) to support the minting of JWTs. Tokens are issued through standard OAuth flows, and services would just implement JWT verification (if we decided to not just let the service mesh do this) and logic to pass it on where necessary.
Of course, implementing one thing at a time is probably best. 🙂
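For scale: the service-side verification half of that stays pretty small. A minimal sketch, assuming PyJWT and an STS that signs RS256 tokens (the issuer, audience, and key plumbing here are placeholders, not a design):

```python
import jwt  # PyJWT


def verify_bearer_token(token: str, sts_public_key: str) -> dict:
    # Raises jwt.InvalidTokenError on expiry, bad signature, or claim mismatch.
    return jwt.decode(
        token,
        sts_public_key,
        algorithms=["RS256"],
        audience="seer",                        # placeholder audience
        issuer="https://sts.internal.example",  # placeholder issuer
    )
```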
PROTOS_OUT=protos/build
PROTOS_SOURCE=protos/src
.PHONY: protos
protos: ## Regenerate protobuf python files
In this workflow, a single pip install gets you both the protos and the generated files together, making it relatively simple both to deploy and to use in development.
@@ -1,5 +1,16 @@
# seer

## gRPC experiment
TODO: Will remove; I don't intend to merge this into master. I have some ideas for how we can break this out into a workable workflow and separate the concerns.
@@ -0,0 +1,70 @@
static_resources:
I'm using envoy here because I know it has decent GKE support (https://cloud.google.com/kubernetes-engine/docs/tutorials/exposing-grpc-services-on-gke-using-envoy-proxy)
But to be honest, it wouldn't have to be GKE if we didn't want it to. Here, envoy handles the mTLS and the http 2 -> 1.1 proxying, but I'm confident both are also configurable in k8s with many other filters. That said, envoy is pretty nice and has a lot of advantages, including decent xDS configuration, which is why I feel inclined to bring it up for service-to-service at scale.
We have envoy in a few places in our infrastructure already. Using envoy for gRPC service discovery and mTLS termination should be OK. If we can avoid downgrading to HTTP/1.1, I think it would be worth the extra effort.
@@ -0,0 +1,237 @@
import contextlib
This is a minimal demo of creating a custom codegen tool. I didn't go out of my way to make it pretty; it could definitely be made nicer if we committed to this approach.
@@ -0,0 +1,115 @@
"""
NOTE: I do not actually intend to commit any of the generated files (everything in protos/py) in practice, but I did so in this demo to demonstrate the output.
    self, proto_request: ScoreRequest, context: grpc.ServicerContext
) -> ScoreResponse:
    identities = context.peer_identities()
    if not identities or b"consumer" not in identities:
Just a simple demo of how authorization can use peer information from mTLS.
Having to do authentication checks in each RPC method will be error prone. gRPC has interceptors which seem like a good fit for authentication methods.
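Roughly, an interceptor version could wrap every unary handler so the peer-identity check from this diff runs once, centrally (a sketch, not code from this PR):

```python
import grpc


class PeerIdentityInterceptor(grpc.ServerInterceptor):
    def __init__(self, allowed_identity: bytes = b"consumer"):
        self._allowed_identity = allowed_identity

    def intercept_service(self, continuation, handler_call_details):
        handler = continuation(handler_call_details)
        if handler is None or handler.unary_unary is None:
            return handler  # this sketch only wraps unary-unary handlers

        inner = handler.unary_unary

        def checked(request, context):
            # Same mTLS peer check as in the RPC body, hoisted out of it.
            identities = context.peer_identities()
            if not identities or self._allowed_identity not in identities:
                context.abort(grpc.StatusCode.PERMISSION_DENIED, "only consumer can access")
            return inner(request, context)

        return grpc.unary_unary_rpc_method_handler(
            checked,
            request_deserializer=handler.request_deserializer,
            response_serializer=handler.response_serializer,
        )


# server = grpc.server(thread_pool, interceptors=[PeerIdentityInterceptor()])
```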
        context.abort(grpc.StatusCode.PERMISSION_DENIED, "only consumer can access")

    request = SeverityRequest()
    request.adapt_from(proto_request)
This is an example of a realistic rollout case: we have an existing endpoint, with production clients we want to keep working, that uses pydantic classes. We create and use a generated adaptor to convert between the protos and the pydantic classes during the rollout, so we can incrementally change the client side to use protos natively. After that, the implementation itself could use the protos natively and possibly discard the json endpoints if they are no longer necessary.
It is also possible to use native proto <-> json format tools, but there are some caveats we should discuss. I think in either case these adaptors are safe tools for rolling out to existing endpoints.
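To make the shape concrete, here's roughly how a domain model wires into the generated FromScoreRequestAdaptor mixin shown further down in this PR (the import path is hypothetical; pydantic is what the existing endpoint already uses):

```python
from pydantic import BaseModel

# hypothetical path for the generated adaptor mixin shown later in this PR
from protos.py.sentry_services.seer.severity_adaptors import FromScoreRequestAdaptor


class SeverityRequest(BaseModel, FromScoreRequestAdaptor):
    message: str = ""
    has_stacktrace: bool = False

    # Generated adapt_from_* hooks are no-ops by default; the domain model
    # opts in field by field.
    def adapt_from_message(self, value: str):
        self.message = value

    def adapt_from_has_stacktrace(self, value: int):
        self.has_stacktrace = bool(value)


# request = SeverityRequest()
# request.adapt_from(proto_request)  # only the wired fields get copied
```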
from seer.severity.severity_inference import SeverityRequest
In this case, I've implemented a custom flask transport for grpc. Because we are using an http 2 -> http 1.1 bridge, we end up needing to implement some grpc server details here to support that. To use the native gRPC server, we'd have to support http 2, and we'd likely have to run separate servers in situations where we are deploying alongside an existing django or flask app.
This compromise demonstrates that if we really want to commit to one server for both gRPC and our existing http 1.1 servers, it is possible with a modest amount of legwork.
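For a sense of the mechanics (a sketch of the framing only, not this PR's actual transport, and handle_score is a hypothetical application handler): a unary call that envoy has downgraded to HTTP/1.1 arrives as a POST whose body is a 5-byte gRPC frame (compression flag plus big-endian message length) around the serialized proto.

```python
import struct

from flask import Flask, Response, request

from sentry_services.seer.severity_pb2 import ScoreRequest

app = Flask(__name__)


@app.post("/sentry_services.seer.SeverityService/Score")  # service path assumed
def score():
    body = request.get_data()
    # gRPC wire frame: 1-byte compression flag + 4-byte big-endian length.
    _compressed, length = struct.unpack(">BI", body[:5])
    proto_request = ScoreRequest.FromString(body[5 : 5 + length])
    proto_response = handle_score(proto_request)  # hypothetical app handler
    payload = proto_response.SerializeToString()
    return Response(
        struct.pack(">BI", 0, len(payload)) + payload,
        content_type="application/grpc",
        # Native gRPC sends grpc-status as an HTTP trailer; the http1 bridge
        # is what lets a plain header stand in for it here.
        headers={"grpc-status": "0"},
    )
```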
I think in production we would want separate deployments for RPC and web traffic. However, in local development a single server would be nice to have.
# response.raise_for_status()
# print(response.json())

with grpc.secure_channel(
Fortunately, grpc clients will still natively use http 2 and do not need to be aware of how we implement the http 1.1 bridge.
Note: it's also possible to use envoy as a client-side front end, meaning that application code would not need its own credentials files. In this sort of deployment, the envoy sidecar container holds the actual client TLS configuration, and the application code opens an insecure channel directly to the sidecar, which handles mTLS on its behalf.
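A minimal sketch of that client shape (listener port assumed):

```python
import grpc

# The app speaks plaintext to the local envoy listener; the sidecar holds the
# client certs and originates mTLS to the remote service on the app's behalf.
channel = grpc.insecure_channel("localhost:9901")
# stub = SeverityServiceStub(channel)  # generated stub; calls are unchanged
```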
pred = self.classifier.predict_proba(input_data)[0][1]

return SeverityResponse(severity=round(min(1.0, max(0.0, pred)), 2))
return SeverityResponse(severity=0.6)
Just to make the demo simpler, disabling the actual inference.
    if not identities or b"consumer" not in identities:
        context.abort(grpc.StatusCode.PERMISSION_DENIED, "only consumer can access")

    request = SeverityRequest()
Seems like you could almost make this a middleware or passthrough annotation that handles the conversion for you where necessary.
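Something like this, perhaps (decorator name is hypothetical, riffing on the adaptor API above):

```python
import functools


def adapts_request(domain_cls):
    """Convert the incoming proto into a domain model before the handler runs."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, proto_request, context):
            domain_request = domain_cls()
            domain_request.adapt_from(proto_request)
            return fn(self, domain_request, context)

        return wrapper

    return decorator


# @adapts_request(SeverityRequest)
# def score(self, request, context): ...
```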
@@ -0,0 +1,115 @@
"""
@generated by mypy-protobuf. Do not edit manually!
Aren't these classes generated by your code?
self.apply_to_message(proto)
self.apply_to_has_stacktrace(proto)
self.apply_to_handled(proto)
self.apply_to_trigger_timeout(proto)
self.apply_to_trigger_error(proto)
Should these method calls also be passing the val parameter?
class FromScoreRequestAdaptor:
    def adapt_from_message(self, value: builtins.str):
        pass

    def adapt_from_has_stacktrace(self, value: builtins.int):
        pass

    def adapt_from_handled(self, value: builtins.bool):
        pass

    def adapt_from_trigger_timeout(self, value: builtins.bool):
        pass

    def adapt_from_trigger_error(self, value: builtins.bool):
        pass

    def adapt_from(self, proto: sentry_services.seer.severity_pb2.ScoreRequest):
        self.adapt_from_message(proto.message)
        self.adapt_from_has_stacktrace(proto.has_stacktrace)
        if proto.HasField("handled"):
            self.adapt_from_handled(proto.handled)
        self.adapt_from_trigger_timeout(proto.trigger_timeout)
        self.adapt_from_trigger_error(proto.trigger_error)
Did you consider generating 'mapping functions' that accept a target object, and proto request, and return an updated target object? Currently we'll be adding many methods to domain models. A mapping function/builder wouldn't add methods to existing objects and should have similar type safety.
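Concretely, the suggestion is something shaped like this (illustrative, using the ScoreRequest fields from the generated adaptor above; not generated code from this PR):

```python
from sentry_services.seer.severity_pb2 import ScoreRequest
from seer.severity.severity_inference import SeverityRequest


def apply_score_request(target: SeverityRequest, proto: ScoreRequest) -> SeverityRequest:
    # A generated mapping function updates and returns the target, instead of
    # requiring the domain model to inherit adaptor methods.
    target.message = proto.message
    target.has_stacktrace = proto.has_stacktrace
    if proto.HasField("handled"):
        target.handled = proto.handled
    return target
```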
🤷 sure, I mean, there are tons of things we can do. My point with the demo here is merely to demonstrate that working solutions can be derived from a ~100-line script that makes the domain <-> wire transformation easier. When we get to the point of actual implementation, I figure it's something we iterate on more closely before we declare it standardized.
there are tons of things we can do. My point with the demo here is merely to demonstrate that working solutions can be derived from a ~100-line script that makes the domain <-> wire transformation easier. When we get to the point of actual implementation, I figure it's something we iterate on more closely before we declare it standardized.
That's fair. I wasn't sure if this was the design you were planning to move forward with.
I will say this: there are two main gotchas to consider that drove my current design, whether or not I stick with it:
- It is super important that adding new fields to protos and generating a new adaptor cannot break existing code. Because I imagine protos being generated in repos separately from their consumers, there can be version drift by default, which is actually good. But it also means the default generated code cannot presume any behavior if and when the domain model and proto versions drift (either can be updated before the other). In this case, the adaptor's default behavior is a no-op unless specifically wired.
- It won't always be the case that straight assignment works; some types need further, deeper transformation, so a simple assignment sometimes won't be enough. That's probably fine, as the domain model can write setters to receive attributes, and I suspect that is still the exception and not the rule.
A fully working gRPC demo. I'm working on a notion doc explaining all the parts, open questions, how to improve and expand on it, etc.
The basic idea is that this is a minimal working example of
https://www.notion.so/sentry/gRPC-Implementation-For-Seer-cb14fb94e90d4afaa4d80c664e0c9031?showMoveTo=true&saveParent=true