Skip to content

feat(tracing, lambda-runtime): add support for custom writer with default tracing subscriber, add turnkey graceful shutdown helper behind 'graceful-shutdown' feature flag #982

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 8, 2025

Conversation

jlizen
Copy link
Contributor

@jlizen jlizen commented May 4, 2025

📬 Issue #, if available:
Closes #983
✍️ Description of changes:

Custom writer support for default tracing subscriber

The current tracing subscriber uses the default stdout writer. This mostly is the right choice, but it does run the risk of occasionally blocking the tokio runtime - ref: tokio-rs/tracing#2653

For high throughput, performance sensitive lambdas that have async concurrency in their handlers, they might instead prefer to use a non-blocking writer on a dedicated logging thread.

You can do that today by initializing a totally custom tracing subscriber, but the default one has convenient integrations with some native Lambda ENVs, that I'd like to keep using.

Ideally we would have an API to return the configured fmt::SubscriberBuilder rather than implicitly initializing it in our helper. But, the tracing types are very unpleasant to work with due to the JSON vs full formatting choice. We would need to introduce a layer of dynamic dispatch, custom traits, or something else like that, if we wanted to bake in the handling for the AWS_LAMBDA_LOG_FORMAT ENV. We could also have a default subscriber builder that adds the JSON format at the end, but that felt very heavyweight for a simple 'default subscriber' helper API.

I felt that adding a custom writer is an important enough knob that a new init_default_subscriber_with_writer() API was warranted. It is much simpler code.

I don't think the semantic version of risk of surfacing the fmt::MakeWriter trait bound is significant - first of all, we already re-export all the tracing types, so presumably we would treat a major version tracing-subscriber bump as our own major version bump. Second, the MakeWriter trait is very simple and seems unlikely to change; it just expects a function that returns something that implements std::io::Writer.

Note that we also now explicitly initialize init_default_subscriber() using STDOUT writer. This is the current default behavior of tracing subscriber, but it could presumably change across a major version bump. Probably better to pin it and make it a conscious choice on our own end to change it, if we later desire.

Documentation decisions

I didn't explicitly document this in the README, since I think it would add confusion as it is not the right choice for most users. I do reference the API in the new Graceful shutdown README section, which I think is a useful breadcrumb for the sort of user that might need it.

The function docs do give a pretty good overview of what scenarios it might be appropriate for.

Turnkey 'graceful shutdown hook'

Previously I had a more extended graceful shutdown example that added manual non-op extension registry, signal handling, etc. @bnusunny pointed out that this could be very cheaply abstracted away by the library, and he was right!

I went with a route that was pretty opinionated and on-rails, to reduce cognitive load of the API. Specifically:

  • I implicitly register a dummy non-op extension that is subscribed to no events
  • I implicitly construct the signal handling loop for SIGTERM and SIGINT, wrapping the user-provided hook
  • I implicitly spawn a task to drive the signal handling + extension future, rather than returning to to the caller to interact with
  • I panic on unrecoverable errors rather than returning a fallible JoinHandle or since the caller generally will be thinking of this as a background task

Note that the dummy extension is very low cost, since the extension.nextEvent API will keep a connection but then hang indefinitely if no events come. So it won't frequently wake the task that contains it and the signal handling future. Generally waking will be from the signal handling. The cost is essentially the memory to keep the connection open (and also in my testing, the extension future was exiting on run() when no events were registered, though I wouldn't count on that behavior).

Nothing is stopping the caller from manually constructing a more sophisticated set of signal handling, avoiding registering a dummy extension if they already have another one, etc.

I did reference these choices in the function docs. I also shared the panic conditions as well as a realistic usage example. I also added a small section in the README to help with discoverability.

Anyway, I think that this is the API that is proper for 95%+ of callers, and anyway we can expand on it if we need to. I assume at that point it probably is worth baking into the Runtime type since anyway that is where we add more granularity of behavior.

Dependency changes

I added a new feature flag for this, graceful-shutdown, since it adds new dependencies:

  • tokio's signal feature
  • lambda-extension

Also worth noting is that this is the first time we actually need the tokio rt feature (tokio::signal::signal() requires it, as does task::spawn()). That matters more for #984 , but I made this feature a default feature to make it easier to rip that out without breaking people if we ever remove the vestigial dependency on rt/rt-multi-threaded without feature flags (across a semver boundary).

I didn't add a lib.rs passage documenting this new feature flag, but I did mention it in a new README section. I also added some docs.rs metatags to give us the nice feature indicator for both this as well as the tracing module re-export.

Testing

I didn't see any great existing unit tests or integration tests that can be run locally, to extend to cover this case. My doctest validates that it will compile, but not that it runs, since it would hang indefinitely waiting for events. It would be cool to set up some sort of build-time integ testing that uses cargo lambda, but that wasn't scope I had time for.

I did throw in the graceful shutdown helper in our integration test that presumably runs against github actions - that at least will capture a fatal regression in our logic implicitly spinning up the non-op extension, etc, that would prevent the function from starting properly. Open to cutting a new integration test to run in CI that more specifically probes this, if folks feel it is necessary. Itust will be a bit awkward, since we'll need to have it sleep for a while to let the function spin down. Also either we will need sort of callback from the function to call against our github CI runner from inside the shutdown hook (aka expose a server in Github CI), or else poll eg CloudWatch Metrics or CloudWatch Logs to validate that the graceful shutdown logic fired.

Manual e2e testing

I did test a complete example manually both:

  • locally (with my cargo-lambda watch signal handling fix in place - ref
  • after deploying to an AWS account

Both showed the expected behavior:

  • function runs and responds properly to handlers
  • graceful shutdown shows up in logs on runtime shutdown:

Full AWS logs are at the end of this section.

I previously had my test code checked in as a complete examples/ folder code. But, it felt kind of ridiculous when literally the doc comment was sufficient to demonstrate the behavior. I'm glad to add it back in if we think it is useful. Another option would be to use that example to show a more 'custom' signal handling use case that doesn't use our helper, but I was concerned that that would confuse readers.

AWS logs:

INIT_START Runtime Version: provided:al2023.v90	Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:e49c711d78a410691cb8ac89b78502f273c233e01eed3d50c5ae2d6acc83b037
EXTENSION	Name: _lambda-rust-runtime-no-op-graceful-shutdown-helper	State: Ready	Events: []
START RequestId: adfa53cb-4372-4559-9879-be77c28359d6 Version: $LATEST
INFO Lambda runtime invoke{requestId="adfa53cb-4372-4559-9879-be77c28359d6" xrayTraceId="Root=1-681a76f5-2af5130749044a1110e28326;Parent=359483ddc68179b2;Sampled=0;Lineage=1:99116690:0"}: executing hello
END RequestId: adfa53cb-4372-4559-9879-be77c28359d6
REPORT RequestId: adfa53cb-4372-4559-9879-be77c28359d6	Duration: 2.18 ms	Billed Duration: 41 ms	Memory Size: 128 MB	Max Memory Used: 21 MB	Init Duration: 38.10 ms	
START RequestId: 5c971ea0-e406-474f-b952-24f9b76f876b Version: $LATEST
INFO Lambda runtime invoke{requestId="5c971ea0-e406-474f-b952-24f9b76f876b" xrayTraceId="Root=1-681a76f9-39ac3cb87f1fa86462caaa8c;Parent=76252a84c5d7fe51;Sampled=0;Lineage=1:99116690:0"}: executing hello
END RequestId: 5c971ea0-e406-474f-b952-24f9b76f876b
REPORT RequestId: 5c971ea0-e406-474f-b952-24f9b76f876b	Duration: 1.25 ms	Billed Duration: 2 ms	Memory Size: 128 MB	Max Memory Used: 21 MB	
[runtime] SIGTERM received
[runtime] Graceful shutdown in progress ...
[runtime] Graceful shutdown completed

🔏 By submitting this pull request

  • [x ] I confirm that I've ran cargo +nightly fmt.
  • [x ] I confirm that I've ran cargo clippy --fix.
  • [x ] I confirm that I've made a best effort attempt to update all relevant documentation.
  • [x ] I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jlizen
Copy link
Contributor Author

jlizen commented May 6, 2025

A couple updates:

I do have a PR out to tokio for biased support in try_join:
tokio-rs/tokio#7307

If we merge this with the TODO still in place, I can circle back and cut it over once it lands.

I also have cargo lambda watch working locally against the example with this fix in place:
cargo-lambda/cargo-lambda#853

I'll push a tweak to the README tomorrow giving local testing instructions once that lands.

@jlizen
Copy link
Contributor Author

jlizen commented May 6, 2025

I might scope creep this PR to also add in a turnkey 'graceful shutdown' hook - #983

But I'd also be happy to merge it as is if a maintainer reviews it before then. In that case I'd backport whatever new API for graceful shut down to the example.

@jlizen jlizen force-pushed the tracing-non-blocking branch from 56922f3 to 5c60ffc Compare May 6, 2025 21:05
@jlizen jlizen changed the title feat(tracing, docs): add support for custom writer with default tracing subscriber, add graceful shutdown example using a non-blocking writer feat(tracing, lambda-runtime): add support for custom writer with default tracing subscriber, add turnkey graceful shutdown helper behind 'graceful-shutdown' feature flag May 6, 2025
@jlizen jlizen force-pushed the tracing-non-blocking branch 4 times, most recently from db25a8a to ff784ee Compare May 6, 2025 21:36
@jlizen jlizen force-pushed the tracing-non-blocking branch from 82ed6ba to eacb593 Compare May 7, 2025 16:30
@jlizen jlizen requested a review from bnusunny May 7, 2025 16:31
Copy link
Contributor

@bnusunny bnusunny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature request] Add graceful shutdown hook for functions
2 participants