Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle timeouts more gracefully by allowing the application to shutdown #895

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions couscous.yml
Original file line number Diff line number Diff line change
Expand Up @@ -114,6 +114,10 @@ menu:
text: Databases
url: /docs/environment/database.html
title: Using a database from AWS Lambda
timeouts:
text: Timeouts
url: /docs/environment/timeouts.html
title: Configure and handle timeouts
custom-domains:
text: Custom domains
url: /docs/environment/custom-domains.html
Expand Down
2 changes: 2 additions & 0 deletions docs/environment/database.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,8 @@ When possible, an alternative to NAT Gateways is to split the work done by a lam

Finally, another free alternative to NAT Gateway is to access AWS services by creating "*private VPC endpoints*": this is possible for S3, API Gateway, [and more](https://docs.aws.amazon.com/en_pv/vpc/latest/userguide/vpc-endpoints-access.html).

Read more in the section about [timeouts](/docs/environment/timeouts.md).

## Creating a database

On the [RDS console](https://console.aws.amazon.com/rds/home):
Expand Down
63 changes: 63 additions & 0 deletions docs/environment/timeouts.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
---
title: Timeouts
current_menu: timeouts
introduction: Configure and handle timeouts.
---

When a Lambda function times out, it is like the power to the computer is suddenly
just turned off. This does not give the application a chance to shut down properly.
This leaves you without any logs and the problem could be hard to fix.

To allow your application to shut down properly and write logs, Bref can throw an exception just before the Lambda times out.

> Note, this feature is experimental and available since Bref 1.3.

To enable this feature **in `php-XX` layers**, set the environment variable `BREF_FEATURE_TIMEOUT`:

```yaml
provider:
environment:
BREF_FEATURE_TIMEOUT: 1
```

To enable this feature **in `php-XX-fpm` layers**, call `Timeout::enableInFpm()` in your application.
For example in `index.php`:

```php
if (isset($_SERVER['LAMBDA_TASK_ROOT'])) {
\Bref\Timeout\Timeout::enableInFpm();
}
```

Whenever a timeout happens, a full stack trace will be logged, including the line that was executing.

In most cases, it is an external call to a database, cache or API that is stuck waiting.
If you are using a RDS database, [you are encouraged to read this section](database.md#accessing-the-internet).

## Catching the exception

You can catch the timeout exception to perform some cleanup, logs or even display a proper error page.

In `php-XX-fpm` layers, most frameworks will catch the `LambdaTimeout` exception automatically (like any other error).

In `php-XX` layers, you can catch it in your handlers. For example:

```php
use Bref\Context\Context;
use Bref\Timeout\LambdaTimeout;

class Handler implements \Bref\Event\Handler
{
public function handle($event, Context $context)
{
try {
// your code here
// ...
} catch (LambdaTimeout $e) {
echo 'Oops, sorry. We spent too much time on this.';
} catch (\Throwable $e) {
echo 'Some other unexpected error happened.';
}
}
}
```
9 changes: 9 additions & 0 deletions runtime/layers/fpm/bootstrap
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,15 @@ use Bref\Runtime\LambdaRuntime;
ini_set('display_errors', '1');
error_reporting(E_ALL);

/**
* Disable the Bref timeout feature in the main bootstrap process.
* Indeed it doesn't make sense here because the application code runs in the
* FPM worker process, not in this process (the bootstrap process). We don't want to
* terminate the bootstrap process while leaving the FPM worker running.
* That's why we force `BREF_FEATURE_TIMEOUT` to be disabled.
*/
putenv('BREF_FEATURE_TIMEOUT=0');

$appRoot = getenv('LAMBDA_TASK_ROOT');

if (getenv('BREF_DOWNLOAD_VENDOR')) {
Expand Down
17 changes: 15 additions & 2 deletions src/Runtime/LambdaRuntime.php
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
use Bref\Context\Context;
use Bref\Context\ContextBuilder;
use Bref\Event\Handler;
use Bref\Timeout\Timeout;
use Exception;
use Psr\Http\Server\RequestHandlerInterface;

Expand Down Expand Up @@ -42,19 +43,23 @@ final class LambdaRuntime
/** @var Invoker */
private $invoker;

/** @var bool */
private $enableTimeout;

public static function fromEnvironmentVariable(): self
{
return new self((string) getenv('AWS_LAMBDA_RUNTIME_API'));
return new self((string) getenv('AWS_LAMBDA_RUNTIME_API'), (bool) getenv('BREF_FEATURE_TIMEOUT'));
}

public function __construct(string $apiUrl)
private function __construct(string $apiUrl, bool $enableTimeout)
{
if ($apiUrl === '') {
die('At the moment lambdas can only be executed in an Lambda environment');
}

$this->apiUrl = $apiUrl;
$this->invoker = new Invoker;
$this->enableTimeout = $enableTimeout;
}

public function __destruct()
Expand Down Expand Up @@ -96,6 +101,12 @@ public function processNextEvent($handler): void
[$event, $context] = $this->waitNextInvocation();
\assert($context instanceof Context);

$remainingTimeInMillis = $context->getRemainingTimeInMillis();
if ($this->enableTimeout && 0 < $remainingTimeInMillis) {
// Throw exception one second before Lambda pulls the plug.
Timeout::enable($context->getRemainingTimeInMillis());
}

$this->ping();

try {
Expand All @@ -104,6 +115,8 @@ public function processNextEvent($handler): void
$this->sendResponse($context->getAwsRequestId(), $result);
} catch (\Throwable $e) {
$this->signalFailure($context->getAwsRequestId(), $e);
} finally {
Timeout::reset();
}
}

Expand Down
12 changes: 12 additions & 0 deletions src/Timeout/LambdaTimeout.php
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
<?php declare(strict_types=1);

namespace Bref\Timeout;

/**
* The application took too long to produce a response. This exception is thrown
* to give the application a chance to flush logs and shut it self down before
* the power to AWS Lambda is disconnected.
*/
class LambdaTimeout extends \RuntimeException
{
}
108 changes: 108 additions & 0 deletions src/Timeout/Timeout.php
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
<?php declare(strict_types=1);

namespace Bref\Timeout;

/**
* Helper class to trigger an exception just before the Lambda times out. This
* will give the application a chance to shut down.
*/
final class Timeout
{
/** @var bool */
private static $initialized = false;

/** @var string|null */
private static $stackTrace = null;

/**
* Automatically setup a timeout (based on the AWS Lambda timeout).
*
* This method can only be called when running in PHP-FPM, i.e. when using a `php-XX-fpm` layer.
*/
public static function enableInFpm(): void
{
if (! isset($_SERVER['LAMBDA_INVOCATION_CONTEXT'])) {
throw new \LogicException('Could not find value for bref timeout. Are we running on Lambda?');
}

$context = json_decode($_SERVER['LAMBDA_INVOCATION_CONTEXT'], true, 512, JSON_THROW_ON_ERROR);
$deadlineMs = $context['deadlineMs'];
$remainingTimeInMillis = $deadlineMs - intval(microtime(true) * 1000);

self::enable($remainingTimeInMillis);
}

/**
* @internal
*/
public static function enable(int $remainingTimeInMillis): void
{
self::init();

$remainingTimeInSeconds = (int) floor($remainingTimeInMillis / 1000);

// The script will timeout 2 seconds before the remaining time
// to allow some time for Bref/our app to recover and cleanup
$margin = 2;

$timeoutDelayInSeconds = max(1, $remainingTimeInSeconds - $margin);

// Trigger SIGALRM in X seconds
pcntl_alarm($timeoutDelayInSeconds);
}

/**
* Setup custom handler for SIGALRM.
*/
private static function init(): void
{
self::$stackTrace = null;

if (self::$initialized) {
return;
}

if (! function_exists('pcntl_async_signals')) {
trigger_error('Could not enable timeout exceptions because pcntl extension is not enabled.');
return;
}

pcntl_async_signals(true);
// Setup a handler for SIGALRM that throws an exception
// This will interrupt any running PHP code, including `sleep()` or code stuck waiting for I/O.
pcntl_signal(SIGALRM, function (): void {
if (Timeout::$stackTrace !== null) {
// we already thrown an exception, do a harder exit.
error_log('Lambda timed out');
error_log((new LambdaTimeout)->getTraceAsString());
error_log('Original stack trace');
error_log(Timeout::$stackTrace);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, I'm testing and notice that error_log adds noise in FPM:

NOTICE: PHP message: Lambda timed out
NOTICE: PHP message: #0 /var/task/demo/http.php(16): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}
NOTICE: PHP message: Original stack trace
NOTICE: PHP message: #0 /var/task/demo/http.php(9): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}

I'm changing that to:

                $stderr = fopen('php://stderr', 'w');
                fwrite($stderr, 'Lambda timed out' . PHP_EOL);
                fwrite($stderr, (new LambdaTimeout)->getTraceAsString() . PHP_EOL);
                fwrite($stderr, 'Original stack trace' . PHP_EOL);
                fwrite($stderr, Timeout::$stackTrace . PHP_EOL);
                fclose($stderr);

Which gives a cleaner log:

Lambda timed out
#0 /var/task/demo/http.php(16): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}
Original stack trace
#0 /var/task/demo/http.php(9): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}

No need to update the PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didnt know.

Thank you for testing.


exit(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we force a 500 status code before exit?

That would only be taken into account in the FPM layer, so that wouldn't really impact the php-XX layer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How?
And wouldn't that effect my AWS::ApiGateway::GatewayResponse that triggers on timeouts or 500?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http_response_code(500);

And wouldn't that effect my AWS::ApiGateway::GatewayResponse that triggers on timeouts or 500?

🤔 not sure about that, maybe. But in any case, the same would happen if your app returns a pretty 500 error page.

Does that make sense? What I mean is that, when using FPM, there would be no difference between:

  • Symfony catches any exception and displays an error page with a 500 status code
  • Bref timeout kicks in and forces a 500 status code

In both cases, API Gateway would behave the same way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, no that is not true.

Symfony catches any exception and displays an error page with a 500 status code

This will show you a nice error page. API Gateway does nothing

Bref timeout kicks in and forces a 500 status code

No content(HTML) is provided. I would like the custom error page in API Gateway to be used.


Im not sure if http_response_code(500) would make any difference.


Also note. We are not sure it is a HTTP context. It could be any other event.. So Im not sure about sending a HTTP status code.

}

$exception = new LambdaTimeout('Maximum AWS Lambda execution time reached');
Timeout::$stackTrace = $exception->getTraceAsString();

// Trigger another alarm after 1 second to do a hard exit.
pcntl_alarm(1);

throw $exception;
});

self::$initialized = true;
}

/**
* Cancel all current timeouts.
*
* @internal
*/
public static function reset(): void
{
if (self::$initialized) {
pcntl_alarm(0);
self::$stackTrace = null;
}
}
}
34 changes: 33 additions & 1 deletion tests/Runtime/LambdaRuntimeTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,8 @@
use Bref\Event\Sqs\SqsHandler;
use Bref\Runtime\LambdaRuntime;
use Bref\Test\Server;
use Bref\Timeout\LambdaTimeout;
use Bref\Timeout\Timeout;
use GuzzleHttp\Psr7\Response;
use PHPUnit\Framework\TestCase;
use Psr\Http\Message\ResponseInterface;
Expand All @@ -35,13 +37,41 @@ protected function setUp(): void
{
ob_start();
Server::start();
$this->runtime = new LambdaRuntime('localhost:8126');
// Mock the API with our test server
putenv('AWS_LAMBDA_RUNTIME_API=localhost:8126');
$this->runtime = LambdaRuntime::fromEnvironmentVariable();
}

protected function tearDown(): void
{
Server::stop();
ob_end_clean();
// Cleanup
putenv('AWS_LAMBDA_RUNTIME_API=');
putenv('BREF_FEATURE_TIMEOUT=');
}

public function test Lambda timeouts can be anticipated()
{
// Re-create the runtime class with timeout prevention enabled
putenv('BREF_FEATURE_TIMEOUT=1');
$this->runtime = LambdaRuntime::fromEnvironmentVariable();

$this->givenAnEvent([]);

$start = microtime(true);
$this->runtime->processNextEvent(function () {
// This 10s sleep should be interrupted
sleep(10);
});

$elapsedTime = microtime(true) - $start;
// The Lambda timeout was 2 seconds, we expect the Bref timeout to trigger 1 second before that: 1 second
$this->assertEqualsWithDelta(1, $elapsedTime, 0.2);
Timeout::reset();

$this->assertInvocationErrorResult(LambdaTimeout::class, 'Maximum AWS Lambda execution time reached');
$this->assertErrorInLogs(LambdaTimeout::class, 'Maximum AWS Lambda execution time reached');
}

public function test basic behavior()
Expand Down Expand Up @@ -348,6 +378,8 @@ private function givenAnEvent($event): void
[
'lambda-runtime-aws-request-id' => '1',
'lambda-runtime-invoked-function-arn' => 'test-function-name',
// now + 2 seconds
'lambda-runtime-deadline-ms' => intval((microtime(true) + 2) * 1000),
],
json_encode($event)
),
Expand Down
16 changes: 12 additions & 4 deletions tests/Server.php
Original file line number Diff line number Diff line change
Expand Up @@ -140,6 +140,13 @@ public static function stop()
if (self::$started) {
self::getClient()->request('DELETE', 'guzzle-server');
}
$tries = 0;
while (self::isListening() && ++$tries < 5) {
usleep(100000);
}
if (self::isListening()) {
throw new \RuntimeException('Unable to stop node.js server');
}
self::$started = false;
}

Expand All @@ -159,11 +166,12 @@ public static function start()
if (self::$started) {
return;
}
if (! self::isListening()) {
exec('node ' . __DIR__ . '/server.js '
. self::$port . ' >> /tmp/server.log 2>&1 &');
self::wait();
if (self::isListening()) {
throw new \Exception('Server is already running');
}
exec('node ' . __DIR__ . '/server.js '
. self::$port . ' >> /tmp/server.log 2>&1 &');
self::wait();
self::$started = true;
}

Expand Down
Loading