
Engine: Limit payload sizes of events #890

Merged
merged 17 commits into main from engine-limit on Mar 6, 2025

Conversation

@josephjclark (Collaborator) commented Mar 4, 2025

Short Description

This PR fixes an issue where large payloads, when passed from the child process to the main worker, can trigger an OOM exception that blows up the main worker thread.

Fixes #888

Implementation Details

This may be a temporary fix, I'm not sure. This area of the engine is so hard to debug that it's tricky to make progress.

Recall that the architecture looks like this:

  • The worker sits in a process of its own
  • It creates a pool of child processes
  • Each run then spins up a worker thread to execute inside
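
For orientation, here's a minimal sketch of that layout using Node's child_process and worker_threads APIs (the file names and structure here are illustrative, not the engine's real modules):

// main.js - the worker process (illustrative)
const { fork } = require('node:child_process');

// A (very small) pool of long-lived child processes
const pool = [fork('./child.js'), fork('./child.js')];

// Hand a run to a child over IPC and listen for its events
const child = pool[0];
child.send({ type: 'run', plan: { id: 'run-1' } });
child.on('message', (evt) => console.log('event from child:', evt.type));

// child.js - each run executes inside a fresh worker thread
// const { Worker } = require('node:worker_threads');
// process.on('message', ({ plan }) => {
//   const thread = new Worker('./run.js', { workerData: plan });
//   thread.on('message', (evt) => process.send(evt)); // forward events up the chain
// });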

Basically what we're doing is:

  • Every event that's about to leave the worker thread gets tested
  • We stringify it and guess the size
  • If we think it's too big, we trim/redact the content before sending it out

So large payloads never leave the worker thread, and so don't get processed downstream. This is definitely how I should have done it in the first place. As you'll see in the TODOs, maybe we should be doing something very different (maybe even using a local socket for data streaming, and not using IPC at all?).
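
In rough JavaScript, the check looks something like this (a sketch only - the limit value and function names are made up here, not the engine's actual API):

const MAX_PAYLOAD_SIZE_MB = 10; // assumed limit; the real value is configurable
const MAX_BYTES = MAX_PAYLOAD_SIZE_MB * 1024 * 1024;

// Called for every event just before it leaves the worker thread
function ensurePayloadSize(event) {
  // Stringifying only estimates the IPC cost, but it's cheap and close enough
  const size = Buffer.byteLength(JSON.stringify(event), 'utf8');
  if (size <= MAX_BYTES) return event;

  // Too big: trim/redact the heavy content but keep the event structurally intact
  return {
    ...event,
    state: { data: '[REDACTED: payload exceeded size limit]' },
    redacted: true,
  };
}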

Also TODO: I'd like a better way to process these IPC events. There must be a better way to handle large JSON objects. We certainly don't need to keep serialising/deserialising JSON between all the processes - I'd love to add a more efficient encoding in the worker thread and decode it in the main process. Maybe everything gets dumped into an array buffer, which needs minimal parsing. We might still need to make the event name available - but actually, between the threads, I don't think we do. We should be able to encode it all and do all our analysis in the main thread. I've raised a new issue for this.
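
Purely as a sketch of the kind of encoding I mean (not a design for the new issue), an event could be encoded once in the worker thread and the underlying buffer transferred, so the main process only pays for a parse if it actually needs one:

const { parentPort } = require('node:worker_threads');

// Inside the worker thread: encode the whole event to bytes once...
const event = { type: 'step-complete', state: { data: '...' } };
const bytes = new TextEncoder().encode(JSON.stringify(event));

// ...and transfer (not clone) the buffer to the parent
parentPort.postMessage(bytes, [bytes.buffer]);

// On the receiving side, decoding is deferred until it's actually needed:
// const event = JSON.parse(new TextDecoder().decode(bytes));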

QA

Here's a job to test this in the app:

fn((state) => {
  const limit = state.limit || 20;
  state.data = [];
  // build `limit` strings of ~1MB each, so the default blows the payload limit
  for (let i = 0; i < limit; i++) {
    state.data.push(new Array(1024 * 1024).fill('a').join(''));
  }
  // log it if you like - it SHOULD be redacted
  // console.log(state);
  return state;
})

This should create a state object which blows the payload limit and so does not get sent back to Lightning.

Now, in production, returning too many large objects from the job can trigger an OOM exception. This is a little hard to reproduce because you need to create something big enough to blow up the worker, but small enough not to be OOM-killed by the engine.

Locally, starting the worker like this will trigger an OOM on main, but is stable on this branch:

NODE_OPTIONS='--max-old-space-size=100' pnpm start

That gives the worker 100MB of memory. On main a 20MB payload is enough to kill it, but it's fine here because a 20MB payload won't be processed in the main thread at all.

AI Usage

Please disclose how you've used AI in this work (it's cool, we just want to know!):

  • Code generation (Copilot, but not IntelliSense)
  • Learning or fact checking
  • Strategy / design
  • Optimisation / refactoring
  • Translation / spellchecking / doc gen
  • Other
  • I have not used AI

You can read more details in our Responsible AI Policy

@@ -0,0 +1,50 @@
export const REDACTED_STATE = {
@josephjclark (Collaborator, Author) commented:

Part 1: this file has been moved from the worker to the engine, and lightly modified.

Note that it selectively validates payload objects so that it can generate an appropriate "fix".
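
Roughly, the idea is something like this (a hypothetical sketch of the approach, not the file's actual contents):

export const REDACTED_STATE = {
  data: '[REDACTED: state.data exceeded the payload size limit]',
};

// Check the individual payload keys rather than the whole event, so only the
// oversized parts get replaced and the event stays structurally valid downstream
export function fixPayload(event, maxBytes) {
  const fixed = { ...event };
  for (const key of ['state', 'log']) {
    const value = fixed[key];
    if (value && Buffer.byteLength(JSON.stringify(value), 'utf8') > maxBytes) {
      fixed[key] = REDACTED_STATE;
    }
  }
  return fixed;
}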

@josephjclark (Collaborator, Author) commented:

I've opened #891 to handle more efficient serialisation and comms between processes. I don't want that to block release, though.

@josephjclark (Collaborator, Author) commented Mar 4, 2025

Disappointingly, if I run the worker with drastically reduced memory:

 NODE_OPTIONS='--max-old-space-size=100' pnpm start

I still get an OOM blow-up after the run has finished.

Why? The large object should not be loaded into the main worker environment at all. What is requiring all that memory?

A 20MB payload will kill a 100MB worker. This is disappointing.

To be clear, small payloads do not blow up the worker. There is enough memory for basic functionality, but not to process large JSON strings.

It still blows up if I disconnect log events and force step-complete to work with an empty object.

The exception comes out of JSON parsing / string decoding on the worker side, so clearly something is still coming through.

Or is it that something is dying inside the child process? Did the child process get OOM-killed? Or does it share memory with the parent?

Detaching the forked child process doesn't help (surely this would give it independent memory?).

@doc-han (Collaborator) commented Mar 4, 2025

Why? The large object should not be loaded into the main worker environment at all. What is requiring all that memory?

Exactly what I was thinking.

@josephjclark (Collaborator, Author) commented:

This morning's revelation: remember that the engine is sitting in the main process too. It's not the worker blowing up, it's the engine.

But! The same principle applies. The inner worker thread should not be sending large payloads out to the engine. I can see here that the engine emits the final workflow-complete event with the redacted state object.

Am I looking at something unrelated? Is 120MB just not enough for basic running of the worker and engine? I don't think so. If I bump both the worker memory AND the payload size up, we get the same blow-up.

@josephjclark (Collaborator, Author) commented:

Got it: the engine is publishing the full, non-redacted state on the internal "resolve_task" event. That's what's causing the blow-up now. If I fix/redact that, I should be stable.
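
Something along these lines (a hedged sketch - resolve_task is the engine's internal event, but the handler and helper names below are illustrative, not the engine's actual API):

// Run the result through the same size check before the engine publishes it
// internally, so the full state never crosses into the main process
events.on('resolve_task', (result) => {
  const safe = ensurePayloadSize({ type: 'resolve_task', state: result });
  resolveTask(safe.state); // the redacted state is what gets published internally
});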

If the result is large, it might trigger OOM while being returned to the main process
@josephjclark josephjclark marked this pull request as ready for review March 5, 2025 12:08
@josephjclark josephjclark merged commit ccc16ff into main Mar 6, 2025
11 checks passed
@josephjclark josephjclark deleted the engine-limit branch March 6, 2025 17:35