
Engine: Limit payload sizes of events #890

Merged
merged 17 commits into main from engine-limit on Mar 6, 2025

Conversation

@josephjclark (Collaborator) commented Mar 4, 2025

Short Description

This PR fixes an issue where large payloads, when passed from the child process to the main worker, can trigger an OOM exception that blows up the main worker thread.

Fixes #888

Implementation Details

This may be a temporary fix, I'm not sure. This area of the engine is so hard to debug that it's tricky to make progress.

Recall that the architecture looks like this:

  • The worker sits in a process of its own
  • It creates a pool of child processes
  • Each run then spins up a worker thread to execute inside
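
For orientation, here's a minimal sketch of that layout using Node's child_process and worker_threads APIs (the file names and structure here are illustrative, not the engine's real modules):

// main.js - the worker process (illustrative)
const { fork } = require('node:child_process');

// A (very small) pool of long-lived child processes
const pool = [fork('./child.js'), fork('./child.js')];

// Hand a run to a child over IPC and listen for its events
const child = pool[0];
child.send({ type: 'run', plan: { id: 'run-1' } });
child.on('message', (evt) => console.log('event from child:', evt.type));

// child.js - each run executes inside a fresh worker thread
// const { Worker } = require('node:worker_threads');
// process.on('message', ({ plan }) => {
//   const thread = new Worker('./run.js', { workerData: plan });
//   thread.on('message', (evt) => process.send(evt)); // forward events up the chain
// });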

Basically what we're doing is:

  • Every event that's about to leave the worker thread gets tested
  • We stringify it and guess the size
  • If we think it's too big, we trim/redact the content before sending it out

So large payloads never leave the worker thread, and so don't get processed downstream. This is definitely how I should have done it in the first place. As you'll see in the TODOs, maybe we should be doing something very different (maybe even using a local socket for data streaming, and not using IPC at all?).
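
In rough JavaScript, the check looks something like this (a sketch only - the limit value and function names are made up here, not the engine's actual API):

const MAX_PAYLOAD_SIZE_MB = 10; // assumed limit; the real value is configurable
const MAX_BYTES = MAX_PAYLOAD_SIZE_MB * 1024 * 1024;

// Called for every event just before it leaves the worker thread
function ensurePayloadSize(event) {
  // Stringifying only estimates the IPC cost, but it's cheap and close enough
  const size = Buffer.byteLength(JSON.stringify(event), 'utf8');
  if (size <= MAX_BYTES) return event;

  // Too big: trim/redact the heavy content but keep the event structurally intact
  return {
    ...event,
    state: { data: '[REDACTED: payload exceeded size limit]' },
    redacted: true,
  };
}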

Also TODO: I'd like a better way to process these IPC events. There must be a better way to handle large JSON objects. We certainly don't need to keep serialising/deserialising JSON between all the processes - I'd love to add a more efficient encoding in the worker thread and decode it in the main process. Maybe everything gets dumped into an array buffer, which needs minimal parsing. We might still need to make the event name available - but actually, between the threads, I don't think we do. We should be able to encode it all and do all our analysis in the main thread. I've raised a new issue for this.
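
Purely as a sketch of the kind of encoding I mean (not a design for the new issue), an event could be encoded once in the worker thread and the underlying buffer transferred, so the main process only pays for a parse if it actually needs one:

const { parentPort } = require('node:worker_threads');

// Inside the worker thread: encode the whole event to bytes once...
const event = { type: 'step-complete', state: { data: '...' } };
const bytes = new TextEncoder().encode(JSON.stringify(event));

// ...and transfer (not clone) the buffer to the parent
parentPort.postMessage(bytes, [bytes.buffer]);

// On the receiving side, decoding is deferred until it's actually needed:
// const event = JSON.parse(new TextDecoder().decode(bytes));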

QA

Here's a job to test this in the app:

fn((state) => {
  const limit = state.limit || 20;
  state.data = [];
  // build `limit` strings of ~1MB each, so the default blows the payload limit
  for (let i = 0; i < limit; i++) {
    state.data.push(new Array(1024 * 1024).fill('a').join(''));
  }
  // log it if you like - it SHOULD be redacted
  // console.log(state);
  return state;
})

This should create a state object which blows the payload limit and so does not get sent back to Lightning.

Now, in production, returning too many large objects from the job can trigger an OOM exception. This is a little hard to reproduce because you need to create something big enough to blow up the worker, but small enough not to be OOM-killed by the engine.

Locally, starting the worker like this will trigger an OOM on main, but is stable on this branch:

NODE_OPTIONS='--max-old-space-size=100' pnpm start

That gives the worker 100MB of memory. On main a 20MB payload is enough to kill it, but it's fine here because a 20MB payload won't be processed in the main thread at all.

AI Usage

Please disclose how you've used AI in this work (it's cool, we just want to know!):

  • Code generation (Copilot, but not IntelliSense)
  • Learning or fact checking
  • Strategy / design
  • Optimisation / refactoring
  • Translation / spellchecking / doc gen
  • Other
  • I have not used AI

You can read more details in our Responsible AI Policy

@@ -0,0 +1,50 @@
export const REDACTED_STATE = {
@josephjclark (Collaborator, Author) commented:

Part 1: this file has been moved from the worker to the engine, and lightly modified.

Note that it selectively validates payload objects so that it can generate an appropriate "fix".
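
Roughly, the idea is something like this (a hypothetical sketch of the approach, not the file's actual contents):

export const REDACTED_STATE = {
  data: '[REDACTED: state.data exceeded the payload size limit]',
};

// Check the individual payload keys rather than the whole event, so only the
// oversized parts get replaced and the event stays structurally valid downstream
export function fixPayload(event, maxBytes) {
  const fixed = { ...event };
  for (const key of ['state', 'log']) {
    const value = fixed[key];
    if (value && Buffer.byteLength(JSON.stringify(value), 'utf8') > maxBytes) {
      fixed[key] = REDACTED_STATE;
    }
  }
  return fixed;
}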

@josephjclark (Collaborator, Author) commented:

I've opened #891 to handle more efficient serialisation and comms between processes. I don't want that to block release, though.

@josephjclark (Collaborator, Author) commented Mar 4, 2025

Disappointingly, if I run the worker with drastically reduced memory:

 NODE_OPTIONS='--max-old-space-size=100' pnpm start

I still get an OOM blow-up after the run has finished.

Why? The large object should not be loaded into the main worker environment at all. What is requiring all that memory?

A 20MB payload will kill a 100MB worker. This is disappointing.

To be clear, small payloads do not blow up the worker. There is enough memory for basic functionality, but not to process large JSON strings.

It still blows up if I disconnect log events and force step-complete to work with an empty object.

The exception comes out of JSON parsing / string decoding on the worker side, so clearly something is still coming through.

Or is it that something is dying inside the child process? Did the child process get OOM-killed? Or does it share memory with the parent?

Detaching the forked child process doesn't help (surely this would give it independent memory?).

@doc-han (Collaborator) commented Mar 4, 2025

Why? The large object should not be loaded into the main worker environment at all. What is requiring all that memory?

Exactly what I was thinking.

@josephjclark (Collaborator, Author) commented:

This morning's revelation: remember that the engine is sitting in the main process too. It's not the worker blowing up, it's the engine.

But! The same principle applies. The inner worker thread should not be sending large payloads out to the engine. I can see here that the engine emits the final workflow-complete event with the redacted state object.

Am I looking at something unrelated? Is 120MB just not enough for basic running of the worker and engine? I don't think so. If I bump both the worker memory AND the payload size up, we get the same blow-up.

@josephjclark (Collaborator, Author) commented:

Got it: the engine is publishing the full, non-redacted state on the internal "resolve_task" event. That's what's causing the blow-up now. If I fix/redact that, I should be stable.
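
Something along these lines (a hedged sketch - resolve_task is the engine's internal event, but the handler and helper names below are illustrative, not the engine's actual API):

// Run the result through the same size check before the engine publishes it
// internally, so the full state never crosses into the main process
events.on('resolve_task', (result) => {
  const safe = ensurePayloadSize({ type: 'resolve_task', state: result });
  resolveTask(safe.state); // the redacted state is what gets published internally
});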

If the result is large, it might trigger OOM while being returned to the main process
@josephjclark josephjclark marked this pull request as ready for review March 5, 2025 12:08
@josephjclark josephjclark merged commit ccc16ff into main Mar 6, 2025
11 checks passed
@josephjclark josephjclark deleted the engine-limit branch March 6, 2025 17:35