Engine: Limit payload sizes of events #890
Conversation
@@ -0,0 +1,50 @@
export const REDACTED_STATE = {
This is part 1: this file has been moved from the worker to the engine, and lightly modified.
Note that it selectively validates payload objects so that it can generate an appropriate "fix".
I've opened #891 to handle more efficient serialisation and comms between processes. I don't want that to block release, though.
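To make the intent concrete, here's roughly the kind of check I mean. This is a sketch only: the event names, default limit and redacted shapes are illustrative, not the actual engine code.

```js
// Sketch, not the real implementation: the idea is that the check knows which
// event it's validating, so it can substitute a sensible "fix" for an oversized payload.
const REDACTED_STATE = { data: { '@openfn': 'redacted' } }; // shape is a guess
const REDACTED_LOG = { message: ['[REDACTED: log exceeded payload limit]'] };

const withinLimit = (payload, limitMb) =>
  Buffer.byteLength(JSON.stringify(payload), 'utf8') <= limitMb * 1024 * 1024;

export const verify = (eventName, payload, limitMb = 10) => {
  if (withinLimit(payload, limitMb)) return payload;

  // Pick a replacement appropriate to the event being emitted
  switch (eventName) {
    case 'workflow-complete':
    case 'step-complete':
      return { ...payload, state: REDACTED_STATE };
    case 'workflow-log':
      return { ...payload, ...REDACTED_LOG };
    default:
      return REDACTED_STATE;
  }
};
```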
Disappointingly, if I run the worker with drastically reduced memory:
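Concretely, something like this - the exact start script and number aren't the point, just the small Node heap:

```sh
# cap the worker's old-space heap (value illustrative)
NODE_OPTIONS="--max-old-space-size=120" pnpm start
```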
I still get an OOM blow-up after the run has finished. Why? The large object should not be loaded into the main worker environment at all. What is requiring all that memory? A 20mb payload will kill a 100mb worker. This is disappointing.

To be clear, small payloads do not blow up the worker. There is enough memory for basic functionality, but not to process large JSON strings. It still blows up if I disconnect log events and force step-complete to work with an empty object.

The exception comes out of JSON parsing / string decoding on the worker side, so clearly something is still coming through. Or is it that something is dying inside the child process? Did the child process get OOM-killed? Or does it share memory with the parent? Detaching the forked child process doesn't help (surely this would give it independent memory).
Exactly what I was thinking.
This morning's revelation: remember that the engine is sitting in the main process too. It's not the worker blowing up, it's the engine. But! The same principle applies. The inner worker thread should not be sending large payloads out to the engine. I can see here that the engine emits the final workflow-complete event with the redacted state object.

Am I looking at something unrelated? Is 120mb just not enough for basic running of the worker and engine? I don't think so. If I bump the worker memory AND payload size up, we get the same blow-up.
Got it: the engine is publishing the full, non-redacted state on the internal "resolve_task" event. That's what's causing the blow-up now. If I fix/redact that, I should be stable.
If the result is large, it might trigger OOM while being returned to the main process
Short Description
This PR fixes an issue where large payloads can cause an OOM exception while being passed from the child process to the main worker, causing the main worker thread to blow up.
Fixes #888
Implementation Details
This may be a temporary fix, I'm not sure. This area of the engine is so hard to debug that it's tricky to make progress.
Recall that the architecture is like:
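Roughly (reconstructed here; the labels are approximate but the nesting is the important bit):

```
Lightning
  └─ worker (main process)
       └─ engine (sits in the same main process)
            └─ forked child process
                 └─ worker thread (actually runs the job)
```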
Basically what we're doing is checking the serialised size of each event payload inside the worker thread and, if it's over the limit, swapping in the redacted placeholder before the event is sent over IPC (sketched below).
So large payloads never leave the worker thread, and so don't get processed downstream. This is definitely how I should have done it in the first place. As you'll see in the TODOs, maybe we should be doing something very different (maybe even using a local socket for data streaming? Not use IPC at all?)
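For illustration, a minimal sketch of where the check now sits. The names, import path and `verify` helper are placeholders (the same hypothetical check sketched earlier in this thread), not the engine's real API:

```js
// Inside the worker thread: validate before anything crosses the IPC boundary.
import { parentPort } from 'node:worker_threads';
import { verify } from './redact'; // illustrative path for the hypothetical size check

const publish = (type, payload) => {
  // Oversized payloads are swapped for a redacted placeholder *here*,
  // so the original object never leaves this thread
  parentPort?.postMessage({ type, payload: verify(type, payload) });
};

// e.g. at the end of a run:
// publish('step-complete', state);
```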
Also TODO: I'd like a better way to process these IPC events. There must be a better way to handle large JSON objects. We certainly don't need to keep serialising/deserialising JSON between all the processes - I'd love to add a more efficient encoding in the worker thread and decode it in the main process. Maybe everything gets dumped into an array buffer, which needs minimal parsing. We might still need to make the event name available - but actually, between the threads, I don't really think so. We should be able to encode it all and do all our analysis in the main thread. I've raised a new issue for this.
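To be clear about what I'm imagining (this is the TODO, not what this PR does, and the shape below is just one possibility, using Node's built-in v8 serialiser):

```js
// Speculative sketch: serialise once in the worker thread, decode once in the
// main process, and skip the repeated JSON stringify/parse in between.
import { parentPort } from 'node:worker_threads';
import * as v8 from 'node:v8';

// Worker thread side: structured-clone the event into a single Buffer
const send = (type, payload) => {
  parentPort?.postMessage(v8.serialize({ type, payload }));
};

// Main process side: decode the bytes and route on the event type
const receive = (buffer) => {
  const { type, payload } = v8.deserialize(buffer);
  return { type, payload };
};
```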
QA
Here's a job to test this in the app:
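Something along these lines (the sizes are arbitrary; anything comfortably over the payload limit will do):

```js
// Builds ~20mb of junk data on state - enough to trip the payload limit
fn((state) => {
  state.data = new Array(20).fill('x'.repeat(1024 * 1024));
  return state;
});
```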
This should create a state object which blows the payload limit and does not get sent back to Lightning.
Now, in production, returning too many large objects from the job can trigger an OOM exception. This is a little hard to reproduce because you need to create something big enough to blow up the worker, but small enough to not be OOM killed by the engine.
Locally, starting the worker like this will trigger an OOM on main, but be stable here:
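i.e. something like this (the heap cap is the bit that matters; the start command may differ in your setup):

```sh
NODE_OPTIONS="--max-old-space-size=100" pnpm start
```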
That gives the worker 100mb of memory. On main a 20mb payload is enough to kill it, but it's fine here because a 20mb payload won't be processed in the main thread at all.
AI Usage
Please disclose how you've used AI in this work (it's cool, we just want to know!):
You can read more details in our Responsible AI Policy