feat: Add encoding option for binary output (#20) #56

Draft
ide-agent wants to merge 1 commit into main from worktree-issue-20-encoding-buffer

@ide-agent ide-agent commented May 21, 2026

Why

spawnAsync always decodes child output with toString('utf8'), which corrupts binary output, e.g. pandoc writing a .docx to stdout. (Closes #20.)
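
To make the corruption concrete, here is a minimal sketch (not code from this PR) that round-trips a few bytes through a UTF-8 string. The byte values are chosen for illustration: the first four are the ZIP signature that a .docx file begins with, and the last two are not valid UTF-8.

```typescript
import { Buffer } from 'node:buffer';

// Illustrative bytes: the ZIP local-file-header signature ("PK\x03\x04"),
// followed by two bytes that are not valid UTF-8.
const original = Uint8Array.from([0x50, 0x4b, 0x03, 0x04, 0xff, 0xfe]);

// Decoding as UTF-8 turns each invalid byte into U+FFFD,
// the replacement character.
const decoded = Buffer.from(original).toString('utf8');

// Re-encoding the string cannot recover the original bytes:
// each U+FFFD encodes to three bytes (EF BF BD).
const roundTripped = Uint8Array.from(Buffer.from(decoded, 'utf8'));

console.log(original.length, roundTripped.length); // 6 10
```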

How

Add an encoding option to SpawnOptions. encoding: 'buffer' returns stdout / stderr / output as Uint8Array; the default 'utf8' and other BufferEncoding values are unchanged. SpawnResult<T = string> is now parameterized on the stdio element type and overloads select between the string and buffer forms; existing SpawnResult references resolve to SpawnResult<string> with no source changes.
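
A rough sketch of how overloads could select between the two result forms. The types are simplified and the helper name `buildResult` is hypothetical; the real `SpawnOptions` / `SpawnResult` in src/spawnAsync.ts carry more fields, and the real overloads sit on `spawnAsync` itself.

```typescript
import { Buffer } from 'node:buffer';

// Simplified, assumed shapes for illustration only.
type TextEncoding = 'utf8' | 'latin1' | 'hex';

interface SpawnResult<T = string> {
  stdout: T;
  stderr: T;
  output: [T, T];
}

// Overloads: 'buffer' selects the Uint8Array form, anything else the string form.
function buildResult(out: Uint8Array, err: Uint8Array, encoding: 'buffer'): SpawnResult<Uint8Array>;
function buildResult(out: Uint8Array, err: Uint8Array, encoding?: TextEncoding): SpawnResult;
function buildResult(
  out: Uint8Array,
  err: Uint8Array,
  encoding: 'buffer' | TextEncoding = 'utf8'
): SpawnResult<Uint8Array> | SpawnResult<string> {
  if (encoding === 'buffer') {
    // Bytes are passed through untouched.
    return { stdout: out, stderr: err, output: [out, err] };
  }
  const stdout = Buffer.from(out).toString(encoding);
  const stderr = Buffer.from(err).toString(encoding);
  return { stdout, stderr, output: [stdout, stderr] };
}

const bytes = Uint8Array.from([0x68, 0x69]); // "hi"
const text = buildResult(bytes, new Uint8Array(0)); // typed SpawnResult<string>
const raw = buildResult(bytes, new Uint8Array(0), 'buffer'); // typed SpawnResult<Uint8Array>
```

Because `SpawnResult` defaults `T` to `string`, an existing annotation such as `const r: SpawnResult = ...` keeps compiling against the string form.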

The result is built once when the process exits, freeing the intermediate chunks immediately instead of retaining them behind lazy getters for the result's lifetime. Exceeding the buffer cap rejects in all cases: an explicit maxBuffer already rejected, and the default cap previously resolved then threw lazily on property access.
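
The "build once at exit" idea can be sketched roughly as follows. `ChunkCollector` is a hypothetical helper, not the PR's implementation, and it assumes the cap is counted in bytes.

```typescript
import { Buffer } from 'node:buffer';

// Hypothetical helper: collects raw chunks, enforces a byte cap,
// and concatenates exactly once.
class ChunkCollector {
  private chunks: Uint8Array[] = [];
  private byteLength = 0;
  private maxBuffer: number;
  exceeded = false;

  constructor(maxBuffer: number) {
    this.maxBuffer = maxBuffer;
  }

  // Called for each 'data' event; keeps only the bytes that fit under the cap.
  push(chunk: Uint8Array): void {
    const room = this.maxBuffer - this.byteLength;
    if (chunk.byteLength > room) {
      this.exceeded = true;
      chunk = chunk.subarray(0, Math.max(room, 0));
    }
    if (chunk.byteLength > 0) {
      this.chunks.push(chunk);
      this.byteLength += chunk.byteLength;
    }
  }

  // Called once when the process exits: a single concat,
  // after which the chunk list is released.
  finish(): Uint8Array {
    const joined = Buffer.concat(this.chunks, this.byteLength);
    this.chunks = [];
    return joined;
  }
}

const collector = new ChunkCollector(4);
collector.push(Uint8Array.from([1, 2, 3]));
collector.push(Uint8Array.from([4, 5, 6]));
const stdout = collector.finish();
// stdout holds bytes 1..4 and collector.exceeded is true, so a caller
// would reject with the truncated output attached to the error.
```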

A maxBuffer larger than the encoding's runtime hard limit (MAX_STRING_LENGTH for text, MAX_LENGTH for 'buffer') now throws TypeError synchronously at the call site. Previously it was silently clamped, which led to confusing rejection messages later.
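
A minimal sketch of such an up-front check, using the limits Node exposes via `buffer.constants`. The function name and error message are made up for illustration and are not taken from the PR.

```typescript
import { constants } from 'node:buffer';

// Hypothetical validator: throws synchronously instead of clamping.
function assertMaxBufferWithinLimit(maxBuffer: number, encoding: string): void {
  const hardLimit =
    encoding === 'buffer' ? constants.MAX_LENGTH : constants.MAX_STRING_LENGTH;
  if (maxBuffer > hardLimit) {
    throw new TypeError(
      `maxBuffer (${maxBuffer}) exceeds the runtime limit (${hardLimit}) for encoding "${encoding}"`
    );
  }
}

// A value exactly equal to the limit is accepted.
assertMaxBufferWithinLimit(constants.MAX_STRING_LENGTH, 'utf8');

// One past the limit throws at the call site.
let thrown: unknown;
try {
  assertMaxBufferWithinLimit(constants.MAX_STRING_LENGTH + 1, 'utf8');
} catch (error) {
  thrown = error;
}
```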

Behavior changes from 1.8.0

| scenario | 1.8.0 | this PR |
| --- | --- | --- |
| Process exits 0, output under cap | resolves with full output | same |
| Process exits non-zero (any output size) | rejects; truncated stdio attached to the error | same |
| Explicit maxBuffer exceeded | rejects with ERR_CHILD_PROCESS_STDIO_MAXBUFFER; truncated stdio attached | same |
| Default cap (MAX_STRING_LENGTH) exceeded | resolves; reading result.stdout / result.stderr lazily throws ERR_CHILD_PROCESS_STDIO_MAXBUFFER | rejects with the same code; truncated stdio attached |
| maxBuffer larger than the encoding's hard limit | silently clamped via Math.min | synchronous TypeError at the call site |
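
For callers, the practical upshot of the table is that a cap overrun always arrives as a rejection. Below is a sketch of how calling code might recognize it. The error shape is an assumption based on the table (a `code` property plus truncated `stdout` / `stderr`), and the simulated error stands in for a real rejection.

```typescript
// Assumed error shape; field names are illustrative, not confirmed by the PR.
interface SpawnErrorLike extends Error {
  code?: string;
  stdout?: string | Uint8Array;
  stderr?: string | Uint8Array;
}

// Type guard for the cap-exceeded rejection described in the table.
function isMaxBufferError(error: unknown): error is SpawnErrorLike {
  return (
    error instanceof Error &&
    (error as SpawnErrorLike).code === 'ERR_CHILD_PROCESS_STDIO_MAXBUFFER'
  );
}

// Stand-in for what a rejected spawnAsync promise might carry.
const simulated: SpawnErrorLike = Object.assign(new Error('stdout maxBuffer length exceeded'), {
  code: 'ERR_CHILD_PROCESS_STDIO_MAXBUFFER',
  stdout: 'partial out',
});

if (isMaxBufferError(simulated)) {
  // The truncated output is still available for logging or diagnostics.
  console.warn('output exceeded maxBuffer; truncated stdout:', simulated.stdout);
}
```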

Test Plan

Unit tests; the existing suite passes unchanged. New tests:

  • returns stdout/stderr as Uint8Array under encoding: 'buffer'
  • survives a non-UTF-8 byte sequence: bytes preserved, not replaced
  • populates output as [stdout, stderr] of Uint8Array in buffer mode
  • attaches bytes to the error on non-zero exit, like string stdout
  • enforces maxBuffer under encoding: 'buffer'
  • decodes stdout with latin1, and with hex
  • rejects suggesting the string-length limit when text default cap is exceeded
  • rejects suggesting a larger maxBuffer when bytes default cap is exceeded
  • throws TypeError synchronously when maxBuffer exceeds the encoding's hard limit; accepts a maxBuffer exactly equal to it

@ide-agent ide-agent requested a review from ide May 21, 2026 23:08
@ide-agent ide-agent force-pushed the worktree-issue-20-encoding-buffer branch 2 times, most recently from 3add330 to 7af92ba Compare May 21, 2026 23:48
@ide ide requested a review from kitten May 21, 2026 23:52

@ide ide left a comment

Member

@kitten I worked on this API to add binary support and there are a couple decisions I'd like to get your thoughts on.

stdout: string + stdoutBytes: Uint8Array vs. stdout: string | Uint8Array. I opted to change which fields get defined in the output based on the specified encoding, rather than change the type of the stdout and stderr fields. This is because it makes the caller's code clearer and more greppable for a reader to see whether a string or byte array is being passed around.

Rejecting the promise vs. throwing when lazily accessing stdout/stderr. I removed the lazy accessors to reduce memory usage in common cases. Peak memory usage is still 2x (chunks + the final output), but we free the chunks immediately afterward. On the other hand, it could be annoying to have spawnAsync reject just because of the amount of output, when the maxBuffer parameter is really intended as a safeguard.

Comment thread src/spawnAsync.ts Outdated
Comment thread src/spawnAsync.ts Outdated
kitten commented May 22, 2026

Member

@ide: I think I'd still prefer the overload, to be honest. From a clarity standpoint, we wouldn't expect encoding and the result types to be too far away from each other in most cases. The typing is part of the clarity here, and Node does this quite frequently too.

If we assume that the most common case is the string output case, then I think that's basically acceptable, and aligning with Node would be (imo) preferable over the difference in calls. (I'd say, if we do split them, it'd almost be worth separating this into a separate export entirely, but if we have an overload, I'd reuse the property names.)
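
For reference, this is the Node convention in question, shown with `child_process.spawnSync`: the same `stdout` property is a `Buffer` or a `string` depending on `encoding`. The child script is only an illustration and emits two bytes that are not valid UTF-8.

```typescript
import { spawnSync } from 'node:child_process';
import { execPath } from 'node:process';

// Illustrative child: writes the raw bytes FF FE to stdout.
const script = 'process.stdout.write(Buffer.from([0xff, 0xfe]))';

// encoding: 'buffer' -> stdout is typed and returned as a Buffer.
const raw = spawnSync(execPath, ['-e', script], { encoding: 'buffer' });

// A string encoding -> the same property name, now a string.
const text = spawnSync(execPath, ['-e', script], { encoding: 'latin1' });

console.log(raw.stdout, JSON.stringify(text.stdout));
```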

> On the other hand, it could be annoying to have spawnAsync reject just because of the amount of output, when the maxBuffer parameter is really intended as a safeguard.

The reason I added the lazy rejection is basically for safe-guarding old calls too (not quite backwards compatibility but in the same spirit). The main motivation was to ensure that:

  • we only hold on to the raw chunks in memory
  • we concat once, to avoid lots of GC work (the string concat was implicitly pretty expensive, if you consider GC too)
  • we sometimes don't even use stdout or stderr, so can clearly let the buffers be freed

I think V8 has special handling of ArrayBuffer memory, so the main thing I wanted to ensure was that the concat is done once (or not at all, when it's not required), and that we don't convert to a string too early, so that the small per-chunk strings are never allocated. (Likely more predictable in terms of GC load.)

It's possible that in the encoding: 'buffer' mode, small concats are fine, but I didn't immediately test the difference. It's possible we'd want to do something cleverer for that case and maybe even eagerly concat. I haven't benchmarked this though, so it's just a suspicion that different optimisations would apply in that case

tl;dr: I wasn't too concerned about total memory usage (peak RSS) since it'd be very temporary, but about overall memory pressure with small string allocations, which wouldn't apply to encoding: 'buffer'
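
The lazy approach described in this comment can be sketched roughly like this. It is an illustration of the idea, not the 1.8.0 code: `lazyTextResult` is a hypothetical name, and the example chunks are made up.

```typescript
import { Buffer } from 'node:buffer';

// Hypothetical sketch: hold only raw chunks, and concat + decode once,
// on first access. If stdout is never read, no string is ever built.
function lazyTextResult(chunks: Uint8Array[]) {
  let pending: Uint8Array[] | null = chunks;
  let cached: string | undefined;
  return {
    get stdout(): string {
      if (cached === undefined) {
        // One concat and one decode; no per-chunk strings are allocated.
        cached = Buffer.concat(pending ?? []).toString('utf8');
        pending = null; // the raw chunks can now be garbage-collected
      }
      return cached;
    },
  };
}

// "héllo" split so that the two bytes of "é" land in different chunks.
// Decoding after the concat keeps the character intact.
const result = lazyTextResult([
  Uint8Array.from([0x68, 0xc3]),
  Uint8Array.from([0xa9, 0x6c, 0x6c, 0x6f]),
]);
console.log(result.stdout);
```

The trade-off discussed in this thread follows from the sketch: the chunks stay alive for as long as the result does unless `stdout` is read, whereas building the result at exit frees them right away.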

ide commented May 22, 2026

Member

Types: thanks for being a sounding board. Let's go with the option that follows Node's convention. I don't think it's as good of an API (IMO plain greppability with fewer overloads is valuable) but this is also not necessarily a place we want to spend our "creativity budget".

On the point about concatenation and GC costs: this PR addresses this concern by concatenating only when the child process completes. It's not as lazy as a getter, but it also doesn't build up a string until the very end.

I'm not so worried about backwards compatibility because Node would have crashed if spawnAsync read in over 512MB of text and I can't imagine anyone relying on that behavior. To me, the main question is the API ergonomics when maxBuffer is a lower number that the developer expects to cross.

@ide-agent ide-agent force-pushed the worktree-issue-20-encoding-buffer branch from 7af92ba to 796a617 Compare May 22, 2026 23:26
@ide ide self-requested a review May 22, 2026 23:27
@ide-agent ide-agent force-pushed the worktree-issue-20-encoding-buffer branch 5 times, most recently from c9e012f to 71897db Compare May 23, 2026 06:25
Why
===
`spawnAsync` always decodes child output with `toString('utf8')`, which
corrupts binary output, e.g. `pandoc` writing a `.docx` to stdout.
(Closes #20.)

How
===
Add an `encoding` option to `SpawnOptions`. `encoding: 'buffer'` returns
`stdout` / `stderr` / `output` as `Uint8Array`; the default `'utf8'` and
other `BufferEncoding` values are unchanged. `SpawnResult<T = string>`
is now parameterized on the stdio element type and overloads select
between the string and buffer forms; existing `SpawnResult` references
resolve to `SpawnResult<string>` with no source changes.

The result is built once when the process exits, freeing the
intermediate chunks immediately instead of retaining them behind lazy
getters for the result's lifetime. Exceeding the buffer cap rejects in
all cases: an explicit `maxBuffer` already rejected, and the default
cap previously resolved then threw lazily on property access.

A `maxBuffer` larger than the encoding's runtime hard limit
(`MAX_STRING_LENGTH` for text, `MAX_LENGTH` for `'buffer'`) now throws
`TypeError` synchronously at the call site. Previously it was silently
clamped, which led to confusing rejection messages later.

Test Plan
=========
Unit tests; the existing suite passes unchanged. New tests:
- returns stdout/stderr as Uint8Array under encoding: 'buffer'
- survives a non-UTF-8 byte sequence: bytes preserved, not replaced
- populates output as [stdout, stderr] of Uint8Array in buffer mode
- attaches bytes to the error on non-zero exit, like string stdout
- enforces maxBuffer under encoding: 'buffer'
- decodes stdout with latin1, and with hex
- rejects suggesting the string-length / byte-array limit when the
  default cap is exceeded
- throws TypeError synchronously when maxBuffer exceeds the encoding's
  hard limit; accepts a maxBuffer exactly equal to it
@ide-agent ide-agent force-pushed the worktree-issue-20-encoding-buffer branch from 71897db to 56b6ebf Compare May 23, 2026 06:26
ide commented May 23, 2026

Member

@kitten three key behaviors are now implemented; could you sanity-check them?

  • stdout/stderr have polymorphic types, specifically they can be strings or buffers depending on the chosen encoding
  • the default maxBuffer is MAX_STRING_LENGTH even for the "buffer" encoding (because MAX_LENGTH for bytes is MAX_SAFE_INTEGER, not very useful)
  • the promise rejects after the child process exits if stdout or stderr exceeds maxBuffer, regardless of whether maxBuffer was specified or is the implicit default (the table in the PR description shows the behavior)

@ide ide requested a review from kitten May 23, 2026 06:47
Development

Successfully merging this pull request may close these issues.

Buffer output support

3 participants