fix(logger): preserve traceback in JSON sink#2594

Open
mikasenghaas wants to merge 3 commits into main from fix/json-logger-traceback-with-enqueue

Conversation

@mikasenghaas
Member

@mikasenghaas mikasenghaas commented May 22, 2026

Summary

  • Pre-format the loguru traceback into record["extra"]["traceback"] via a traceback_patcher that runs at log time, before enqueue=True pickles the record. The JSON sink reads it back into the "exception" field, restoring full stack frames in JSON logs.
  • Keeps enqueue=True — required for multiprocess safety of trainer torchrun workers sharing stdout.
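
The patcher idea above can be sketched with the standard library alone. This is a minimal illustration, not the PR's actual code: `FakeRecordException` and `make_record` are hypothetical stand-ins for a loguru record, and the `"traceback"` key follows the name used in this summary. With loguru itself, a function like this would presumably be registered through `logger.configure(patcher=...)` or `logger.patch(...)`.

```python
import pickle
import traceback
from collections import namedtuple

# Hypothetical stand-in for loguru's record["exception"] value,
# which exposes .type, .value and .traceback.
FakeRecordException = namedtuple("FakeRecordException", ["type", "value", "traceback"])


def traceback_patcher(record):
    """Render the live traceback to a string and stash it in record["extra"]."""
    exc = record["exception"]
    if exc is not None and exc.traceback is not None:
        record["extra"]["traceback"] = "".join(
            traceback.format_exception(exc.type, exc.value, exc.traceback)
        )


def make_record():
    """Build a record-like dict while the traceback object is still alive."""
    try:
        raise RuntimeError("boom")
    except RuntimeError as err:
        record = {
            "message": "Fatal error",
            "exception": FakeRecordException(type(err), err, err.__traceback__),
            "extra": {},
        }
        # The patcher runs at log time, before anything is pickled.
        traceback_patcher(record)
        return record


record = make_record()
# Simulate the queue crossing: a plain string pickles without loss.
survived = pickle.loads(pickle.dumps(record["extra"]))
```

Because the stored value is an ordinary string, it does not depend on the traceback object surviving serialization.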

Why

When --log.json_logging is enabled, exception logs lose their stack frames. Reproduction with the orchestrator entrypoint produced:

{"level": "ERROR", "message": "Fatal error in orchestrate", "exception": "RuntimeError: ...\n"}

The output has no "Traceback (most recent call last):" header and no frames, only the exception type and message.

Root cause

In loguru/_recattrs.py:68, RecordException.__reduce__ unconditionally sets the traceback to None before pickling, because traceback objects aren't picklable:

def __reduce__(self):
    try:
        pickled_value = pickle.dumps(self.value)
    except Exception:
        return (RecordException, (self.type, None, None))
    else:
        return (RecordException._from_pickled_value, (self.type, pickled_value, None))
                                                                                  # ^^^^ traceback dropped

With enqueue=True the record is pickled to cross the queue, so by the time json_sink receives it both exc.traceback and exc.value.__traceback__ are None, and traceback.format_exception(...) renders only the head line.
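
The same effect can be reproduced without loguru. The sketch below uses only the standard library and assumes nothing about loguru internals; `capture` is a hypothetical helper. It checks that a traceback object cannot be pickled, and that an exception which has been pickled and restored comes back without its `__traceback__`, so formatting it yields only the final line.

```python
import pickle
import traceback


def capture():
    """Raise and return an exception so that it carries a live traceback."""
    try:
        raise RuntimeError("boom")
    except RuntimeError as err:
        return err


err = capture()
had_traceback = err.__traceback__ is not None

# Traceback objects themselves are not picklable.
try:
    pickle.dumps(err.__traceback__)
    tb_picklable = True
except (TypeError, pickle.PicklingError):
    tb_picklable = False

# The exception value round-trips, but its traceback does not come along.
restored = pickle.loads(pickle.dumps(err))

# With no traceback attached, only the "Type: message" line is rendered.
rendered = "".join(
    traceback.format_exception(type(restored), restored, restored.__traceback__)
)
```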

Verification

Injected raise RuntimeError("...") at the start of orchestrate(), then ran orchestrator @ configs/debug/orch.toml --log.json_logging:

Before:

{"level": "ERROR", "exception": "RuntimeError: ...\n"}

After:

{"level": "ERROR", "exception": "Traceback (most recent call last):\n  File \".../utils.py\", line 59, in async_wrapper\n    ret = await func(*args, **kwargs)\n  File \".../orchestrator.py\", line 96, in orchestrate\n    raise RuntimeError(...)\nRuntimeError: ...\n"}

Why not drop enqueue=True?

Per loguru docs:

"Whether the messages to be logged should first pass through a multiprocessing-safe queue before reaching the sink. This is useful while logging to a file through multiple processes."

The trainer's torchrun workers all write JSON lines to a shared inherited stdout. Without the queue, concurrent writes from different processes can interleave at byte boundaries and corrupt JSON lines. Loguru's per-sink lock handles intra-process threading but is not multiprocess-safe.


Note

Medium Risk
Changes how exceptions are captured and serialized in JSON logging mode (via Loguru patchers and extra mutation), which could affect error visibility and structured log fields across async/multiprocess logging.

Overview
Fixes JSON logging so exception stack traces are preserved when the sink runs with enqueue=True.

This pre-formats the traceback at log time via a new traceback_patcher (stored in record["extra"]["traceback"]) and updates build_log_entry() to prefer that preformatted traceback when populating the JSON exception field, falling back to formatting record["exception"] only when needed.
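
The "prefer the preformatted traceback, fall back otherwise" logic might look roughly like the sketch below. It is an assumption-laden illustration rather than the repository's `build_log_entry()`: the real signature and fields are not shown in this PR, `ExcInfo` is a hypothetical stand-in for the record's exception tuple, and the level is simplified to a plain string.

```python
import json
import traceback
from collections import namedtuple

# Hypothetical stand-in for the record's exception tuple.
ExcInfo = namedtuple("ExcInfo", ["type", "value", "traceback"])


def build_log_entry(record):
    """Build a JSON-ready dict, preferring a preformatted traceback string."""
    entry = {"level": record["level"], "message": record["message"]}
    preformatted = record["extra"].get("traceback")
    exc = record["exception"]
    if preformatted is not None:
        # Set at log time by the patcher; survives the enqueue pickling step.
        entry["exception"] = preformatted
    elif exc is not None:
        # Fallback: format whatever is left (may be only the final line).
        entry["exception"] = "".join(
            traceback.format_exception(exc.type, exc.value, exc.traceback)
        )
    return entry


error = RuntimeError("x")

# Record that went through the patcher before being dequeued.
entry_pre = build_log_entry({
    "level": "ERROR",
    "message": "Fatal error in orchestrate",
    "extra": {"traceback": "Traceback (most recent call last):\nRuntimeError: x\n"},
    "exception": ExcInfo(RuntimeError, error, None),
})

# Record without the preformatted string: only the type and message remain.
entry_fallback = build_log_entry({
    "level": "ERROR",
    "message": "Fatal error in orchestrate",
    "extra": {},
    "exception": ExcInfo(RuntimeError, error, None),
})

line = json.dumps(entry_pre)
```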

Reviewed by Cursor Bugbot for commit 7d6b55b. Bugbot is set up for automated code reviews on this repo.

loguru's RecordException.__reduce__ unconditionally drops the
traceback object before pickling, so records dequeued by the
enqueue=True worker have exc.traceback=None and the JSON sink
emits only the exception type + message - no frames.

Pre-format the traceback into extra["_traceback"] via a loguru
patcher that runs at log time, while the live traceback is still
attached. The string survives pickling, and build_log_entry reads
it back into the "exception" field.

enqueue=True is kept because the trainer's torchrun workers all
write JSON lines to a shared inherited stdout, and loguru's docs
flag this as the case where the multiprocessing-safe queue is
required to avoid interleaved/corrupt output.
@mikasenghaas mikasenghaas requested review from JannikSt and samsja May 22, 2026 02:40
@mikasenghaas mikasenghaas changed the title fix(logger): preserve traceback in JSON sink with enqueue=True fix(logger): preserve traceback in JSON sink May 22, 2026
@mikasenghaas mikasenghaas marked this pull request as ready for review May 22, 2026 03:07