Draft umbrella issue for a small Waybar performance PR stack. The evidence below is from the direct baseline.perf.data vs optimized.perf.data comparison, not from the earlier stripped-binary profile.
Why This Stack
baseline.perf.data is dominated by repeated user-space work in Waybar's hot paths: regex-heavy JSON preparation, Sway full-tree IPC, JsonCpp tree parsing, and GTK label/layout churn. The PRs below target those costs without replacing JsonCpp, rewriting GTK/Pango rendering, or introducing a large Sway backend rewrite.
| Baseline hot area | Cost | Why it matters | Targeted by |
| --- | --- | --- | --- |
| Sway IPC sendCmd child path | 2.66B cycles | event handling repeatedly waits on and processes Sway replies | workspace/window Sway fast paths |
| getTree child path | 1.18B cycles | full Sway tree fetches are expensive when only smaller payloads are needed | workspace/window Sway fast paths |
| std::regex search/replace | ~1.10B child cycles | JSON parsing pays a large common-case regex tax | JSON parser cleanup |
| std::regex::_M_dfs self | 743M cycles | largest Waybar-side self hotspot in the baseline | JSON parser cleanup |
| Json parse path | 834M cycles | parsing full Sway replies dominates residual user-space work | parser and Sway payload reductions |
| Json array parsing | 799M cycles | tree/list parsing is amplified by full-tree IPC replies | Sway payload reductions |
| GTK layout/draw-ish | 213M cycles | no-op label writes can still trigger layout/redraw paths | label markup guard |
Important
JsonCpp is still the largest percentage bucket in the optimized profile, but its absolute cost is much lower: roughly 700M cycles to 186M cycles, or -73.4%. The percentage is high because the total profile is much smaller.
PR Stack
The PRs are ordered from the smallest, most local change to the most behavior-sensitive one. Each section includes the relevant baseline signal and the corresponding before/after effect where the aggregate comparison can attribute it.
1. Parse JSON directly from string buffers
PR: perf(json): parse directly from string buffers
Why this was needed: baseline.perf.data showed the JSON utility paying a large common-case regex tax before parsing. std::regex search/replace accounted for about 1.10B child cycles, and std::regex::_M_dfs alone accounted for 743M self cycles. Json parsing itself was also hot at 834M child cycles.
Change: Parse from the existing string buffer and only run the \x compatibility repair when that escape is present.
Effect in comparison: The regex path disappears from visible hot symbols. Json parse work drops from 834M to 188M cycles, and the overall JsonCpp bucket drops from roughly 700M to 186M cycles.
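A minimal sketch of the guard described in the Change above, not the actual Waybar patch. The helper names (repairHexEscapes, prepareJson) and the choice to rewrite \xNN as \u00NN are assumptions for illustration; the source only states that the \x repair runs when that escape is present. The returned buffer is what would be handed to the JsonCpp reader.

```cpp
#include <regex>
#include <string>

// Hypothetical repair step: rewrite a literal "\xNN" escape, which JSON does
// not allow, into the equivalent "\u00NN" form.
inline std::string repairHexEscapes(const std::string& data) {
  static const std::regex hexEscape(R"(\\x([0-9A-Fa-f]{2}))");
  return std::regex_replace(data, hexEscape, "\\u00$1");
}

// Returns the buffer that should be parsed. In the common case the original
// string is returned untouched, so no regex runs and nothing is copied.
// `scratch` is only written when a "\x" escape is actually present.
inline const std::string& prepareJson(const std::string& data, std::string& scratch) {
  if (data.find("\\x") == std::string::npos) {
    return data;
  }
  scratch = repairHexEscapes(data);
  return scratch;
}
```

The plain substring search is linear and cheap compared with running std::regex over every reply, which is where the baseline spent its std::regex::_M_dfs time.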
2. Avoid duplicate MPD playing-state updates
PR: perf(mpd): avoid duplicate playing-state updates
Why this was needed: MPD was not a dominant bucket in baseline.perf.data; the baseline was dominated by Sway IPC, regex, JSON, and label work. This PR is included because the code had an obvious duplicate periodic fetch/update while the stack was being trimmed down.
Change: Playing::on_timer() already fetches state before checking whether playback is active. Reuse that result instead of calling queryMPD() immediately afterward and emitting twice.
Effect in comparison: This is a small local cleanup rather than a primary aggregate-profile driver. It removes redundant periodic MPD work without changing the larger Sway/JSON result.
3. Skip unchanged label and tooltip markup
PR: perf(label): skip redundant markup updates
Why this was needed: baseline.perf.data had visible GTK/UI update work. The GTK layout/draw-ish path accounted for about 213M cycles, which is avoidable when a module writes identical markup repeatedly.
Change: Cache the last label and tooltip markup and skip Gtk::Label::set_markup() / set_tooltip_markup() when the markup is unchanged.
Effect in comparison: GTK layout/draw-ish work drops from 213M to 17M cycles, indicating fewer no-op relayout/redraw paths.
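The caching idea can be sketched without GTK as a small guard object. MarkupCache and shouldApply are hypothetical names, not taken from the PR; the sketch only shows the compare-then-store logic that would sit in front of Gtk::Label::set_markup() and set_tooltip_markup().

```cpp
#include <optional>
#include <string>

// Hypothetical guard that remembers the last markup written to a widget.
class MarkupCache {
 public:
  // Returns true when `markup` differs from the previously stored value,
  // meaning the widget should be updated. The first call always returns true.
  bool shouldApply(const std::string& markup) {
    if (last_ && *last_ == markup) {
      return false;  // identical markup: skip the widget call entirely
    }
    last_ = markup;
    return true;
  }

 private:
  std::optional<std::string> last_;
};
```

A caller would keep one cache per label and one per tooltip, and only call the GTK setter when shouldApply() returns true.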
4. Use smaller Sway workspace replies for simple configs
PR: perf(sway/workspaces): avoid full tree for simple configs
Why this was needed: baseline.perf.data showed repeated Sway tree/list work: Sway IPC sendCmd child path at 2.66B cycles, getTree child path at 1.18B cycles, Json parse path at 834M cycles, and Json array parsing at 799M cycles.
Change: Use IPC_GET_WORKSPACES when sway/workspaces does not need per-window tree data. Preserve IPC_GET_TREE for window-rewrite, where child window nodes are required.
Effect in comparison: This contributes to the reduction in repeated tree/list parsing. Json array parsing drops from 799M to 145M cycles, while full Sway IPC/tree work is much lower across the optimized stack.
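As a rough illustration of the request selection, assuming the module can tell up front whether window-rewrite is configured. WorkspacesConfig, hasWindowRewrite, and selectWorkspaceRequest are hypothetical names; the numeric values are the message type codes from the sway IPC protocol.

```cpp
#include <cstdint>

// Message type codes from the sway IPC protocol.
enum class IpcRequest : std::uint32_t {
  GetWorkspaces = 1,  // IPC_GET_WORKSPACES: flat list of workspaces
  GetTree = 4,        // IPC_GET_TREE: full layout tree including windows
};

// Hypothetical, simplified view of the sway/workspaces configuration.
struct WorkspacesConfig {
  bool hasWindowRewrite = false;  // true when "window-rewrite" is configured
};

// Pick the smallest reply that still carries the data the module needs.
constexpr IpcRequest selectWorkspaceRequest(const WorkspacesConfig& config) {
  return config.hasWindowRewrite ? IpcRequest::GetTree : IpcRequest::GetWorkspaces;
}
```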
5. Avoid Sway window tree fetches for simple events
PR: perf(sway/window): avoid tree fetches for simple events
Why this was needed: The largest baseline child path was Sway IPC/event processing: sendCmd at 2.66B cycles. Repeated full-tree work was also prominent: getTree at 1.18B cycles, plus Json parse/tree work in the hundreds of millions of cycles.
Change: Use the Sway window event payload for focused-window title/mark updates. Keep IPC_GET_TREE for structural events such as focus, move, close, floating, and workspace changes.
Effect in comparison: This directly attacks the event -> IPC_GET_TREE -> full JsonCpp parse -> tree walk path. sendCmd drops from 2.66B to 214M cycles, and getTree drops from 1.18B to 42M cycles.
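The event split above can be sketched as a single predicate. WindowEvent and canUseEventPayload are hypothetical names, and the struct is a simplified stand-in for the parsed event: change mirrors the event's change field and containerFocused mirrors the focused flag of its container.

```cpp
#include <string>

// Hypothetical, simplified summary of a sway window event.
struct WindowEvent {
  std::string change;     // value of the event's "change" field
  bool containerFocused;  // whether the event's container is the focused one
};

// True when the event payload alone is enough to update the module.
// Structural changes (focus, move, close, floating, ...) return false and
// would still go through IPC_GET_TREE.
inline bool canUseEventPayload(const WindowEvent& event) {
  const bool simpleChange = event.change == "title" || event.change == "mark";
  return simpleChange && event.containerFocused;
}
```

Defaulting every unlisted change value to the tree fetch keeps the fast path conservative: an event type the predicate does not recognise falls back to the existing behavior.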
Outcome
The optimized build reduces sampled CPU work by about 86% over a comparable 30s capture.
| Metric | Baseline | Optimized | Change | Read |
| --- | --- | --- | --- | --- |
| Total cycles | 3.42B | 0.46B | -86.5% | about 7.4x less sampled work |
| Cycles/sec | 114.5M/s | 15.9M/s | -86.1% | normalized result is also about 7.2x better |
| Samples | 332 | 58 | -82.5% | far fewer hot samples |
| Waybar binary bucket | 1.03B | 22.5M | -97.8% | most app-side hot work removed |
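The Change and Read figures follow directly from the Baseline and Optimized values. A small sketch of that arithmetic, with hypothetical helper names, in case the numbers need to be re-derived for a new capture:

```cpp
// Relative change in percent; negative values mean the optimized run is cheaper.
constexpr double percentChange(double baseline, double optimized) {
  return (optimized - baseline) / baseline * 100.0;
}

// How many times less work the optimized run did compared with the baseline.
constexpr double reductionFactor(double baseline, double optimized) {
  return baseline / optimized;
}
```

For example, percentChange(3.42, 0.46) reproduces the roughly -86.5% total-cycles change, and reductionFactor(3.42, 0.46) the roughly 7.4x figure.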
Sampling caveat
The optimized profile had only 58 samples, so small symbol-level deltas are noisy. The large reductions above are still strong enough to treat as real because they are visible in absolute cycle counts and align with the removed code paths.