GenSpectrum · fengelniederhammer · Feb 25, 2026 · Apr 27, 2026 · Apr 27, 2026 · Apr 30, 2026
diff --git a/.clang-tidy b/.clang-tidy
@@ -34,9 +34,9 @@ Checks: >-
 # positives for overridden 'T& operator=(T&)'.
 CheckOptions:
   - key: readability-identifier-length.IgnoredVariableNames
-    value: '_|to'
+    value: '_|to|it|op'
   - key: readability-identifier-length.IgnoredParameterNames
-    value: '_|to'
+    value: '_|to|it|op'
   - key: readability-identifier-naming.NamespaceCase
     value: lower_case
   - key: readability-identifier-naming.ClassCase

diff --git a/1238-review/aggregate_node.md b/1238-review/aggregate_node.md
@@ -0,0 +1,40 @@
+# Review: aggregate_node.h / aggregate_node.cpp (PR #1238)
+
+## Overall
+
+Good generalization from hardcoded COUNT → extensible `AggregateFunction` enum + `AggregateDefinition`. Clean Arrow Acero integration. Follows project patterns (public members, `SILO_UNREACHABLE`, `SILO_ASSERT`). Few issues below.
+
+## Findings
+
+### aggregate_node.h
+
+`L22: 🟡 risk: AggregateDefinition::source_column` not validated for COUNT. COUNT ignores it, but caller can pass `source_column="bogus"` silently. When SUM/AVG added, forgetting validation here = silent wrong results. Add `SILO_ASSERT(!source_column.has_value())` in COUNT branch, or validate at construction.
+
+`L17: 🔵 nit: enum AggregateFunction` — single-value enum fine for extensibility scaffolding, but add brief comment like `// Extended by future PRs (SUM, AVG, etc.)` so readers know it's intentional, not dead code.
+
+### aggregate_node.cpp
+
+`L40: 🟡 risk: source_refs` always empty. For COUNT this is correct (`count_all` takes no source), but when SUM/MIN/MAX added, `source_refs` must be populated from `agg.source_column`. Current structure doesn't make this obvious — the empty vector is constructed then moved without any branch populating it. Consider adding a comment or an assert: `SILO_ASSERT(source_refs.empty())` in COUNT branch to make the invariant explicit.
+
+`L32: 🔵 nit: input_schema` param only used in L67 assert. In release builds with asserts compiled out, this becomes an unused parameter. Either `[[maybe_unused]]` or restructure so the schema validation is always active (return error instead of assert).
+
+`L67: 🟡 risk: SILO_ASSERT for schema validation.` `CanReferenceFieldByName` check is debug-only. If group_by field doesn't exist in input schema, release build silently passes bad field ref to Arrow → runtime crash in Arrow internals with unhelpful error. Should be a proper error return (`arrow::Status::Invalid(...)`) not an assert.
+
+`L90: 🔵 nit: uninitialized local` `schema::ColumnType type;` — technically fine because switch covers all enum values and compiler warns on missing cases, but initializing to a sentinel or using a helper function (like `arrowFunctionName` pattern) would be more defensive. If someone adds enum value and forgets this switch, UB from uninitialized read.
+
+`L4-6: 🔵 nit: unused includes.` `<map>`, `<string>`, `<vector>` — these are already transitively included via the header. Not wrong, but the header already includes them. Project style seems to prefer explicit includes so this is fine, just noting.
+
+### Missing
+
+`❓ q: No unit tests for AggregateNode.` `aggregate_node.test.cpp` doesn't exist. The generalization from hardcoded COUNT to configurable aggregates is a behavioral change — should have at least: (1) test COUNT with no groups, (2) test COUNT with groups, (3) test empty aggregates vector, (4) test `getOutputSchema` returns correct types. Integration coverage via e2e tests may exist but unit tests catch regressions faster.
+
+## Summary
+
+| Severity | Count |
+|----------|-------|
+| 🔴 Critical | 0 |
+| 🟡 Risk | 3 |
+| 🔵 Nit | 3 |
+| ❓ Question | 1 |
+
+Main concern: L67 assert-only validation of group_by fields against input schema. Release builds skip this → bad field refs hit Arrow internals. Convert to proper error return.
diff --git a/1238-review/ast.md b/1238-review/ast.md
@@ -0,0 +1,56 @@
+# PR #1238 — ast.h / ast.cpp Review
+
+## Summary
+
+Clean AST design. Variant-based node types, consistent extract/check helpers, good error messages with source locations. Few real issues, mostly precision/consistency concerns.
+
+---
+
+## Findings
+
+### ast.cpp
+
+`L143-153: 🟡 risk: extractFloatLiteral silently converts int64_t→double. int64_t values >2^53 lose precision. Callers (minProportion etc.) unlikely to hit this, but no guard exists. Add range check or document assumption.`
+
+`L241-243 vs L143-153: 🟡 risk: isFloatLiteral() returns false for IntLiteral, but extractFloatLiteral() accepts IntLiteral. Semantic mismatch — caller doing if(isFloatLiteral(x)) extractFloatLiteral(x) works, but if(!isFloatLiteral(x)) doesn't mean extractFloatLiteral will throw. Consider isNumericLiteral() or rename to extractNumericAsFloat().`
+
+`L9-28: 🔵 nit: binaryOpToString — L28 return "?" after switch covering all enum values. Compiler warns on missing enum case already. Replace with std::unreachable() (C++23) or SILO_ASSERT(false) to catch corruption instead of silently returning "?".`
+
+`L143-153: 🔵 nit: extractFloatLiteral uses explicit throw while all other extract* fns use CHECK_SILO_QUERY macro. Inconsistent error-handling style. Use CHECK_SILO_QUERY with a combined holds_alternative check, or document why this one is different.`
+
+`L56-71: 🔵 nit: FunctionCall::toString builds args string via repeated += concatenation. Fine for small arg lists. Consider fmt::join or std::ostringstream if arg counts grow. Not blocking.`
+
+### ast.h
+
+`L99-111: ❓ q: ExpressionVariant has 12 types. std::variant visit generates jump table — fine for correctness. Any profiling data on variant dispatch overhead in hot query paths? If toString() is debug-only, no concern. If extract* called per-row, might matter.`
+
+`L122-130: 🔵 nit: extract* functions return by value (string, vector). Fine for move semantics. extractSetLiteral returns const ref — good. Consider returning std::string_view from extractIdentifierName/extractStringLiteral if callers don't need ownership (avoids copy).`
+
+### ast_to_query.cpp (related — not in review scope but worth noting)
+
+`ast_to_query.cpp:L74,184,284,287,376,401,411,737: 🟡 risk: static_cast<uint32_t>(extractIntLiteral(...)) — int64_t→uint32_t truncation. Negative values or values >UINT32_MAX silently wrap. Should validate range before cast. This is in the caller, not ast.cpp, but the pattern is pervasive and the AST could provide a safe extractUint32Literal().`
+
+### Testing
+
+`🟡 risk: No test file found for ast.h/ast.cpp (no ast.test.cpp). Extract functions have non-trivial logic (type coercion, date validation, set extraction). Unit tests needed — especially for extractFloatLiteral int→double edge cases, extractDateValue with invalid dates, and the isX/extractX semantic contract.`
+
+---
+
+## Good Practices
+
+- SourceLocation in every error message — excellent for user-facing diagnostics
+- `[[nodiscard]]` on all query functions — prevents silent discard bugs
+- `ExpressionPtr` = unique_ptr — clear ownership, no leaks
+- extractDateValue validates via stringToDate32 and propagates error string — thorough
+- extractOptionalDateValue cleanly composes with NullLiteral check — nice pattern
+- Variant-based AST avoids inheritance hierarchy — good modern C++ choice
+
+---
+
+## Verdict
+
+Solid code. Main actionable items:
+1. **isFloatLiteral/extractFloatLiteral semantic mismatch** — rename or add isNumericLiteral
+2. **int64_t→double precision loss** — add guard in extractFloatLiteral
+3. **No unit tests** — add ast.test.cpp
+4. **binaryOpToString unreachable** — use std::unreachable() instead of return "?"
diff --git a/1238-review/ast_to_query.md b/1238-review/ast_to_query.md
@@ -0,0 +1,52 @@
+# Review: `src/silo/query_engine/saneql/ast_to_query.cpp`
+
+## Bugs
+
+L128-131: 🔴 bug: `LESS_THAN` and `LESS_EQUAL` both produce identical `FloatBetween(col, nullopt, value)`. `FloatBetween::compile` uses `Comparator::LESS` (strict `<`), so `x <= 5.0` silently becomes `x < 5.0`. Same issue L133-136: `GREATER_THAN` and `GREATER_EQUAL` both produce `FloatBetween(col, value, nullopt)` which compiles to `>=`, so `x > 5.0` becomes `x >= 5.0`. Fix: either add a `bool inclusive` flag to `FloatBetween`, or use epsilon adjustment, or add separate `FloatLessThan`/`FloatLessEqual` expressions.
+
+L107: 🔴 bug: `value > 0 ? std::optional<uint32_t>(value - 1) : 0` — when `value == 0`, the ternary returns bare `0` (an `int`), not `std::optional<uint32_t>(0)`. This means `x < 0` on unsigned produces `IntBetween(col, nullopt, 0)` which matches rows where `col == 0` — should match nothing. Fix: return `std::nullopt` and wrap in a `False` expression, or return `IntBetween(col, nullopt, std::optional<uint32_t>{})` with both bounds empty then intersect with False.
+
+L112-113: 🔴 bug: `GREATER_THAN` computes `value + 1` — unsigned overflow when `value == UINT32_MAX`. Produces `IntBetween(col, 0, nullopt)` which matches everything. Fix: check for `UINT32_MAX` and return `False` expression.
+
+## Narrowing casts (int64_t → uint32_t without validation)
+
+L74: 🟡 risk: `static_cast<uint32_t>(extractIntLiteral(value_expr))` — `extractIntLiteral` returns `int64_t`. Negative values or values > UINT32_MAX silently truncate. Fix: add range check + `IllegalQueryException`.
+
+L184: 🟡 risk: same pattern in `convertComparisonToFilter`.
+
+L284,287: 🟡 risk: same pattern in `handleBetween`.
+
+L376,401,411: 🟡 risk: same pattern in `handleSymbolEquals`, `handleHasMutation`, `handleInsertionContains`.
+
+L433: 🟡 risk: `static_cast<int>(extractIntLiteral(...))` for `nOf` count — `int64_t` → `int` truncation. Negative count also not validated.
+
+L737: 🟡 risk: same pattern in `handleLimit`.
+
+**Recommendation:** Extract a helper like `extractUint32Literal(expr, param_name)` that validates range `[0, UINT32_MAX]` and throws `IllegalQueryException` on out-of-range. Use everywhere. `getOptionalUint32` (function_registry.cpp:49) already checks `>= 0` but not `<= UINT32_MAX` — fix that too.
+
+## Design
+
+L471-472: ❓ q: bare identifier in filter context becomes `BoolEquals(name, true)`. Intentional? If user writes `.filter(some_column)` where `some_column` is a string column, they get a confusing runtime error instead of a clear "expected boolean expression" message. Consider checking column type or at least documenting this behavior.
+
+L680,701: 🟡 risk: `handleMutations` and `handleInsertions` dispatch `Nucleotide` vs `AminoAcid` via `args.functionName() == "mutations"` / `"insertions"` string comparison. Fragile — rename the registered function and this silently breaks. Fix: register separate handlers, or use an enum/tag passed at registration time.
+
+L723: 🔵 nit: `static_cast<uint32_t>(std::chrono::system_clock::now().time_since_epoch().count())` — truncates a 64-bit nanosecond count to 32 bits. Works as a seed but loses entropy. Consider `std::random_device` or at least cast from `steady_clock` which is monotonic.
+
+## Edge cases
+
+L152: 🟡 risk: `date_val.value() - 1` for `LESS_THAN` on dates — no underflow check. If `date_val` is the minimum representable date, this wraps. Same L160 for `date_val.value() + 1` overflow.
+
+L377: Good — `position > 0` check exists for 1-indexed positions. But L411 (`handleInsertionContains`) has no such check — insertions may be 0-indexed but this should be documented or validated consistently.
+
+## Style
+
+L12-14: 🔵 nit: include order — `parser.h` (internal) appears before `<fmt/format.h>` and `<re2/re2.h>` (external). Per AGENTS.md include order: corresponding header → system → external → internal.
+
+L958-959: 🔵 nit: lines exceed 100-char column limit (clang-format should catch this).
+
+## Good practices
+
+- Registry pattern clean and extensible — adding new functions is one line each.
+- `BoundArguments` abstraction keeps handlers focused on semantics.
+- Error messages include source locations — good for user-facing diagnostics.
+- `CHECK_SILO_QUERY` used consistently for validation.
diff --git a/1238-review/database.md b/1238-review/database.md
@@ -0,0 +1,51 @@
+# PR #1238 Review: `src/silo/database.cpp`
+
+## Summary
+
+Migration from JSON-based query interface to SaneQL. Three main changes: `getFilteredBitmap` uses SaneQL parser, `getPrevalentMutations` builds SaneQL query string via `fmt::format`, `executeQueryAsArrowIpc` simplified to single `query_string` param. Overall direction good — removes 11 action/binder includes, simplifies code paths.
+
+---
+
+## Findings
+
+### `getPrevalentMutations` (L276–L312)
+
+L286-293: 🟡 risk: **SaneQL injection via `filter` parameter.** `filter` is user-provided string interpolated raw into SaneQL query via `fmt::format`. If `filter` contains `)` or `.` it can break out of `.filter(...)` and chain arbitrary pipeline operations. Example: `filter = "true).project({primaryKey})"` → `table.filter(true).project({primaryKey}).mutations(...)`. The SaneQL parser will happily parse this as a valid pipeline. Unlike the old code path which parsed `filter` as a standalone JSON expression, here it's string-concatenated into a larger query. Fix: parse `filter` into an AST separately (like `getFilteredBitmap` does at L265-267), then pass the filter expression object directly to the planner instead of round-tripping through string concatenation. Alternatively, use `planSaneqlQuery` only for the full pipeline and construct the `MutationsNode` directly as the old code did.
+
+L286-293: 🔵 nit: `table_name` and `sequence_name` also interpolated raw. These come from internal callers (L320-321, L331) so lower risk, but same injection class. If any caller ever passes user-controlled table/sequence names, same problem applies.
+
+L303: 🟡 risk: `result_stream >> json_line` reads whitespace-delimited tokens, not lines. Works today because `NdjsonSink` emits compact JSON (no spaces). But if NDJSON output ever includes spaces (e.g. string values with spaces like `"mutation":"A 123 T"`), `>>` will split one JSON object across multiple reads → parse failures. Fix: use `std::getline(result_stream, json_line)` instead.
+
+L306: 🔵 nit: **Type mismatch.** `count` field is `ColumnType::INT32` in `MutationsNode` (mutations_node.h:84), serialized via `Int32Array`. Parsed here as `uint64_t`. Works for positive values but semantically wrong. Should be `int32_t` or `uint32_t` to match Arrow schema. Pre-existing issue (old code used `SymbolMutations::COUNT_FIELD_NAME` but same `uint64_t` type), so not a regression — but worth fixing while touching this code.
+
+L305,307: 🔵 nit: Hardcoded `"count"` and `"mutation"` strings replace `SymbolMutations::COUNT_FIELD_NAME` / `MUTATION_FIELD_NAME` constants. Loses compile-time coupling — if field names change in `MutationsNode`, these will silently break at runtime. Consider using `operators::MutationsNode<SymbolType>::COUNT_FIELD_NAME` and `MUTATION_FIELD_NAME` constants (they're still defined in mutations_node.h:21,28).
+
+L283-284: ✅ Good: `constexpr std::string_view` with `if constexpr`-style ternary for selecting mutation function name. Clean pattern.
+
+### `getFilteredBitmap` (L254–L274)
+
+L265-267: ❓ q: `getFilteredBitmap` parses `filter` as standalone SaneQL expression via `Parser` + `convertToFilter`. This is the safe approach (no injection possible — parser validates syntax, `convertToFilter` only accepts filter-context AST nodes). Why doesn't `getPrevalentMutations` use the same pattern? The asymmetry is suspicious.
+
+L260-262: 🔵 nit: Table-not-found returns empty `Roaring{}` with `SPDLOG_ERROR`. Other methods (L185, L208, L234) throw `std::runtime_error`. Inconsistent error handling. Pre-existing, not introduced by this PR.
+
+### `executeQueryAsArrowIpc` (L480–L496)
+
+L480: ✅ Good: Clean simplification. Single `query_string` param, delegates to `planSaneqlQuery`. No injection concern here because caller passes full query — no string interpolation.
+
+L489-493: ✅ Good: Proper Arrow status checking with descriptive error message.
+
+### Includes (L1–L38)
+
+L35-36: ✅ Good: Clean swap — 11 old action/binder includes replaced by 2 saneql includes. Reduces coupling.
+
+L30: ❓ q: `filter/expressions/true.h` still included for `printAllData` (L192). Correct, just noting it's not dead.
+
+---
+
+## Verdict
+
+**Main concern:** SaneQL injection in `getPrevalentMutations` (L286-293). `filter` is user-provided and interpolated raw into query string. Should parse filter separately like `getFilteredBitmap` does, or construct operator tree directly.
+
+**Secondary:** `>>` vs `getline` for NDJSON parsing (L303) is fragile. Low risk today, time bomb tomorrow.
+
+Rest of changes clean and well-structured.
diff --git a/1238-review/database_pyx.md b/1238-review/database_pyx.md
@@ -0,0 +1,51 @@
+# Review: `database.pyx` + `database.pxd` — PR #1238
+
+## .pxd ↔ C++ Header Signature Match
+
+✅ **`executeQueryAsArrowIpc`**: `.pxd` L31 declares `string executeQueryAsArrowIpc(string query_string)` — matches C++ `std::string executeQueryAsArrowIpc(const std::string& query_string) const` at `database.h:107`. Single-arg signature correct after `table_name` removal.
+
+✅ **`getFilteredBitmap`**: `.pxd` L29 matches C++ L71. Two args: `table_name`, `filter`.
+
+✅ **`getPrevalentNucMutations` / `getPrevalentAminoAcidMutations`**: `.pxd` L27-28 match C++ L81-93. Four args each.
+
+✅ **Exception handling**: `.pxd` uses `except +handle_silo_exception` for query methods — correct pattern.
+
+## Findings
+
+### 🟡 risk — `database.pyx:L520`: Inconsistent IPC buffer conversion
+
+`execute_query` passes `ipc_buffer` (C++ `std::string`) directly to `pa.BufferReader(ipc_buffer)`. Cython auto-converts `std::string` → Python `bytes`, so this *works*, but `get_tables` at L76 does explicit `(<char*> ipc_buffer.data())[:ipc_buffer.size()]` conversion. Inconsistent pattern. The explicit cast avoids an extra copy in some Cython versions. Pick one pattern, use everywhere.
+
+### 🔵 nit — `database.pyx:L295`: Wrong return type in docstring
+
+`get_amino_acid_reference_sequence` docstring says `Returns: str — The nucleotide reference sequence`. Should say "amino acid reference sequence". Copy-paste from `get_nucleotide_reference_sequence`.
+
+### 🔵 nit — `database.pyx:L315-329,L373-391`: Docstrings don't mention SaneQL
+
+`get_prevalent_nucleotide_mutations` and `get_prevalent_amino_acid_mutations` docstrings say `filter_expression : str, optional — Filter expression to apply (default: "")` but don't specify this is now SaneQL syntax. Compare with `get_filtered_bitmap` L440 which correctly says "SaneQL filter expression". Should be consistent across all filter-accepting methods.
+
+### 🔵 nit — `database.pyx:L347,L405`: Comment says "True filter" — ambiguous
+
+Comments say `# Default to True filter (returns all rows) if no filter specified`. After SaneQL migration, worth clarifying this is SaneQL `true` literal, not JSON `{"type": "True"}`. Same comment at L456.
+
+### ✅ Good — Default filter `'true'` is valid SaneQL
+
+`'true'` is valid SaneQL boolean literal. Default behavior preserved across JSON→SaneQL transition. No breaking change for callers using default.
+
+### ✅ Good — `execute_query` docstring updated
+
+L493-494 correctly documents SaneQL syntax with examples like `'sequences.filter(true)'`. This is the right pattern.
+
+### ❓ q — `database.pyx:L431`: `get_filtered_bitmap` takes `table_name` separately
+
+`get_filtered_bitmap` takes `table_name` as separate param + SaneQL `filter_expression`. But `execute_query` takes full SaneQL query where table name is embedded (e.g. `"sequences.filter(true)"`). Is this intentional asymmetry? `getFilteredBitmap` C++ signature confirms it takes separate `table_name` + `filter` — so this is correct, but the two APIs have different mental models. Worth a note in docstring that `filter_expression` here is just the filter part, not a full SaneQL query.
+
+### ✅ Good — Breaking change handling
+
+`executeQueryAsArrowIpc` signature change (removed `table_name`) is correctly reflected. Python `execute_query` now takes single `query_string` with table name embedded in SaneQL. This is a breaking change for Python API users who were passing table_name separately — but since this is a `!` (breaking) PR, that's expected.
+
+## Summary
+
+**No bugs found.** Signatures match. Default `'true'` valid. Two nits on docstrings (copy-paste error, missing SaneQL mention). One minor inconsistency in IPC buffer handling pattern. Code is clean and well-structured.
+
+Verdict: **Ship it** (after docstring fixes).