Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
The table of contents is too big for display.
Diff view
Diff view
  •  
  •  
  •  
4 changes: 2 additions & 2 deletions .clang-tidy
Original file line number Diff line number Diff line change
Expand Up @@ -34,9 +34,9 @@ Checks: >-
# positives for overridden 'T& operator=(T&)'.
CheckOptions:
- key: readability-identifier-length.IgnoredVariableNames
value: '_|to'
value: '_|to|it|op'
- key: readability-identifier-length.IgnoredParameterNames
value: '_|to'
value: '_|to|it|op'
- key: readability-identifier-naming.NamespaceCase
value: lower_case
- key: readability-identifier-naming.ClassCase
Expand Down
40 changes: 40 additions & 0 deletions 1238-review/aggregate_node.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# Review: aggregate_node.h / aggregate_node.cpp (PR #1238)

## Overall

Good generalization from hardcoded COUNT → extensible `AggregateFunction` enum + `AggregateDefinition`. Clean Arrow Acero integration. Follows project patterns (public members, `SILO_UNREACHABLE`, `SILO_ASSERT`). Few issues below.

## Findings

### aggregate_node.h

`L22: 🟡 risk: AggregateDefinition::source_column` not validated for COUNT. COUNT ignores it, but caller can pass `source_column="bogus"` silently. When SUM/AVG added, forgetting validation here = silent wrong results. Add `SILO_ASSERT(!source_column.has_value())` in COUNT branch, or validate at construction.

`L17: 🔵 nit: enum AggregateFunction` — single-value enum fine for extensibility scaffolding, but add brief comment like `// Extended by future PRs (SUM, AVG, etc.)` so readers know it's intentional, not dead code.

### aggregate_node.cpp

`L40: 🟡 risk: source_refs` always empty. For COUNT this is correct (`count_all` takes no source), but when SUM/MIN/MAX added, `source_refs` must be populated from `agg.source_column`. Current structure doesn't make this obvious — the empty vector is constructed then moved without any branch populating it. Consider adding a comment or an assert: `SILO_ASSERT(source_refs.empty())` in COUNT branch to make the invariant explicit.

`L32: 🔵 nit: input_schema` param only used in L67 assert. In release builds with asserts compiled out, this becomes an unused parameter. Either `[[maybe_unused]]` or restructure so the schema validation is always active (return error instead of assert).

`L67: 🟡 risk: SILO_ASSERT for schema validation.` `CanReferenceFieldByName` check is debug-only. If group_by field doesn't exist in input schema, release build silently passes bad field ref to Arrow → runtime crash in Arrow internals with unhelpful error. Should be a proper error return (`arrow::Status::Invalid(...)`) not an assert.

`L90: 🔵 nit: uninitialized local` `schema::ColumnType type;` — technically fine because switch covers all enum values and compiler warns on missing cases, but initializing to a sentinel or using a helper function (like `arrowFunctionName` pattern) would be more defensive. If someone adds enum value and forgets this switch, UB from uninitialized read.

`L4-6: 🔵 nit: unused includes.` `<map>`, `<string>`, `<vector>` — these are already transitively included via the header. Not wrong, but the header already includes them. Project style seems to prefer explicit includes so this is fine, just noting.

### Missing

`❓ q: No unit tests for AggregateNode.` `aggregate_node.test.cpp` doesn't exist. The generalization from hardcoded COUNT to configurable aggregates is a behavioral change — should have at least: (1) test COUNT with no groups, (2) test COUNT with groups, (3) test empty aggregates vector, (4) test `getOutputSchema` returns correct types. Integration coverage via e2e tests may exist but unit tests catch regressions faster.

## Summary

| Severity | Count |
|----------|-------|
| 🔴 Critical | 0 |
| 🟡 Risk | 3 |
| 🔵 Nit | 3 |
| ❓ Question | 1 |

Main concern: L67 assert-only validation of group_by fields against input schema. Release builds skip this → bad field refs hit Arrow internals. Convert to proper error return.
56 changes: 56 additions & 0 deletions 1238-review/ast.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
# PR #1238 — ast.h / ast.cpp Review

## Summary

Clean AST design. Variant-based node types, consistent extract/check helpers, good error messages with source locations. Few real issues, mostly precision/consistency concerns.

---

## Findings

### ast.cpp

`L143-153: 🟡 risk: extractFloatLiteral silently converts int64_t→double. int64_t values >2^53 lose precision. Callers (minProportion etc.) unlikely to hit this, but no guard exists. Add range check or document assumption.`

`L241-243 vs L143-153: 🟡 risk: isFloatLiteral() returns false for IntLiteral, but extractFloatLiteral() accepts IntLiteral. Semantic mismatch — caller doing if(isFloatLiteral(x)) extractFloatLiteral(x) works, but if(!isFloatLiteral(x)) doesn't mean extractFloatLiteral will throw. Consider isNumericLiteral() or rename to extractNumericAsFloat().`

`L9-28: 🔵 nit: binaryOpToString — L28 return "?" after switch covering all enum values. Compiler warns on missing enum case already. Replace with std::unreachable() (C++23) or SILO_ASSERT(false) to catch corruption instead of silently returning "?".`

`L143-153: 🔵 nit: extractFloatLiteral uses explicit throw while all other extract* fns use CHECK_SILO_QUERY macro. Inconsistent error-handling style. Use CHECK_SILO_QUERY with a combined holds_alternative check, or document why this one is different.`

`L56-71: 🔵 nit: FunctionCall::toString builds args string via repeated += concatenation. Fine for small arg lists. Consider fmt::join or std::ostringstream if arg counts grow. Not blocking.`

### ast.h

`L99-111: ❓ q: ExpressionVariant has 12 types. std::variant visit generates jump table — fine for correctness. Any profiling data on variant dispatch overhead in hot query paths? If toString() is debug-only, no concern. If extract* called per-row, might matter.`

`L122-130: 🔵 nit: extract* functions return by value (string, vector). Fine for move semantics. extractSetLiteral returns const ref — good. Consider returning std::string_view from extractIdentifierName/extractStringLiteral if callers don't need ownership (avoids copy).`

### ast_to_query.cpp (related — not in review scope but worth noting)

`ast_to_query.cpp:L74,184,284,287,376,401,411,737: 🟡 risk: static_cast<uint32_t>(extractIntLiteral(...)) — int64_t→uint32_t truncation. Negative values or values >UINT32_MAX silently wrap. Should validate range before cast. This is in the caller, not ast.cpp, but the pattern is pervasive and the AST could provide a safe extractUint32Literal().`

### Testing

`🟡 risk: No test file found for ast.h/ast.cpp (no ast.test.cpp). Extract functions have non-trivial logic (type coercion, date validation, set extraction). Unit tests needed — especially for extractFloatLiteral int→double edge cases, extractDateValue with invalid dates, and the isX/extractX semantic contract.`

---

## Good Practices

- SourceLocation in every error message — excellent for user-facing diagnostics
- `[[nodiscard]]` on all query functions — prevents silent discard bugs
- `ExpressionPtr` = unique_ptr — clear ownership, no leaks
- extractDateValue validates via stringToDate32 and propagates error string — thorough
- extractOptionalDateValue cleanly composes with NullLiteral check — nice pattern
- Variant-based AST avoids inheritance hierarchy — good modern C++ choice

---

## Verdict

Solid code. Main actionable items:
1. **isFloatLiteral/extractFloatLiteral semantic mismatch** — rename or add isNumericLiteral
2. **int64_t→double precision loss** — add guard in extractFloatLiteral
3. **No unit tests** — add ast.test.cpp
4. **binaryOpToString unreachable** — use std::unreachable() instead of return "?"
52 changes: 52 additions & 0 deletions 1238-review/ast_to_query.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
# Review: `src/silo/query_engine/saneql/ast_to_query.cpp`

## Bugs

L128-131: 🔴 bug: `LESS_THAN` and `LESS_EQUAL` both produce identical `FloatBetween(col, nullopt, value)`. `FloatBetween::compile` uses `Comparator::LESS` (strict `<`), so `x <= 5.0` silently becomes `x < 5.0`. Same issue L133-136: `GREATER_THAN` and `GREATER_EQUAL` both produce `FloatBetween(col, value, nullopt)` which compiles to `>=`, so `x > 5.0` becomes `x >= 5.0`. Fix: either add a `bool inclusive` flag to `FloatBetween`, or use epsilon adjustment, or add separate `FloatLessThan`/`FloatLessEqual` expressions.

L107: 🔴 bug: `value > 0 ? std::optional<uint32_t>(value - 1) : 0` — when `value == 0`, the ternary returns bare `0` (an `int`), not `std::optional<uint32_t>(0)`. This means `x < 0` on unsigned produces `IntBetween(col, nullopt, 0)` which matches rows where `col == 0` — should match nothing. Fix: return `std::nullopt` and wrap in a `False` expression, or return `IntBetween(col, nullopt, std::optional<uint32_t>{})` with both bounds empty then intersect with False.

L112-113: 🔴 bug: `GREATER_THAN` computes `value + 1` — unsigned overflow when `value == UINT32_MAX`. Produces `IntBetween(col, 0, nullopt)` which matches everything. Fix: check for `UINT32_MAX` and return `False` expression.

## Narrowing casts (int64_t → uint32_t without validation)

L74: 🟡 risk: `static_cast<uint32_t>(extractIntLiteral(value_expr))` — `extractIntLiteral` returns `int64_t`. Negative values or values > UINT32_MAX silently truncate. Fix: add range check + `IllegalQueryException`.

L184: 🟡 risk: same pattern in `convertComparisonToFilter`.

L284,287: 🟡 risk: same pattern in `handleBetween`.

L376,401,411: 🟡 risk: same pattern in `handleSymbolEquals`, `handleHasMutation`, `handleInsertionContains`.

L433: 🟡 risk: `static_cast<int>(extractIntLiteral(...))` for `nOf` count — `int64_t` → `int` truncation. Negative count also not validated.

L737: 🟡 risk: same pattern in `handleLimit`.

**Recommendation:** Extract a helper like `extractUint32Literal(expr, param_name)` that validates range `[0, UINT32_MAX]` and throws `IllegalQueryException` on out-of-range. Use everywhere. `getOptionalUint32` (function_registry.cpp:49) already checks `>= 0` but not `<= UINT32_MAX` — fix that too.

## Design

L471-472: ❓ q: bare identifier in filter context becomes `BoolEquals(name, true)`. Intentional? If user writes `.filter(some_column)` where `some_column` is a string column, they get a confusing runtime error instead of a clear "expected boolean expression" message. Consider checking column type or at least documenting this behavior.

L680,701: 🟡 risk: `handleMutations` and `handleInsertions` dispatch `Nucleotide` vs `AminoAcid` via `args.functionName() == "mutations"` / `"insertions"` string comparison. Fragile — rename the registered function and this silently breaks. Fix: register separate handlers, or use an enum/tag passed at registration time.

L723: 🔵 nit: `static_cast<uint32_t>(std::chrono::system_clock::now().time_since_epoch().count())` — truncates a 64-bit nanosecond count to 32 bits. Works as a seed but loses entropy. Consider `std::random_device` or at least cast from `steady_clock` which is monotonic.

## Edge cases

L152: 🟡 risk: `date_val.value() - 1` for `LESS_THAN` on dates — no underflow check. If `date_val` is the minimum representable date, this wraps. Same L160 for `date_val.value() + 1` overflow.

L377: Good — `position > 0` check exists for 1-indexed positions. But L411 (`handleInsertionContains`) has no such check — insertions may be 0-indexed but this should be documented or validated consistently.

## Style

L12-14: 🔵 nit: include order — `parser.h` (internal) appears before `<fmt/format.h>` and `<re2/re2.h>` (external). Per AGENTS.md include order: corresponding header → system → external → internal.

L958-959: 🔵 nit: lines exceed 100-char column limit (clang-format should catch this).

## Good practices

- Registry pattern clean and extensible — adding new functions is one line each.
- `BoundArguments` abstraction keeps handlers focused on semantics.
- Error messages include source locations — good for user-facing diagnostics.
- `CHECK_SILO_QUERY` used consistently for validation.
51 changes: 51 additions & 0 deletions 1238-review/database.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# PR #1238 Review: `src/silo/database.cpp`

## Summary

Migration from JSON-based query interface to SaneQL. Three main changes: `getFilteredBitmap` uses SaneQL parser, `getPrevalentMutations` builds SaneQL query string via `fmt::format`, `executeQueryAsArrowIpc` simplified to single `query_string` param. Overall direction good — removes 11 action/binder includes, simplifies code paths.

---

## Findings

### `getPrevalentMutations` (L276–L312)

L286-293: 🟡 risk: **SaneQL injection via `filter` parameter.** `filter` is user-provided string interpolated raw into SaneQL query via `fmt::format`. If `filter` contains `)` or `.` it can break out of `.filter(...)` and chain arbitrary pipeline operations. Example: `filter = "true).project({primaryKey})"` → `table.filter(true).project({primaryKey}).mutations(...)`. The SaneQL parser will happily parse this as a valid pipeline. Unlike the old code path which parsed `filter` as a standalone JSON expression, here it's string-concatenated into a larger query. Fix: parse `filter` into an AST separately (like `getFilteredBitmap` does at L265-267), then pass the filter expression object directly to the planner instead of round-tripping through string concatenation. Alternatively, use `planSaneqlQuery` only for the full pipeline and construct the `MutationsNode` directly as the old code did.

L286-293: 🔵 nit: `table_name` and `sequence_name` also interpolated raw. These come from internal callers (L320-321, L331) so lower risk, but same injection class. If any caller ever passes user-controlled table/sequence names, same problem applies.

L303: 🟡 risk: `result_stream >> json_line` reads whitespace-delimited tokens, not lines. Works today because `NdjsonSink` emits compact JSON (no spaces). But if NDJSON output ever includes spaces (e.g. string values with spaces like `"mutation":"A 123 T"`), `>>` will split one JSON object across multiple reads → parse failures. Fix: use `std::getline(result_stream, json_line)` instead.

L306: 🔵 nit: **Type mismatch.** `count` field is `ColumnType::INT32` in `MutationsNode` (mutations_node.h:84), serialized via `Int32Array`. Parsed here as `uint64_t`. Works for positive values but semantically wrong. Should be `int32_t` or `uint32_t` to match Arrow schema. Pre-existing issue (old code used `SymbolMutations::COUNT_FIELD_NAME` but same `uint64_t` type), so not a regression — but worth fixing while touching this code.

L305,307: 🔵 nit: Hardcoded `"count"` and `"mutation"` strings replace `SymbolMutations::COUNT_FIELD_NAME` / `MUTATION_FIELD_NAME` constants. Loses compile-time coupling — if field names change in `MutationsNode`, these will silently break at runtime. Consider using `operators::MutationsNode<SymbolType>::COUNT_FIELD_NAME` and `MUTATION_FIELD_NAME` constants (they're still defined in mutations_node.h:21,28).

L283-284: ✅ Good: `constexpr std::string_view` with `if constexpr`-style ternary for selecting mutation function name. Clean pattern.

### `getFilteredBitmap` (L254–L274)

L265-267: ❓ q: `getFilteredBitmap` parses `filter` as standalone SaneQL expression via `Parser` + `convertToFilter`. This is the safe approach (no injection possible — parser validates syntax, `convertToFilter` only accepts filter-context AST nodes). Why doesn't `getPrevalentMutations` use the same pattern? The asymmetry is suspicious.

L260-262: 🔵 nit: Table-not-found returns empty `Roaring{}` with `SPDLOG_ERROR`. Other methods (L185, L208, L234) throw `std::runtime_error`. Inconsistent error handling. Pre-existing, not introduced by this PR.

### `executeQueryAsArrowIpc` (L480–L496)

L480: ✅ Good: Clean simplification. Single `query_string` param, delegates to `planSaneqlQuery`. No injection concern here because caller passes full query — no string interpolation.

L489-493: ✅ Good: Proper Arrow status checking with descriptive error message.

### Includes (L1–L38)

L35-36: ✅ Good: Clean swap — 11 old action/binder includes replaced by 2 saneql includes. Reduces coupling.

L30: ❓ q: `filter/expressions/true.h` still included for `printAllData` (L192). Correct, just noting it's not dead.

---

## Verdict

**Main concern:** SaneQL injection in `getPrevalentMutations` (L286-293). `filter` is user-provided and interpolated raw into query string. Should parse filter separately like `getFilteredBitmap` does, or construct operator tree directly.

**Secondary:** `>>` vs `getline` for NDJSON parsing (L303) is fragile. Low risk today, time bomb tomorrow.

Rest of changes clean and well-structured.
51 changes: 51 additions & 0 deletions 1238-review/database_pyx.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# Review: `database.pyx` + `database.pxd` — PR #1238

## .pxd ↔ C++ Header Signature Match

✅ **`executeQueryAsArrowIpc`**: `.pxd` L31 declares `string executeQueryAsArrowIpc(string query_string)` — matches C++ `std::string executeQueryAsArrowIpc(const std::string& query_string) const` at `database.h:107`. Single-arg signature correct after `table_name` removal.

✅ **`getFilteredBitmap`**: `.pxd` L29 matches C++ L71. Two args: `table_name`, `filter`.

✅ **`getPrevalentNucMutations` / `getPrevalentAminoAcidMutations`**: `.pxd` L27-28 match C++ L81-93. Four args each.

✅ **Exception handling**: `.pxd` uses `except +handle_silo_exception` for query methods — correct pattern.

## Findings

### 🟡 risk — `database.pyx:L520`: Inconsistent IPC buffer conversion

`execute_query` passes `ipc_buffer` (C++ `std::string`) directly to `pa.BufferReader(ipc_buffer)`. Cython auto-converts `std::string` → Python `bytes`, so this *works*, but `get_tables` at L76 does explicit `(<char*> ipc_buffer.data())[:ipc_buffer.size()]` conversion. Inconsistent pattern. The explicit cast avoids an extra copy in some Cython versions. Pick one pattern, use everywhere.

### 🔵 nit — `database.pyx:L295`: Wrong return type in docstring

`get_amino_acid_reference_sequence` docstring says `Returns: str — The nucleotide reference sequence`. Should say "amino acid reference sequence". Copy-paste from `get_nucleotide_reference_sequence`.

### 🔵 nit — `database.pyx:L315-329,L373-391`: Docstrings don't mention SaneQL

`get_prevalent_nucleotide_mutations` and `get_prevalent_amino_acid_mutations` docstrings say `filter_expression : str, optional — Filter expression to apply (default: "")` but don't specify this is now SaneQL syntax. Compare with `get_filtered_bitmap` L440 which correctly says "SaneQL filter expression". Should be consistent across all filter-accepting methods.

### 🔵 nit — `database.pyx:L347,L405`: Comment says "True filter" — ambiguous

Comments say `# Default to True filter (returns all rows) if no filter specified`. After SaneQL migration, worth clarifying this is SaneQL `true` literal, not JSON `{"type": "True"}`. Same comment at L456.

### ✅ Good — Default filter `'true'` is valid SaneQL

`'true'` is valid SaneQL boolean literal. Default behavior preserved across JSON→SaneQL transition. No breaking change for callers using default.

### ✅ Good — `execute_query` docstring updated

L493-494 correctly documents SaneQL syntax with examples like `'sequences.filter(true)'`. This is the right pattern.

### ❓ q — `database.pyx:L431`: `get_filtered_bitmap` takes `table_name` separately

`get_filtered_bitmap` takes `table_name` as separate param + SaneQL `filter_expression`. But `execute_query` takes full SaneQL query where table name is embedded (e.g. `"sequences.filter(true)"`). Is this intentional asymmetry? `getFilteredBitmap` C++ signature confirms it takes separate `table_name` + `filter` — so this is correct, but the two APIs have different mental models. Worth a note in docstring that `filter_expression` here is just the filter part, not a full SaneQL query.

### ✅ Good — Breaking change handling

`executeQueryAsArrowIpc` signature change (removed `table_name`) is correctly reflected. Python `execute_query` now takes single `query_string` with table name embedded in SaneQL. This is a breaking change for Python API users who were passing table_name separately — but since this is a `!` (breaking) PR, that's expected.

## Summary

**No bugs found.** Signatures match. Default `'true'` valid. Two nits on docstrings (copy-paste error, missing SaneQL mention). One minor inconsistency in IPC buffer handling pattern. Code is clean and well-structured.

Verdict: **Ship it** (after docstring fixes).
Loading
Loading