Skip to content

Conversation

SharafMohamed
Copy link
Contributor

@SharafMohamed SharafMohamed commented Sep 19, 2025

Description

This PR updates schema-based search in CLP to leverage the new log_surgeon::Query class to improve accuracy on schema compressed archives. This class improves accuracy by performing DFA intersections between a user's query and the variables in the schema. When generating subqueries using this class the following must be done:

  • Check if each QueryInterpretation matches the logtype and variable dictionaries.
  • Transform the QueryInterpretation into a set of SubQuerys if it matches the dictionaries.

To compare against the dictionaries, CLP must consider the various combinations of encodable variable types a single QueryInterpretation can have. For example:

  • Search query: "a *1 *2 b",
  • QueryInterpretation (one of many): "a (*1) (*2) b"
  • Possible logtypes (for the above interpretation):
    • mask 00 -> "a \d \d b"
    • mask 01 ->"a \d \f b"
    • mask 10 ->"a \i \d b"
    • mask 11 ->"a \i \f b"

Key changes:

  • process_raw_query: updated to use the log_surgeon::Query class for DFA-based matching.
  • Added methods to support sub query generation:
    • generate_schema_sub_queries: iterate over all encodable combinations of interpretations and add a sub query if it matches the dictionaries.
    • generate_logtype_string: generates a logtype string for a given interpretation and encodable variable mask
    • get_wildcard_encodable_positions: converts mask for encodable variables to array positions.
    • process_schema_var_token: adds variable to the sub query.
  • Removes unused code: get_bounds_of_next_potential_var overload (schema case) and SearchToken.

Validation performed

  • Unit-tests covering each new method.

Summary by CodeRabbit

  • New Features
    • Schema-driven sub-query generation, log-type string construction and explicit wildcard-position handling (new normalization and generation helpers).
  • Refactor
    • Query-processing flow reorganized to normalize interpretations and run schema-based paths alongside existing heuristics.
  • Bug Fixes
    • Defaulted wildcard-match flag to avoid unpredictable sub-query initialization.
  • API
    • New helpers for wildcard/schema processing and equality operators for query types; testing hooks exposed.
  • Tests
    • Comprehensive in-memory unit tests, test doubles and public testing wrappers for schema/wildcard scenarios.
  • Chores
    • Enabled CLP_BUILD_TESTING for unit test builds.

@SharafMohamed SharafMohamed requested a review from a team as a code owner September 19, 2025 16:09
Copy link
Contributor

coderabbitai bot commented Sep 19, 2025

Walkthrough

Adds an interpretation-driven schema/wildcard workflow to GrepCore: normalizes query interpretations, identifies wildcard-encodable positions, and generates logtype strings; removes the ByteLexer-based get_bounds_of_next_potential_var API. Integrates schema-based sub-query generation into process_raw_query, exposes test hooks and fakes, and defaults a SubQuery wildcard flag.

Changes

Cohort / File(s) Summary
GrepCore schema/wildcard workflow
components/core/src/clp/GrepCore.cpp, components/core/src/clp/GrepCore.hpp
Replaces a ByteLexer-driven variable-bound parsing API with interpretation-driven functions. Adds normalize_interpretations, get_wildcard_encodable_positions, generate_logtype_string, templated generate_schema_sub_queries, and process_schema_var_token. Removes get_bounds_of_next_potential_var(..., ByteLexer&). Integrates schema-based sub-query generation into process_raw_query, and exposes test-only hooks under CLP_BUILD_TESTING.
Unit tests and test doubles
components/core/tests/test-GrepCore.cpp
Adds clp::GrepCoreTest test wrapper exposing new APIs, introduces fake dictionary/logtype entries (FakeVarDict, FakeLogTypeDict, entries) and numerous in-memory test helpers (builders for var/logtype dicts, interpretations, expected logtype strings, ByteLexer factory). Expands tests for encodability, logtype-string generation, variable processing, schema sub-query generation, and process_raw_query.
Query struct equality & init
components/core/src/clp/Query.hpp
Adds defaulted operator== for QueryVar and SubQuery. Initializes SubQuery::m_wildcard_match_required to false.
Build/test config
components/core/CMakeLists.txt
Defines CLP_BUILD_TESTING for the unitTest target via target_compile_definitions to enable test-only friend/access.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant GrepCore
  participant Parser as WildcardParser
  participant LTD as LogTypeDict
  participant VD as VariableDict

  Caller->>GrepCore: process_raw_query(raw_query, heuristics, dicts, ignore_case)
  GrepCore->>Parser: parse -> QueryInterpretations
  GrepCore->>GrepCore: normalize_interpretations(interpretations)
  alt schema-driven path
    loop per interpretation
      GrepCore->>GrepCore: get_wildcard_encodable_positions(interpretation)
      GrepCore->>GrepCore: generate_logtype_string(interpretation, wildcardPositions, mask)
      GrepCore->>LTD: lookup generated logtype string
      loop per variable token
        GrepCore->>VD: process_schema_var_token(token, var_dict, ignore_case, is_mask_encoded)
        VD-->>GrepCore: variable match(es)
      end
      GrepCore-->>Caller: emit SubQuery(s)
    end
  else legacy/heuristic path
    GrepCore-->>Caller: fallback processing (existing heuristic logic)
  end
  note right of GrepCore: process_schema_var_token handles encoded vs non-encoded tokens and wildcard masking
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title clearly and concisely summarises the primary change, highlighting the switch to using log_surgeon::Query to improve schema-based search accuracy. It directly reflects the main objective of the pull request without extraneous detail. The phrasing is specific enough that a reviewer or team member can immediately understand the feature’s intent. It follows a conventional commit style while remaining readable and focused.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7dcc43 and ed13886.

📒 Files selected for processing (1)
  • components/core/src/clp/Query.hpp (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/Query.hpp
🧬 Code graph analysis (1)
components/core/src/clp/Query.hpp (1)
components/core/src/clp_s/search/ast/ColumnDescriptor.cpp (2)
  • rhs (105-110)
  • rhs (105-105)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: package-image
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (3)
components/core/src/clp/Query.hpp (3)

28-29: LGTM! Follows modern C++ best practices.

The defaulted equality operator is well-formed and appropriate. Using = default provides efficient memberwise comparison, and the parameter naming (rhs) correctly reflects that *this is the left-hand side.


80-81: LGTM! Appropriate for testing infrastructure.

The defaulted equality operator correctly compares all members, including computed state like m_ids_of_matching_segments. This is appropriate for the testing infrastructure mentioned in the PR objectives.


162-162: LGTM! Explicit initialization ensures consistent default state.

Explicitly initializing m_wildcard_match_required to false is good practice and aligns with the expectations of the schema-based processing logic.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b83e3cf and add11a2.

📒 Files selected for processing (6)
  • components/core/src/clp/GrepCore.cpp (2 hunks)
  • components/core/src/clp/GrepCore.hpp (5 hunks)
  • components/core/src/clp/ffi/ir_stream/search/test/test_QueryHandlerImpl.cpp (1 hunks)
  • components/core/src/clp/ffi/ir_stream/search/test/utils.cpp (1 hunks)
  • components/core/tests/test-GrepCore.cpp (1 hunks)
  • taskfiles/deps/main.yaml (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/ffi/ir_stream/search/test/test_QueryHandlerImpl.cpp
  • components/core/src/clp/ffi/ir_stream/search/test/utils.cpp
  • components/core/src/clp/GrepCore.cpp
  • components/core/tests/test-GrepCore.cpp
  • components/core/src/clp/GrepCore.hpp
🧠 Learnings (3)
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
PR: y-scope/clp#557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.

Applied to files:

  • components/core/src/clp/ffi/ir_stream/search/test/test_QueryHandlerImpl.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
PR: y-scope/clp#570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.

Applied to files:

  • components/core/src/clp/ffi/ir_stream/search/test/test_QueryHandlerImpl.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/src/clp/ffi/ir_stream/search/test/utils.cpp
🧬 Code graph analysis (3)
components/core/src/clp/GrepCore.cpp (2)
components/core/src/clp/GrepCore.hpp (2)
  • interpretation (193-195)
  • interpretation (207-210)
components/core/src/clp/EncodedVariableInterpreter.hpp (5)
  • add_dict_var (258-263)
  • add_dict_var (533-546)
  • add_dict_var (533-538)
  • encoded_var (52-52)
  • encoded_var (111-111)
components/core/tests/test-GrepCore.cpp (1)
components/core/src/clp/GrepCore.cpp (2)
  • get_bounds_of_next_potential_var (28-146)
  • get_bounds_of_next_potential_var (28-33)
components/core/src/clp/GrepCore.hpp (4)
components/core/src/clp/QueryToken.cpp (2)
  • is_var (137-145)
  • is_var (137-137)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (58-60)
  • logtype (58-58)
  • logtype (66-68)
  • logtype (66-66)
  • logtype (74-76)
  • logtype (74-74)
  • logtype (82-84)
  • logtype (82-82)
  • wildcard_search_dictionary_and_get_encoded_matches (218-223)
  • wildcard_search_dictionary_and_get_encoded_matches (486-511)
  • wildcard_search_dictionary_and_get_encoded_matches (486-491)
  • encoded_var (52-52)
  • encoded_var (111-111)
components/core/src/clp/GrepCore.cpp (4)
  • get_wildcard_encodable_positions (148-164)
  • get_wildcard_encodable_positions (148-149)
  • generate_logtype_string (166-233)
  • generate_logtype_string (166-169)
components/core/src/glt/Query.hpp (1)
  • logtype_entries (103-105)
🔇 Additional comments (13)
components/core/src/clp/ffi/ir_stream/search/test/utils.cpp (1)

15-15: LGTM! Required header for fmt::join usage.

The addition of <fmt/ranges.h> is necessary to enable the fmt::join function used on line 64. This aligns with the fmt 11.x upgrade.

components/core/src/clp/ffi/ir_stream/search/test/test_QueryHandlerImpl.cpp (1)

15-15: LGTM! Consistent header addition for fmt::join.

The addition of <fmt/ranges.h> enables fmt::join functionality on line 237, consistent with the fmt 11.x upgrade.

components/core/tests/test-GrepCore.cpp (2)

12-14: Good use of brace initialization for consistency.

The change to brace initialization enhances code consistency and prevents narrowing conversions.


79-99: Wildcard handling changes are intentional.

GrepCore documents a fallback to "decompression + wildcard matching" for tokens with greedy wildcards (components/core/src/clp/GrepCore.hpp ~lines 472–476), and the updated tests in components/core/tests/test-GrepCore.cpp (wildcard case starting ~line 75) reflect the DFA-based behaviour: escaped '' now delimits "1" while unescaped '' is preserved in "abc*123".

components/core/src/clp/GrepCore.hpp (3)

130-182: Well-documented schema sub-query generation logic.

The documentation clearly explains the complex logic for handling encodable wildcard variables and the binary choices for interpretation. The example effectively illustrates the 2^k possible combinations.


336-346: Verify thread-safety of static variables in GrepCore.hpp

interpretations_generated and interpretations (components/core/src/clp/GrepCore.hpp: lines 336–346) are static and can race under concurrent access. The verification script produced no output, so concurrency of this code path or existing synchronization could not be confirmed. Confirm whether this runs on multiple threads; if so, replace static state with per-call/per-archive state, use thread_local, or guard access with a mutex/atomic.


613-614: Verify SymbolId underlying type and make casts consistent.

SymbolId is defined in external log_surgeon — confirm its underlying integer type and align casts: GrepCore.hpp:613–614 and GrepCore.cpp:156–157,196–197 use static_cast<uint32_t>(TokenInt/TokenFloat) while Utils.cpp (≈140–156) uses static_cast(log_surgeon::SymbolId::TokenInt/TokenFloat). Match the cast to the enum's underlying type (or change var_type) to avoid subtle mismatches.

taskfiles/deps/main.yaml (3)

443-447: Approved — spdlog v1.15.3 is compatible with fmt v11.2.0.
Release notes show spdlog 1.15.3 bumped the bundled {fmt} to 11.2.0, so no changes required.


253-254: Verify fmt 11.2.0 compatibility across the codebase.

  • Static scan shows multiple fmt call sites; representative hits: components/core/src/clp_s/Utils.cpp:81, components/core/src/clp_s/kv_ir_search.cpp:175, components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:31,58, components/core/tests/test-clp_s-search.cpp:117,215, components/core/src/clp_s/JsonConstructor.cpp:27.
  • ast-grep returned no exact matches for custom formatters (no template<> struct fmt::formatter<...> found).
  • Action: build the project with fmt 11.2.0 and run all unit/integration tests; if you get compiler or test failures, update affected call sites or add/update any custom formatters (also search for partial specializations or ADL-based formatters) and re-run tests.

320-321: Confirm log_surgeon update (a722d07) includes QueryInterpretation.
a722d07 tarball contains src/log_surgeon/wildcard_query_parser/QueryInterpretation.hpp and QueryInterpretation.cpp (also Query.hpp/Query.cpp and tests/test-query-interpretation.cpp), so the log_surgeon::wildcard_query_parser::QueryInterpretation referenced by components/core/src/clp/GrepCore.* is present.

components/core/src/clp/GrepCore.cpp (3)

18-26: LGTM: token/type aliases are clear and scoped

The using-declarations improve readability without polluting too broadly. No issues spotted.


213-230: LGTM: representable-int/float fast-path with dictionary fallback

The detection path is clear and orders checks correctly; encoded_var capture is fine as an out-param.


199-211: C++20 is already enforced for the target that builds this TU — no action required.

The clp target sets target_compile_features(... PRIVATE cxx_std_20) (components/core/src/clp/clp/CMakeLists.txt:172) and other CLP targets likewise use cxx_std_20, so std::unordered_map::contains is supported.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between add11a2 and c82dbab.

📒 Files selected for processing (3)
  • components/core/src/clp/GrepCore.hpp (5 hunks)
  • components/core/tests/test-GrepCore.cpp (1 hunks)
  • taskfiles/deps/main.yaml (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/tests/test-GrepCore.cpp
  • components/core/src/clp/GrepCore.hpp
🧬 Code graph analysis (2)
components/core/tests/test-GrepCore.cpp (1)
components/core/src/clp/GrepCore.cpp (2)
  • get_bounds_of_next_potential_var (28-146)
  • get_bounds_of_next_potential_var (28-33)
components/core/src/clp/GrepCore.hpp (3)
components/core/src/clp/DictionaryReader.hpp (1)
  • search_string (94-94)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • wildcard_search_dictionary_and_get_encoded_matches (222-227)
  • wildcard_search_dictionary_and_get_encoded_matches (497-522)
  • wildcard_search_dictionary_and_get_encoded_matches (497-502)
  • encoded_var (57-57)
  • encoded_var (116-116)
components/core/src/clp/GrepCore.cpp (4)
  • get_wildcard_encodable_positions (148-164)
  • get_wildcard_encodable_positions (148-149)
  • generate_logtype_string (166-233)
  • generate_logtype_string (166-169)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (7)
components/core/tests/test-GrepCore.cpp (2)

13-16: LGTM: test initialisation simplified.

Brace‑initialising begin/end/is_var improves clarity; signature update aligns with the header.


21-101: LGTM: expectations preserved after signature change.

The call sites correctly drop the lexer argument and still validate tokenisation semantics.

If CI allows, add one integration test covering the new interpretation‑driven path in process_raw_query to ensure schema‑based subquery generation remains sound.

components/core/src/clp/GrepCore.hpp (2)

184-195: Fix typos in documentation.

“indicies” → “indices”; also remove the duplicated spacing.

-     * Scans the interpretation and returns the indicies of all encodable     * wildcard variables.
+     * Scans the interpretation and returns the indices of all encodable wildcard variables.
@@
-     * @return A vector of positions of encodabe wildcard variables.
+     * @return A vector of positions of encodable wildcard variables.

540-542: Replace magic number 32 with a named constant.

Carry forward the earlier feedback and make the limit self-documenting.

-        if (wildcard_encodable_positions.size() > 32) {
+        if (wildcard_encodable_positions.size() > cMaxEncodableWildcardVariables) {
             throw std::runtime_error("Too many encodable variables.");
         }

Add near the class’ private section (example placement shown):

// Place inside class GrepCore, private: section
static constexpr size_t cMaxEncodableWildcardVariables{32};
taskfiles/deps/main.yaml (3)

443-448: spdlog–fmt ABI compatibility — verify in CI

v1.15.3 bundles fmt 11.2.0 and is compatible with fmt 11.x; run a minimal CI build compiling with -DSPDLOG_FMT_EXTERNAL=ON against fmt 11.2.0 (or the latest 11.x) to confirm no platform/toolchain-specific issues.


253-255: SHA256 verified — run CI to confirm build compatibility

  • taskfiles/deps/main.yaml (lines 253–255): TARBALL_SHA256 matches the downloaded fmt 11.2.0 tarball: bc23066d87ab3168f27cef3e97d545fa63314f5c79df5ea444d41d56f962c6af.
  • Repository shows widespread fmt usage (find_package(fmt REQUIRED), fmt::fmt target, many fmt::format/format_to/formatter/includes).
  • Cannot compile here; run CI or a local full build to ensure all consumers compile cleanly with fmt 11.2.0.

320-322: APIs present in commit a722d07 — OK to pin.

Query::get_all_multi_token_interpretations is in src/log_surgeon/wildcard_query_parser/Query.hpp; QueryInterpretation, VariableQueryToken and StaticQueryToken headers are present under src/log_surgeon/wildcard_query_parser/; SymbolId enum is in src/log_surgeon/Constants.hpp. No missing headers or symbols found.

Comment on lines +630 to +644
encoded_variable_t encoded_var{};
if ((is_int
&& EncodedVariableInterpreter::convert_string_to_representable_integer_var(
raw_string,
encoded_var
))
|| (is_float
&& EncodedVariableInterpreter::convert_string_to_representable_float_var(
raw_string,
encoded_var
)))
{
sub_query.add_non_dict_var(encoded_var);
return true;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this handle when the EncodedVariableInterpreter fails? It just seems to continue, is that right?

Copy link
Contributor Author

@SharafMohamed SharafMohamed Sep 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, continuing is the desired behavior.

The idea is to treat it as a dictionary variable if it fails. In the heuristic case these methods succeeding is exclusively what tells you its an integer/float.

In the schema case we can first check if the schema variable type matches, but we still need to check if the int/float is encodable (i.e., its not too long or the user didn't name some malformed regex int/float).

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c82dbab and 9d3e152.

📒 Files selected for processing (4)
  • components/core/src/clp/GrepCore.cpp (2 hunks)
  • components/core/src/clp/GrepCore.hpp (6 hunks)
  • components/core/src/clp/Query.hpp (1 hunks)
  • components/core/tests/test-GrepCore.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/Query.hpp
  • components/core/src/clp/GrepCore.hpp
  • components/core/tests/test-GrepCore.cpp
  • components/core/src/clp/GrepCore.cpp
🧠 Learnings (1)
📚 Learning: 2025-09-11T13:15:39.930Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:67-72
Timestamp: 2025-09-11T13:15:39.930Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, when using sizeof with types like variable_dictionary_id_t, always use the fully qualified namespace (clp::variable_dictionary_id_t) to match the variable declaration and maintain consistency with other column writers in the codebase.

Applied to files:

  • components/core/src/clp/GrepCore.cpp
🧬 Code graph analysis (3)
components/core/src/clp/GrepCore.hpp (3)
components/core/tests/test-GrepCore.cpp (10)
  • interpretation (48-51)
  • interpretation (48-49)
  • interpretation (53-58)
  • interpretation (53-56)
  • logtype (207-216)
  • logtype (207-208)
  • logtype (218-228)
  • logtype (218-222)
  • generate_schema_sub_queries (81-94)
  • generate_schema_sub_queries (81-86)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • wildcard_search_dictionary_and_get_encoded_matches (222-227)
  • wildcard_search_dictionary_and_get_encoded_matches (497-522)
  • wildcard_search_dictionary_and_get_encoded_matches (497-502)
  • encoded_var (57-57)
  • encoded_var (116-116)
components/core/src/clp/GrepCore.cpp (4)
  • get_wildcard_encodable_positions (148-164)
  • get_wildcard_encodable_positions (148-149)
  • generate_logtype_string (166-233)
  • generate_logtype_string (166-169)
components/core/tests/test-GrepCore.cpp (4)
components/core/src/clp/GrepCore.hpp (9)
  • interpretation (197-199)
  • interpretation (211-214)
  • process_schema_var_token (235-241)
  • process_schema_var_token (606-668)
  • process_schema_var_token (606-612)
  • generate_schema_sub_queries (178-185)
  • generate_schema_sub_queries (535-603)
  • generate_schema_sub_queries (535-541)
  • value (75-80)
components/core/src/clp/GrepCore.cpp (6)
  • get_wildcard_encodable_positions (148-164)
  • get_wildcard_encodable_positions (148-149)
  • generate_logtype_string (166-233)
  • generate_logtype_string (166-169)
  • get_bounds_of_next_potential_var (28-146)
  • get_bounds_of_next_potential_var (28-33)
components/core/src/clp/EncodedVariableInterpreter.hpp (14)
  • id (56-56)
  • value (97-100)
  • value (107-110)
  • add_dict_var (266-271)
  • add_dict_var (548-561)
  • add_dict_var (548-553)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
components/core/src/clp/Query.cpp (2)
  • add_dict_var (102-105)
  • add_dict_var (103-103)
components/core/src/clp/GrepCore.cpp (2)
components/core/src/clp/GrepCore.hpp (2)
  • interpretation (197-199)
  • interpretation (211-214)
components/core/tests/test-GrepCore.cpp (7)
  • interpretation (48-51)
  • interpretation (48-49)
  • interpretation (53-58)
  • interpretation (53-56)
  • add_int_var (175-175)
  • add_float_var (177-177)
  • add_dict_var (179-179)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (15)
components/core/src/clp/Query.hpp (1)

158-159: Deterministic default for wildcard flag — good.

Default-initializing m_wildcard_match_required to false prevents UB from uninitialized reads. Please confirm SubQuery::clear() also resets this flag to false to maintain invariants across reuse.

components/core/tests/test-GrepCore.cpp (1)

590-718: Solid coverage for schema subquery generation.

Good end-to-end assertions on wildcard-encodable combinations, variable shapes, and logtype IDs.

components/core/src/clp/GrepCore.hpp (8)

9-14: Add missing include for std::set.

This header declares std::set in public APIs; include to avoid relying on transitive includes.

Apply:

 #include <string_view>
+#include <set>
 #include <unordered_map>
 #include <unordered_set>
 #include <vector>

188-196: Typos in documentation.

Fix “indicies” → “indices” and “encodabe” → “encodable”.

-     * Scans the interpretation and returns the indicies of all encodable wildcard variables.
+     * Scans the interpretation and returns the indices of all encodable wildcard variables.
...
-     * @return A vector of positions of encodabe wildcard variables.
+     * @return A vector of positions of encodable wildcard variables.

207-213: Typo in documentation.

“interpetation” → “interpretation”.

-     * @param interpretation The interpetation to convert to a logtype string.
+     * @param interpretation The interpretation to convert to a logtype string.

288-294: Follow project style: use false == expr instead of !expr.

Aligns with coding guidelines.

As per coding guidelines

-            if (!query_token.has_greedy_wildcard_in_middle() && query_token.is_ambiguous_token()) {
+            if (false == query_token.has_greedy_wildcard_in_middle()
+                && query_token.is_ambiguous_token())
+            {

342-352: Remove thread-unsafe statics; compute interpretations per call (or add a proper cache).

Function-local statics here are shared across calls and archives, race-prone, and nullified by cExecuteForEveryArchive. Construct Query and get interpretations per call.

-        static bool interpretations_generated{false};
-        static std::set<log_surgeon::wildcard_query_parser::QueryInterpretation> interpretations;
-
-        // TODO: This needs to be done for every archive until we have per schema logic.
-        constexpr bool cExecuteForEveryArchive{true};
-        if (cExecuteForEveryArchive || false == interpretations_generated) {
-            log_surgeon::wildcard_query_parser::Query const query(search_string);
-            interpretations.clear();
-            interpretations = query.get_all_multi_token_interpretations(lexer);
-            interpretations_generated = true;
-        }
+        log_surgeon::wildcard_query_parser::Query const query(search_string);
+        auto interpretations = query.get_all_multi_token_interpretations(lexer);

If caching is required, introduce a schema-keyed cache with synchronisation.


353-356: Fix comment typo “Transfrom”.

-        // Transfrom log-surgeon interpretations into CLP sub-queries.
+        // Transform log-surgeon interpretations into CLP sub-queries.

543-551: Avoid undefined/implementation-defined shifts; extract magic number.

  • size_t shifting by variable bit width is UB on 32-bit; use uint64_t and guard.
  • Replace magic 32 with a named constant.
-        if (wildcard_encodable_positions.size() > 32) {
-            throw std::runtime_error("Too many encodable variables.");
-        }
-        size_t const num_combos{static_cast<size_t>(1) << wildcard_encodable_positions.size()};
-        for (size_t mask{0}; mask < num_combos; ++mask) {
+        constexpr size_t cMaxEncodableWildcardVariables{32};
+        auto const k = wildcard_encodable_positions.size();
+        if (k > cMaxEncodableWildcardVariables) {
+            throw std::runtime_error("Too many encodable variables.");
+        }
+        if (k >= 64) {
+            throw std::runtime_error("Wildcard encodable variable count exceeds mask width.");
+        }
+        uint64_t const num_combos{1ULL << k};
+        for (uint64_t mask{0}; mask < num_combos; ++mask) {

549-552: Reduce per-iteration allocations.

Building an unordered_map per mask is expensive. Use a flat container (e.g., vector<uint8_t>) keyed by local index, or reuse a pre-sized map.

Example approach:

  • Precompute vector<size_t> positions = wildcard_encodable_positions.
  • For each mask, fill vector<uint8_t> encoded(k) with (mask >> i) & 1 and consult positions[i] inside generate_logtype_string via a small adapter.
components/core/src/clp/GrepCore.cpp (5)

6-7: Include for std::holds_alternative/std::get.

Avoid relying on transitive includes.

 #include <string_view>
+#include <variant>
 #include <unordered_map>
 #include <vector>

155-158: Prefer strongly typed enum comparisons.

Compare as SymbolId, not via uint32_t.

-    auto const var_type{variable_token.get_variable_type()};
-    bool const is_int{static_cast<uint32_t>(TokenInt) == var_type};
-    bool const is_float{static_cast<uint32_t>(TokenFloat) == var_type};
+    auto const var_symbol = static_cast<decltype(TokenInt)>(variable_token.get_variable_type());
+    bool const is_int   = (var_symbol == TokenInt);
+    bool const is_float = (var_symbol == TokenFloat);

172-181: Reserve logic assumes one byte per placeholder.

If EncodedVariableInterpreter ever appends more than one byte, this under-reserves. Consider exposing/using a constexpr placeholder size or asserting the invariant.


184-185: Fix comment typo.

“Generete” → “Generate”.

-    // Generete `logtype_string`.
+    // Generate `logtype_string`.

196-198: Prefer strongly typed enum comparisons.

Repeat: compare as SymbolId, not via uint32_t.

-    bool const is_int{static_cast<uint32_t>(TokenInt) == var_type};
-    bool const is_float{static_cast<uint32_t>(TokenFloat) == var_type};
+    bool const is_int   = (var_type == TokenInt);
+    bool const is_float = (var_type == TokenFloat);

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d3e152 and b587085.

📒 Files selected for processing (1)
  • components/core/src/clp/GrepCore.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/GrepCore.cpp
🧠 Learnings (1)
📚 Learning: 2025-09-11T13:15:39.930Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:67-72
Timestamp: 2025-09-11T13:15:39.930Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, when using sizeof with types like variable_dictionary_id_t, always use the fully qualified namespace (clp::variable_dictionary_id_t) to match the variable declaration and maintain consistency with other column writers in the codebase.

Applied to files:

  • components/core/src/clp/GrepCore.cpp
🧬 Code graph analysis (1)
components/core/src/clp/GrepCore.cpp (3)
components/core/src/clp/GrepCore.hpp (2)
  • interpretation (197-199)
  • interpretation (211-214)
components/core/tests/test-GrepCore.cpp (7)
  • interpretation (48-51)
  • interpretation (48-49)
  • interpretation (53-58)
  • interpretation (53-56)
  • add_int_var (175-175)
  • add_float_var (177-177)
  • add_dict_var (179-179)
components/core/src/clp/EncodedVariableInterpreter.hpp (5)
  • add_dict_var (266-271)
  • add_dict_var (548-561)
  • add_dict_var (548-553)
  • encoded_var (57-57)
  • encoded_var (116-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: package-image
🔇 Additional comments (5)
components/core/src/clp/GrepCore.cpp (5)

19-23: LGTM: explicit SymbolId enumerators.

Bringing TokenInt/TokenFloat into scope is clear and consistent with usage below.


25-26: LGTM: local aliases for STL types.

Keeps signatures succinct without hiding ownership semantics.


173-183: Preallocation assumes add_*_var appends one character.

If add_int_var/add_float_var/add_dict_var ever append more than one byte, this under-reserves. Either expose constexpr sizes or assert the invariant.

Can you confirm each add_*_var appends exactly one byte today?


185-185: Fix typo in comment.

“Generete” -> “Generate”.

-    // Generete `logtype_string`.
+    // Generate `logtype_string`.

167-232: Cache interpretation.get_logtype(), iterate by const& and replace contains+at with a single find()
Removes double map lookup and drops the C++20 unordered_map::contains dependency; verify your CMakeLists.txt C++ standard to ensure C++17 compatibility.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b587085 and d7cfb0d.

📒 Files selected for processing (2)
  • components/core/src/clp/GrepCore.cpp (2 hunks)
  • components/core/src/clp/GrepCore.hpp (6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/GrepCore.hpp
  • components/core/src/clp/GrepCore.cpp
🧠 Learnings (1)
📚 Learning: 2025-09-11T13:15:39.930Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:67-72
Timestamp: 2025-09-11T13:15:39.930Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, when using sizeof with types like variable_dictionary_id_t, always use the fully qualified namespace (clp::variable_dictionary_id_t) to match the variable declaration and maintain consistency with other column writers in the codebase.

Applied to files:

  • components/core/src/clp/GrepCore.hpp
  • components/core/src/clp/GrepCore.cpp
🧬 Code graph analysis (2)
components/core/src/clp/GrepCore.hpp (3)
components/core/tests/test-GrepCore.cpp (10)
  • interpretation (48-51)
  • interpretation (48-49)
  • interpretation (53-58)
  • interpretation (53-56)
  • logtype (207-216)
  • logtype (207-208)
  • logtype (218-228)
  • logtype (218-222)
  • generate_schema_sub_queries (81-94)
  • generate_schema_sub_queries (81-86)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • wildcard_search_dictionary_and_get_encoded_matches (222-227)
  • wildcard_search_dictionary_and_get_encoded_matches (497-522)
  • wildcard_search_dictionary_and_get_encoded_matches (497-502)
  • encoded_var (57-57)
  • encoded_var (116-116)
components/core/src/clp/GrepCore.cpp (4)
  • get_wildcard_encodable_positions (149-165)
  • get_wildcard_encodable_positions (149-150)
  • generate_logtype_string (167-232)
  • generate_logtype_string (167-170)
components/core/src/clp/GrepCore.cpp (3)
components/core/src/clp/GrepCore.hpp (2)
  • interpretation (198-200)
  • interpretation (212-215)
components/core/tests/test-GrepCore.cpp (5)
  • interpretation (48-51)
  • interpretation (48-49)
  • interpretation (53-58)
  • interpretation (53-56)
  • add_dict_var (179-179)
components/core/src/clp/EncodedVariableInterpreter.hpp (5)
  • add_dict_var (266-271)
  • add_dict_var (548-561)
  • add_dict_var (548-553)
  • encoded_var (57-57)
  • encoded_var (116-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
🔇 Additional comments (13)
components/core/src/clp/GrepCore.hpp (10)

7-7: Good fix: add missing include.

Prevents reliance on transitive includes.


212-215: LGTM: clear API for logtype-string generation.


235-243: LGTM: helper API surface reads well.


293-297: LGTM: project style honoured (false == …).


624-627: Confirm intent: encoded-wildcard path adds no var constraint.

Early-return marks wildcard_match_required without adding an encoded var. If that’s intentional (accept any encoded value and defer to message-level matching), fine; otherwise, add a non-dict var placeholder for clarity/consistency.


28-32: Gate test-only friend behind a build flag.

Expose the friend only in test builds to avoid leaking test details in public headers.

- class GrepCoreTest;
+ #ifdef CLP_ENABLE_TESTS
+ class GrepCoreTest;
+ #endif
...
 class GrepCore {
-    friend class GrepCoreTest;
+  #ifdef CLP_ENABLE_TESTS
+    friend class GrepCoreTest;
+  #endif

345-355: Remove unsafe function-local statics; compute interpretations per call.

Statics here are thread-unsafe and nullified by cExecuteForEveryArchive{true}. Build the interpretations locally.

-        static bool interpretations_generated{false};
-        static std::set<log_surgeon::wildcard_query_parser::QueryInterpretation> interpretations;
-
-        // TODO: This needs to be done for every archive until we have per schema logic.
-        constexpr bool cExecuteForEveryArchive{true};
-        if (cExecuteForEveryArchive || false == interpretations_generated) {
-            log_surgeon::wildcard_query_parser::Query const query(search_string);
-            interpretations.clear();
-            interpretations = query.get_all_multi_token_interpretations(lexer);
-            interpretations_generated = true;
-        }
+        log_surgeon::wildcard_query_parser::Query const query(search_string);
+        auto interpretations = query.get_all_multi_token_interpretations(lexer);

356-356: Fix typo: “Transfrom” → “Transform”.

-        // Transfrom log-surgeon interpretations into CLP sub-queries.
+        // Transform log-surgeon interpretations into CLP sub-queries.

545-556: Fix undefined/implementation-defined shifts; use 64-bit masks and guard k.

Shifting size_t by its bit width is UB on 32-bit. Also 2^32 combos is infeasible.

-        size_t const num_combos{static_cast<size_t>(1) << wildcard_encodable_positions.size()};
-        for (size_t mask{0}; mask < num_combos; ++mask) {
+        auto const k = wildcard_encodable_positions.size();
+        if (k >= 64) {
+            throw std::runtime_error("Too many encodable variables for mask enumeration (k >= 64).");
+        }
+        uint64_t const num_combos{1ULL << k};
+        for (uint64_t mask{0}; mask < num_combos; ++mask) {
             std::unordered_map<size_t, bool> wildcard_mask_map;
-            for (size_t i{0}; i < wildcard_encodable_positions.size(); ++i) {
-                wildcard_mask_map[wildcard_encodable_positions[i]] = mask >> i & 1;
+            for (size_t i{0}; i < k; ++i) {
+                wildcard_mask_map[wildcard_encodable_positions[i]] = (mask >> i) & 1ULL;
             }

Additionally, consider lowering the practical cap from 32 to something manageable (e.g., 20) to avoid 2^k explosion.


609-623: Use strongly typed enum comparisons for SymbolId.

Avoid uint32_t-based comparisons; compare as SymbolId.

-    auto const var_type{variable_token.get_variable_type()};
-    bool const is_int{static_cast<uint32_t>(log_surgeon::SymbolId::TokenInt) == var_type};
-    bool const is_float{static_cast<uint32_t>(log_surgeon::SymbolId::TokenFloat) == var_type};
+    auto const var_symbol = static_cast<log_surgeon::SymbolId>(variable_token.get_variable_type());
+    bool const is_int   = (var_symbol == log_surgeon::SymbolId::TokenInt);
+    bool const is_float = (var_symbol == log_surgeon::SymbolId::TokenFloat);
components/core/src/clp/GrepCore.cpp (3)

6-8: LGTM: explicit STL includes (unordered_map, variant, vector).


19-23: LGTM: using-declarations improve readability and locality.


149-165: Avoid variant copies and repeated get_logtype() calls.

Cache the logtype and iterate by const reference; reserve capacity.

-auto GrepCore::get_wildcard_encodable_positions(QueryInterpretation const& interpretation)
-        -> vector<size_t> {
-    vector<size_t> wildcard_encodable_positions;
-    for (size_t i{0}; i < interpretation.get_logtype().size(); ++i) {
-        auto const token{interpretation.get_logtype()[i]};
+auto GrepCore::get_wildcard_encodable_positions(QueryInterpretation const& interpretation)
+        -> vector<size_t> {
+    auto const& logtype = interpretation.get_logtype();
+    vector<size_t> wildcard_encodable_positions;
+    wildcard_encodable_positions.reserve(logtype.size());
+    for (size_t i{0}; i < logtype.size(); ++i) {
+        auto const& token = logtype[i];

…esting code for DP algo; Add final unit-test for testing PR end-to-end; Still lots of edge cases and sub-steps that could be tested more rigorously.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (6)
components/core/src/clp/GrepCore.hpp (5)

191-197: Fix typos in docs.

“than” → “that”; “encodabe” → “encodable”.

-     * An encodable variable is a variable token than:
+     * An encodable variable is a variable token that:
@@
-     * @return A vector of positions of encodabe wildcard variables.
+     * @return A vector of positions of encodable wildcard variables.

28-32: Gate test‑only friend with a build flag.

Avoid leaking test details to consumers; wrap the friend in CLP_BUILD_TESTING (or your project’s test macro).

-class GrepCoreTest;
+ #if defined(CLP_BUILD_TESTING)
+ class GrepCoreTest;
+ #endif

 class GrepCore {
-    friend class GrepCoreTest;
+  #if defined(CLP_BUILD_TESTING)
+    friend class GrepCoreTest;
+  #endif

544-547: Avoid per‑iteration unordered_map allocations for the mask.

This map is rebuilt for every mask; switch to a flat mask (vector<uint8_t>/bitset) or add an overload of generate_logtype_string that takes (positions, mask).

-            std::unordered_map<size_t, bool> wildcard_mask_map;
-            for (size_t i{0}; i < wildcard_encodable_positions.size(); ++i) {
-                wildcard_mask_map[wildcard_encodable_positions[i]] = mask >> i & 1ULL;
-            }
+            // Lightweight mask: local index -> bit
+            std::vector<uint8_t> mask_bits(wildcard_encodable_positions.size());
+            for (size_t i{0}; i < wildcard_encodable_positions.size(); ++i) {
+                mask_bits[i] = static_cast<uint8_t>((mask >> i) & 1ULL);
+            }
+            // Prefer an overload: generate_logtype_string(interpretation, wildcard_encodable_positions, mask_bits)

If changing signature isn’t feasible now, at least reserve:

+            wildcard_mask_map.reserve(wildcard_encodable_positions.size());

536-543: Cap combinatorial explosion (2^k) to a practical limit.

k=32 allows 4,294,967,296 combinations; this can hang a search. Lower the cap (e.g., 16) and fail fast.

-    constexpr size_t cMaxEncodableWildcardVariables{32};
+    constexpr size_t cMaxEncodableWildcardVariables{16}; // practical cap to bound 2^k work
@@
-        uint64_t const num_combos{1ULL << wildcard_encodable_positions.size()};
+        auto const k = wildcard_encodable_positions.size();
+        uint64_t const num_combos{1ULL << k};

Optionally add telemetry to surface when the cap is hit.


612-614: Use strongly‑typed enum comparisons for symbol IDs.

Avoid underlying‑integer compares; cast once to SymbolId and compare enums.

-    auto const var_type{variable_token.get_variable_type()};
-
-    bool const is_int{static_cast<uint32_t>(log_surgeon::SymbolId::TokenInt) == var_type};
-    bool const is_float{static_cast<uint32_t>(log_surgeon::SymbolId::TokenFloat) == var_type};
+    auto const var_symbol{
+        static_cast<log_surgeon::SymbolId>(variable_token.get_variable_type())
+    };
+    bool const is_int  = (var_symbol == log_surgeon::SymbolId::TokenInt);
+    bool const is_float= (var_symbol == log_surgeon::SymbolId::TokenFloat);
components/core/tests/test-GrepCore.cpp (1)

493-500: Fix mask width and bit extraction.

Shift on size_t with literal 1 risks UB on 32‑bit; also mask bit should use 1ULL.

-    size_t const num_combos{static_cast<size_t>(1) << wildcard_encodable_positions.size()};
+    uint64_t const num_combos{1ULL << wildcard_encodable_positions.size()};
     REQUIRE(num_combos == 4);
     unordered_set<string> logtype_strings;
-    for (size_t mask{0}; mask < num_combos; ++mask) {
+    for (uint64_t mask{0}; mask < num_combos; ++mask) {
         unordered_map<size_t, bool> wildcard_mask_map;
         for (size_t i{0}; i < wildcard_encodable_positions.size(); ++i) {
-            wildcard_mask_map[wildcard_encodable_positions[i]] = mask >> i & 1;
+            wildcard_mask_map[wildcard_encodable_positions[i]] = (mask >> i) & 1ULL;
         }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d7cfb0d and b633980.

📒 Files selected for processing (4)
  • components/core/src/clp/GrepCore.hpp (6 hunks)
  • components/core/src/clp/string_utils/string_utils.cpp (3 hunks)
  • components/core/tests/test-GrepCore.cpp (1 hunks)
  • components/core/tests/test-string_utils.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/tests/test-string_utils.cpp
  • components/core/src/clp/GrepCore.hpp
  • components/core/src/clp/string_utils/string_utils.cpp
  • components/core/tests/test-GrepCore.cpp
🧠 Learnings (1)
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/core/src/clp/GrepCore.hpp
🧬 Code graph analysis (3)
components/core/tests/test-string_utils.cpp (2)
components/core/src/clp/string_utils/string_utils.cpp (2)
  • wildcard_match_unsafe_case_sensitive (238-342)
  • wildcard_match_unsafe_case_sensitive (238-238)
components/core/src/clp/string_utils/string_utils.hpp (1)
  • wildcard_match_unsafe_case_sensitive (142-142)
components/core/src/clp/GrepCore.hpp (3)
components/core/tests/test-GrepCore.cpp (10)
  • interpretation (55-58)
  • interpretation (55-56)
  • interpretation (60-65)
  • interpretation (60-63)
  • logtype (220-230)
  • logtype (221-222)
  • logtype (232-242)
  • logtype (232-236)
  • generate_schema_sub_queries (88-101)
  • generate_schema_sub_queries (88-93)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • wildcard_search_dictionary_and_get_encoded_matches (222-227)
  • wildcard_search_dictionary_and_get_encoded_matches (497-522)
  • wildcard_search_dictionary_and_get_encoded_matches (497-502)
  • encoded_var (57-57)
  • encoded_var (116-116)
components/core/src/clp/GrepCore.cpp (4)
  • get_wildcard_encodable_positions (149-165)
  • get_wildcard_encodable_positions (149-150)
  • generate_logtype_string (167-232)
  • generate_logtype_string (167-170)
components/core/tests/test-GrepCore.cpp (6)
components/core/src/clp/string_utils/string_utils.cpp (2)
  • wildcard_match_unsafe_case_sensitive (238-342)
  • wildcard_match_unsafe_case_sensitive (238-238)
components/core/src/clp/GrepCore.hpp (12)
  • interpretation (198-200)
  • interpretation (212-215)
  • process_schema_var_token (236-242)
  • process_schema_var_token (601-663)
  • process_schema_var_token (601-607)
  • generate_schema_sub_queries (179-186)
  • generate_schema_sub_queries (529-598)
  • generate_schema_sub_queries (529-535)
  • value (76-81)
  • process_raw_query (56-65)
  • process_raw_query (248-368)
  • process_raw_query (248-257)
components/core/src/clp/GrepCore.cpp (6)
  • get_wildcard_encodable_positions (149-165)
  • get_wildcard_encodable_positions (149-150)
  • generate_logtype_string (167-232)
  • generate_logtype_string (167-170)
  • get_bounds_of_next_potential_var (29-147)
  • get_bounds_of_next_potential_var (29-34)
components/core/src/clp/EncodedVariableInterpreter.hpp (14)
  • id (56-56)
  • value (97-100)
  • value (107-110)
  • add_dict_var (266-271)
  • add_dict_var (548-561)
  • add_dict_var (548-553)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
components/core/src/clp/Query.cpp (2)
  • add_dict_var (102-105)
  • add_dict_var (103-103)
components/core/src/clp/QueryToken.cpp (2)
  • is_var (137-145)
  • is_var (137-137)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: package-image
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: build-macos (macos-15, false)
🔇 Additional comments (4)
components/core/src/clp/string_utils/string_utils.cpp (3)

17-18: LGTM! Clear precondition documentation.

The added comment clarifies the assumption that wild_current has no duplicate greedy wildcards, which aligns with the implementation on line 41-42.


263-269: LGTM! Improved consecutive wildcard handling.

The while loop correctly consumes all consecutive '*' characters and efficiently handles the trailing wildcard case with an early return.


316-319: LGTM! Correct trailing wildcard handling.

The loop correctly consumes all trailing '*' characters in wild when tame ends, and the final comparison properly determines if the match is successful.

components/core/tests/test-string_utils.cpp (1)

148-151: Good additions: double-asterisk coverage is solid.

These cases mirror single-asterisk variants and match the matcher’s semantics for consecutive '*'. No changes requested.

Also applies to: 158-161, 168-171, 178-181, 188-191, 198-201

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
components/core/src/clp/GrepCore.hpp (2)

616-618: Compare SymbolId values strongly

This still relies on uint32_t casts; please cast var_type back to SymbolId and compare strongly typed, matching the .cpp implementation. Doing so removes the magic numbers and avoids future breakage if the enum underlying type changes.

-    auto const var_type{variable_token.get_variable_type()};
-
-    bool const is_int{static_cast<uint32_t>(log_surgeon::SymbolId::TokenInt) == var_type};
-    bool const is_float{static_cast<uint32_t>(log_surgeon::SymbolId::TokenFloat) == var_type};
+    auto const var_type{
+            static_cast<log_surgeon::SymbolId>(variable_token.get_variable_type())
+    };
+
+    bool const is_int{log_surgeon::SymbolId::TokenInt == var_type};
+    bool const is_float{log_surgeon::SymbolId::TokenFloat == var_type};

195-201: Fix typos in the new Doxygen

Please correct “than” → “that” and “encodabe” → “encodable” so the comment reads cleanly.

-     * An encodable variable is a variable token than:
+     * An encodable variable is a variable token that:
@@
-     * @return A vector of positions of encodabe wildcard variables.
+     * @return A vector of positions of encodable wildcard variables.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b633980 and 62f23b0.

📒 Files selected for processing (2)
  • components/core/CMakeLists.txt (1 hunks)
  • components/core/src/clp/GrepCore.hpp (6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/GrepCore.hpp
🧠 Learnings (1)
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/core/src/clp/GrepCore.hpp
🧬 Code graph analysis (1)
components/core/src/clp/GrepCore.hpp (3)
components/core/tests/test-GrepCore.cpp (10)
  • interpretation (55-58)
  • interpretation (55-56)
  • interpretation (60-65)
  • interpretation (60-63)
  • logtype (220-230)
  • logtype (221-222)
  • logtype (232-242)
  • logtype (232-236)
  • generate_schema_sub_queries (88-101)
  • generate_schema_sub_queries (88-93)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • wildcard_search_dictionary_and_get_encoded_matches (222-227)
  • wildcard_search_dictionary_and_get_encoded_matches (497-522)
  • wildcard_search_dictionary_and_get_encoded_matches (497-502)
  • encoded_var (57-57)
  • encoded_var (116-116)
components/core/src/clp/GrepCore.cpp (4)
  • get_wildcard_encodable_positions (149-165)
  • get_wildcard_encodable_positions (149-150)
  • generate_logtype_string (167-232)
  • generate_logtype_string (167-170)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
components/core/CMakeLists.txt (1)

774-776: Scoped test macro wiring looks good

Defining CLP_BUILD_TESTING only for the unitTest target is exactly what the header guard expects, so production builds stay untouched. Nicely done.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/core/src/clp/GrepCore.hpp (1)

141-190: Minor typo in documentation.

Line 151: "We represents" should be "We represent".

-     *    If there are k encodable wildcard variables, then 2^k logtype strings are possible. As a
-     *    result we limit k <= 16. We represents these alternatives using a bit mask.
+     *    If there are k encodable wildcard variables, then 2^k logtype strings are possible. As a
+     *    result we limit k <= 16. We represent these alternatives using a bit mask.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7352fd2 and decca7e.

📒 Files selected for processing (1)
  • components/core/src/clp/GrepCore.hpp (6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/GrepCore.hpp
🧠 Learnings (1)
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/core/src/clp/GrepCore.hpp
🧬 Code graph analysis (1)
components/core/src/clp/GrepCore.hpp (3)
components/core/tests/test-GrepCore.cpp (10)
  • interpretation (59-62)
  • interpretation (59-60)
  • interpretation (64-69)
  • interpretation (64-67)
  • logtype (224-234)
  • logtype (225-226)
  • logtype (236-246)
  • logtype (236-240)
  • generate_schema_sub_queries (92-105)
  • generate_schema_sub_queries (92-97)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • wildcard_search_dictionary_and_get_encoded_matches (222-227)
  • wildcard_search_dictionary_and_get_encoded_matches (497-522)
  • wildcard_search_dictionary_and_get_encoded_matches (497-502)
  • encoded_var (57-57)
  • encoded_var (116-116)
components/core/src/clp/GrepCore.cpp (4)
  • get_wildcard_encodable_positions (150-169)
  • get_wildcard_encodable_positions (150-151)
  • generate_logtype_string (171-238)
  • generate_logtype_string (171-174)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: package-image
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (4)
components/core/src/clp/GrepCore.hpp (4)

7-14: LGTM!

The added includes (<set>, <unordered_map>) and the updated log_surgeon dependency (Query.hpp) are necessary for the new interpretation-based search functionality.


348-359: LGTM!

The interpretation-based path correctly constructs log_surgeon::wildcard_query_parser::Query and generates schema sub-queries. The TODO comment on line 349 appropriately flags the caching optimization opportunity for future work.


540-605: LGTM!

The implementation correctly:

  • Caps encodable wildcards at 16 to prevent exponential explosion (line 540, 543-545)
  • Uses uint64_t for mask operations to avoid undefined behaviour (lines 546-547)
  • Employs find() for single map lookup (lines 575-578)
  • Reserves capacity for possible_logtype_ids (line 597)
  • Follows project style with false == expr comparisons (lines 588, 592)

607-670: LGTM!

The implementation correctly:

  • Uses strongly typed enum comparisons (lines 617-619)
  • Handles wildcard-masked encoded variables (lines 621-624)
  • Falls back to dictionary search for wildcard variables (lines 626-633)
  • Attempts encoding for int/float types before dictionary lookup (lines 635-649)
  • Handles both single and multiple dictionary entries efficiently (lines 651-669)
  • Reserves capacity for both encoded_vars and var_dict_ids (lines 662-663)

@SharafMohamed SharafMohamed changed the title Improve schema-based search accuracy using log_surgeon::Query. feat: Improve schema-based search accuracy using log_surgeon::Query. Oct 1, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8996cfb and 7608cab.

📒 Files selected for processing (3)
  • components/core/src/clp/GrepCore.cpp (2 hunks)
  • components/core/src/clp/GrepCore.hpp (6 hunks)
  • components/core/tests/test-GrepCore.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/GrepCore.hpp
  • components/core/tests/test-GrepCore.cpp
  • components/core/src/clp/GrepCore.cpp
🧠 Learnings (1)
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/core/src/clp/GrepCore.hpp
🧬 Code graph analysis (3)
components/core/src/clp/GrepCore.hpp (3)
components/core/tests/test-GrepCore.cpp (10)
  • interpretations (71-74)
  • interpretations (71-72)
  • interpretation (95-98)
  • interpretation (95-96)
  • interpretation (100-105)
  • interpretation (100-103)
  • logtype (262-272)
  • logtype (263-264)
  • logtype (274-284)
  • logtype (274-278)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • wildcard_search_dictionary_and_get_encoded_matches (225-230)
  • wildcard_search_dictionary_and_get_encoded_matches (505-530)
  • wildcard_search_dictionary_and_get_encoded_matches (505-510)
  • encoded_var (57-57)
  • encoded_var (116-116)
components/core/src/clp/GrepCore.cpp (6)
  • normalize_interpretations (153-186)
  • normalize_interpretations (153-154)
  • get_wildcard_encodable_positions (188-207)
  • get_wildcard_encodable_positions (188-189)
  • generate_logtype_string (209-276)
  • generate_logtype_string (209-212)
components/core/tests/test-GrepCore.cpp (5)
components/core/src/clp/string_utils/string_utils.cpp (2)
  • wildcard_match_unsafe_case_sensitive (236-336)
  • wildcard_match_unsafe_case_sensitive (236-236)
components/core/src/clp/GrepCore.hpp (13)
  • interpretations (152-154)
  • generate_schema_sub_queries (199-206)
  • generate_schema_sub_queries (553-625)
  • generate_schema_sub_queries (553-559)
  • interpretation (218-220)
  • interpretation (232-235)
  • process_schema_var_token (256-262)
  • process_schema_var_token (628-690)
  • process_schema_var_token (628-634)
  • value (81-86)
  • process_raw_query (61-70)
  • process_raw_query (269-390)
  • process_raw_query (269-278)
components/core/src/clp/GrepCore.cpp (8)
  • normalize_interpretations (153-186)
  • normalize_interpretations (153-154)
  • get_wildcard_encodable_positions (188-207)
  • get_wildcard_encodable_positions (188-189)
  • generate_logtype_string (209-276)
  • generate_logtype_string (209-212)
  • get_bounds_of_next_potential_var (33-151)
  • get_bounds_of_next_potential_var (33-38)
components/core/src/clp/EncodedVariableInterpreter.hpp (14)
  • id (56-56)
  • value (97-100)
  • value (107-110)
  • add_dict_var (271-276)
  • add_dict_var (558-571)
  • add_dict_var (558-563)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
components/core/src/clp/Query.cpp (2)
  • add_dict_var (102-105)
  • add_dict_var (103-103)
components/core/src/clp/GrepCore.cpp (3)
components/core/src/clp/string_utils/string_utils.cpp (2)
  • is_wildcard (146-154)
  • is_wildcard (146-146)
components/core/src/clp/GrepCore.hpp (3)
  • interpretations (152-154)
  • interpretation (218-220)
  • interpretation (232-235)
components/core/src/clp/EncodedVariableInterpreter.hpp (11)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • add_dict_var (271-276)
  • add_dict_var (558-571)
  • add_dict_var (558-563)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: antlr-code-committed (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-14, true)
🔇 Additional comments (1)
components/core/src/clp/GrepCore.cpp (1)

209-276: LGTM: efficient logtype construction

Pre-sizing, single map lookup, and typed enum compares look solid.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a130b96 and e2c55a7.

📒 Files selected for processing (2)
  • components/core/src/clp/GrepCore.hpp (6 hunks)
  • components/core/tests/test-GrepCore.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/tests/test-GrepCore.cpp
  • components/core/src/clp/GrepCore.hpp
🧠 Learnings (1)
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/core/src/clp/GrepCore.hpp
🧬 Code graph analysis (2)
components/core/tests/test-GrepCore.cpp (4)
components/core/src/clp/string_utils/string_utils.cpp (2)
  • wildcard_match_unsafe_case_sensitive (236-336)
  • wildcard_match_unsafe_case_sensitive (236-236)
components/core/src/clp/GrepCore.hpp (13)
  • interpretations (154-156)
  • generate_schema_sub_queries (201-208)
  • generate_schema_sub_queries (555-627)
  • generate_schema_sub_queries (555-561)
  • interpretation (220-222)
  • interpretation (234-237)
  • process_schema_var_token (258-264)
  • process_schema_var_token (630-692)
  • process_schema_var_token (630-636)
  • value (83-88)
  • process_raw_query (63-72)
  • process_raw_query (271-392)
  • process_raw_query (271-280)
components/core/src/clp/GrepCore.cpp (8)
  • normalize_interpretations (153-186)
  • normalize_interpretations (153-154)
  • get_wildcard_encodable_positions (188-207)
  • get_wildcard_encodable_positions (188-189)
  • generate_logtype_string (209-276)
  • generate_logtype_string (209-212)
  • get_bounds_of_next_potential_var (33-151)
  • get_bounds_of_next_potential_var (33-38)
components/core/src/clp/EncodedVariableInterpreter.hpp (14)
  • id (56-56)
  • value (97-100)
  • value (107-110)
  • add_dict_var (271-276)
  • add_dict_var (558-571)
  • add_dict_var (558-563)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
components/core/src/clp/GrepCore.hpp (4)
components/core/tests/test-GrepCore.cpp (12)
  • interpretations (71-74)
  • interpretations (71-72)
  • interpretation (95-98)
  • interpretation (95-96)
  • interpretation (100-105)
  • interpretation (100-103)
  • logtype (263-273)
  • logtype (264-265)
  • logtype (275-285)
  • logtype (275-279)
  • generate_schema_sub_queries (80-93)
  • generate_schema_sub_queries (80-85)
components/core/src/clp/QueryToken.cpp (2)
  • is_var (137-145)
  • is_var (137-137)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • wildcard_search_dictionary_and_get_encoded_matches (225-230)
  • wildcard_search_dictionary_and_get_encoded_matches (505-530)
  • wildcard_search_dictionary_and_get_encoded_matches (505-510)
  • encoded_var (57-57)
  • encoded_var (116-116)
components/core/src/clp/GrepCore.cpp (6)
  • normalize_interpretations (153-186)
  • normalize_interpretations (153-154)
  • get_wildcard_encodable_positions (188-207)
  • get_wildcard_encodable_positions (188-189)
  • generate_logtype_string (209-276)
  • generate_logtype_string (209-212)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
🔇 Additional comments (1)
components/core/tests/test-GrepCore.cpp (1)

69-124: Nice GrepCore test harness.

Line 69: The GrepCoreTest wrapper cleanly exposes the schema-focused helpers for unit coverage without widening the production surface—exactly what we need to exercise the DFA flow.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ddf3e23 and d629ae9.

📒 Files selected for processing (3)
  • components/core/src/clp/GrepCore.cpp (2 hunks)
  • components/core/src/clp/GrepCore.hpp (6 hunks)
  • components/core/tests/test-GrepCore.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/tests/test-GrepCore.cpp
  • components/core/src/clp/GrepCore.cpp
  • components/core/src/clp/GrepCore.hpp
🧠 Learnings (1)
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/core/src/clp/GrepCore.hpp
🧬 Code graph analysis (3)
components/core/tests/test-GrepCore.cpp (5)
components/core/src/clp/string_utils/string_utils.cpp (2)
  • wildcard_match_unsafe_case_sensitive (236-336)
  • wildcard_match_unsafe_case_sensitive (236-236)
components/core/src/clp/GrepCore.hpp (13)
  • interpretations (154-156)
  • generate_schema_sub_queries (201-208)
  • generate_schema_sub_queries (557-636)
  • generate_schema_sub_queries (557-563)
  • interpretation (220-222)
  • interpretation (235-239)
  • process_schema_var_token (260-266)
  • process_schema_var_token (639-701)
  • process_schema_var_token (639-645)
  • value (83-88)
  • process_raw_query (63-72)
  • process_raw_query (273-394)
  • process_raw_query (273-282)
components/core/src/clp/GrepCore.cpp (8)
  • normalize_interpretations (153-186)
  • normalize_interpretations (153-154)
  • get_wildcard_encodable_positions (188-207)
  • get_wildcard_encodable_positions (188-189)
  • generate_logtype_string (209-280)
  • generate_logtype_string (209-213)
  • get_bounds_of_next_potential_var (33-151)
  • get_bounds_of_next_potential_var (33-38)
components/core/src/clp/EncodedVariableInterpreter.hpp (14)
  • id (56-56)
  • value (97-100)
  • value (107-110)
  • add_dict_var (271-276)
  • add_dict_var (558-571)
  • add_dict_var (558-563)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
components/core/src/clp/Query.cpp (2)
  • add_dict_var (102-105)
  • add_dict_var (103-103)
components/core/src/clp/GrepCore.cpp (3)
components/core/src/clp/string_utils/string_utils.cpp (2)
  • is_wildcard (146-154)
  • is_wildcard (146-146)
components/core/src/clp/GrepCore.hpp (3)
  • interpretations (154-156)
  • interpretation (220-222)
  • interpretation (235-239)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • add_dict_var (271-276)
  • add_dict_var (558-571)
  • add_dict_var (558-563)
  • encoded_var (57-57)
  • encoded_var (116-116)
components/core/src/clp/GrepCore.hpp (2)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • wildcard_search_dictionary_and_get_encoded_matches (225-230)
  • wildcard_search_dictionary_and_get_encoded_matches (505-530)
  • wildcard_search_dictionary_and_get_encoded_matches (505-510)
  • encoded_var (57-57)
  • encoded_var (116-116)
components/core/src/clp/GrepCore.cpp (6)
  • normalize_interpretations (153-186)
  • normalize_interpretations (153-154)
  • get_wildcard_encodable_positions (188-207)
  • get_wildcard_encodable_positions (188-189)
  • generate_logtype_string (209-280)
  • generate_logtype_string (209-213)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: build-macos (macos-15, false)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9920bc9 and e7dcc43.

📒 Files selected for processing (1)
  • components/core/src/clp/Query.hpp (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/Query.hpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (macos-15)
🔇 Additional comments (1)
components/core/src/clp/Query.hpp (1)

162-162: LGTM!

Default-initializing m_wildcard_match_required to false is a good practice that prevents undefined behaviour and ensures deterministic behaviour when SubQuery objects are compared. The default value is semantically correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants