feat: Add more debugging to test-lexer.cpp; Add lexer schema to NFA and DFA tests. #92

SharafMohamed · 2025-02-03T19:16:30Z

References

Depends on feat: Add a unique state id to each NFA state. #90.
To review in parallel run:

git fetch upstream pull/90/head:pr-90
git fetch upstream pull/91/head:pr-91
git diff pr-90 pr-91

Description

Previously if test-lexer.cpp failed it was not possible to tell where it failed or why it failed.
Added more CAPTUREs to test-lexer.cpp.
Added lexer schema userId=(?<uID>123) to NFA and DFA tests.

Validation performed

Updated test-lexer.cpp.

Summary by CodeRabbit

New Features
- Enhanced text processing with improved lexing and regex capturing.
- Extended finite automata capabilities with advanced state management, transition handling, and serialization.
- Introduced a unique identifier generator and refined register operations.
Refactor
- Consolidated legacy tag handling into a clearer capture-based approach.
- Updated APIs and type interfaces for improved reliability and type safety.
- Streamlined state transitions and error handling mechanisms.
Tests
- Added new test cases and restructured existing tests to ensure robust validation of core functionality.

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (3)

src/log_surgeon/Lexer.tpp (3)
246-247: ⚠️ Potential issue

Add null check before calling is_accepting()

The code calls dest_state->is_accepting() without verifying that dest_state is non-null.

Apply this diff to add the necessary null check:
auto* dest_state{state->get_dest_state(byte)};
-if (false == dest_state->is_accepting()) {
+if (nullptr == dest_state || false == dest_state->is_accepting()) {
    // handle accordingly
}
280-283: ⚠️ Potential issue

Add null check before pushing to unvisited_states

The code pushes dest_state into unvisited_states without verifying it's non-null.

Apply this diff to add the necessary null check:
TypedDfaState const* dest_state{current_state->get_dest_state(byte)};
+if (nullptr == dest_state) {
+    continue;
+}
if (false == visited_states.contains(dest_state)) {
    unvisited_states.push(dest_state);
}
87-98: ⚠️ Potential issue

Add null check before dereferencing dest_state

At line 93, dest_state is dereferenced without verifying it's non-null. This could lead to a crash if get_dest_state returns nullptr.

Apply this diff to add the necessary null check:
 if (m_has_delimiters && !m_match) {
     dest_state = m_dfa->get_root()->get_dest_state(next_char);
+    if (nullptr == dest_state) {
+        continue;
+    }
     m_match = true;
     m_type_ids = &(dest_state->get_matching_variable_ids());

🧹 Nitpick comments (6)

tests/test-lexer.cpp (2)
185-185: Track the TODO comment for register values verification.

The TODO comment indicates that register values verification is pending the implementation of DFA simulation.

Would you like me to create an issue to track this TODO item? I can help implement the register values verification once the DFA simulation is complete.

151-186: Enhance error handling in test_scanning_input.

The function assumes successful token scanning but should handle potential error cases more robustly.

Consider adding test cases for error scenarios:
 auto test_scanning_input(ByteLexer& lexer, std::string_view input, std::string_view rule_name)
         -> void {
     CAPTURE(input);
-    CAPTURE(symbol);
+    CAPTURE(rule_name);
     
     lexer.reset();
     CAPTURE(serialize_id_symbol_map(lexer.m_id_symbol));
 
     log_surgeon::ParserInputBuffer input_buffer;
     string token_string{input};
     input_buffer.set_storage(token_string.data(), token_string.size(), 0, true);
     lexer.prepend_start_of_file_char(input_buffer);
 
     log_surgeon::Token token;
     auto error_code{lexer.scan(input_buffer, token)};
+    CAPTURE(error_code);
     REQUIRE(nullptr != token.m_type_ids_ptr);
     CAPTURE(token.to_string_view());
     CAPTURE(*token.m_type_ids_ptr);
     REQUIRE(log_surgeon::ErrorCode::Success == error_code);
src/log_surgeon/finite_automata/NfaState.hpp (1)

221-222: Address the TODO comment regarding UTF-8 handling.

The TODO comment indicates missing UTF-8 case handling in the BFS traversal. This could lead to incomplete traversal for UTF-8 characters.

Would you like me to help implement the UTF-8 case handling or create an issue to track this task?

src/log_surgeon/finite_automata/Nfa.hpp (1)

119-121: Consider addressing the TODO comment about capture group naming.

The TODO comment highlights a limitation in the current implementation where capture groups must have unique names. This could restrict valid use cases.

Would you like me to help design a solution that allows for non-unique capture group names?
src/log_surgeon/Lexer.tpp (2)
380-408: Add validation for empty capture names

While the changes improve capture handling, consider adding validation to ensure capture names are not empty.

Add this validation before inserting into m_symbol_id:
 for (auto const& rule : m_rules) {
     for (auto const* capture : rule.get_captures()) {
         std::string const capture_name{capture->get_name()};
+        if (capture_name.empty()) {
+            throw std::invalid_argument("Capture names cannot be empty");
+        }
         if (m_symbol_id.contains(capture_name)) {
             throw std::invalid_argument("`m_rules` contains capture names that are not unique.");
         }
406-407: Track DFA's capture handling limitation

The TODO comment indicates that DFA ignores captures, which could lead to functionality loss.

Would you like me to create an issue to track this limitation and propose a solution for handling captures in DFA?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 70c1464 and ca58703.

📒 Files selected for processing (14)

CMakeLists.txt (2 hunks)
src/log_surgeon/Aliases.hpp (1 hunks)
src/log_surgeon/Lexer.hpp (5 hunks)
src/log_surgeon/Lexer.tpp (10 hunks)
src/log_surgeon/LexicalRule.hpp (3 hunks)
src/log_surgeon/UniqueIdGenerator.hpp (1 hunks)
src/log_surgeon/finite_automata/Dfa.hpp (5 hunks)
src/log_surgeon/finite_automata/Nfa.hpp (6 hunks)
src/log_surgeon/finite_automata/NfaState.hpp (6 hunks)
src/log_surgeon/finite_automata/PrefixTree.hpp (1 hunks)
src/log_surgeon/finite_automata/RegexAST.hpp (20 hunks)
src/log_surgeon/finite_automata/TaggedTransition.hpp (3 hunks)
tests/CMakeLists.txt (2 hunks)
tests/test-lexer.cpp (7 hunks)

✅ Files skipped from review due to trivial changes (1)

src/log_surgeon/Aliases.hpp

🚧 Files skipped from review as they are similar to previous changes (4)

src/log_surgeon/UniqueIdGenerator.hpp
src/log_surgeon/finite_automata/PrefixTree.hpp
src/log_surgeon/finite_automata/Dfa.hpp
src/log_surgeon/Lexer.hpp

🧰 Additional context used

📓 Path-based instructions (1)

`**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: - Prefer `false ==

**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/LexicalRule.hpp
tests/test-lexer.cpp
src/log_surgeon/finite_automata/TaggedTransition.hpp
src/log_surgeon/finite_automata/NfaState.hpp
src/log_surgeon/finite_automata/Nfa.hpp
src/log_surgeon/finite_automata/RegexAST.hpp

🧠 Learnings (4)

src/log_surgeon/finite_automata/TaggedTransition.hpp (3)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

src/log_surgeon/finite_automata/NfaState.hpp (1)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

src/log_surgeon/finite_automata/Nfa.hpp (1)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

src/log_surgeon/finite_automata/RegexAST.hpp (3)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

🔇 Additional comments (16)

tests/test-lexer.cpp (2)

1-1: Replace deprecated UTF conversion utilities.

The code still uses deprecated codecvt_utf8 and wstring_convert. Consider using modern alternatives like the UTF-8 facilities in C++20's <text_encoding> header or a third-party library like ICU.

Also applies to: 108-109

338-396: Well-structured test cases with improved debugging!

The new test cases effectively validate lexer functionality with and without capture groups. The use of CAPTURE macros enhances debugging output, making it easier to identify failure causes.

src/log_surgeon/LexicalRule.hpp (1)

28-30: LGTM! Well-designed getter method.

The new get_captures() method is well-implemented with proper const correctness and efficient return by const reference.

src/log_surgeon/finite_automata/TaggedTransition.hpp (2)

23-24: LGTM! Improved type safety with tag IDs.

The transition from Tag const* to tag_id_t in PositiveTaggedTransition eliminates potential null pointer issues and improves type safety.

Also applies to: 43-43

54-55: LGTM! Enhanced vector handling.

The transition from std::vector<Tag const*> to std::vector<tag_id_t> in NegativeTaggedTransition simplifies memory management and improves type safety. The use of std::move for the vector is efficient.

Also applies to: 74-74
src/log_surgeon/finite_automata/NfaState.hpp (2)

41-41: LGTM! Well-structured state management.

The new member variables and constructor changes provide better state tracking with:

Unique ID generation for states and tags

Clear mapping of state relationships

Proper memory management with smart pointers

Also applies to: 119-124

183-199: Consider adding error handling for tag pair creation.

The new_start_and_end_states_for_capture method assumes tag pair creation will always succeed. Consider adding error handling for potential failure cases.
❌ Verification inconclusive

🏁 Script executed:
#!/bin/bash
# Check for error handling in similar methods
rg -A 5 "create.*tag" .
Length of output: 2360

Subject: Review Comment on Error Handling for Tag Pair Creation

The review comment suggests that the new_start_and_end_states_for_capture method should include error handling in case tag pair creation fails. Our search for similar methods (namely, get_or_create_capture_tag_pair in src/log_surgeon/finite_automata/Nfa.hpp) confirms that error handling is not currently implemented there either—indicating that tag generation is assumed always to succeed.

Since both methods rely on m_tag_id_generator.generate_id() without explicit error checks, we acknowledge the concern. Although similar patterns exist elsewhere, this area may benefit from robust error handling to cover potential failure paths (e.g., if ID generation fails or returns an invalid value).

Please consider revising the new_start_and_end_states_for_capture method to include error handling logic to verify that valid tag pairs are created. This might include checking the result of each call to generate_id() and handling any exceptions or error codes accordingly.
src/log_surgeon/finite_automata/Nfa.hpp (2)

136-144: LGTM! Well-implemented capture tag pair management.

The get_or_create_capture_tag_pair method efficiently manages tag pairs with:

Proper caching of existing pairs

Thread-safe ID generation

Clear error conditions

235-252: LGTM! Improved error handling in serialization.

The serialization method now properly handles potential failures by:

Using std::optional for error propagation

Clear error conditions

Efficient string building with fmt::format

src/log_surgeon/Lexer.tpp (2)

361-367: LGTM! Improved type safety with rule_id_t

The change from uint32_t const& id to rule_id_t const rule_id enhances type safety by using a more specific type for rule IDs.

369-378: LGTM! Consistent use of rule_id_t

The change to use rule_id_t in get_rule maintains consistency with the type system changes.

src/log_surgeon/finite_automata/RegexAST.hpp (3)

33-39: LGTM! Clear and accurate documentation

The updated documentation clearly explains the role of captures and their relationship with the AST nodes.

632-634: LGTM! Well-documented invariants

The documentation clearly specifies the non-null requirements for m_capture and m_capture_regex_ast, which helps prevent null pointer issues.

652-657: LGTM! Robust null pointer validation

The constructor properly validates the non-null requirements with clear error messages.

tests/CMakeLists.txt (1)

35-43: LGTM! Improved test organization

The changes improve readability with a better format and add necessary test coverage for new components.

CMakeLists.txt (1)

64-116: LGTM! Well-organized source files

The changes properly integrate new components while maintaining a logical organization.

SharafMohamed added 30 commits December 6, 2024 18:52

Initialize byte transitions.

244d122

Use const* in place of unique_ptr reference; Update docstrings.

176391b

Update intersect test to compile.

012f61f

Update next() docstring.

96a6363

Merge branch 'individual-dfa-files' into remove-redundant-typenames

7cd39f0

Rename to state_type.

a4a93b4

Update headers.

421c3de

Merge branch 'individual-dfa-files' into remove-redundant-typenames

ecd4e4e

Update Lexer headers.

1b945a1

Merge branch 'individual-dfa-files' into remove-redundant-typenames

70cd43d

Add header for conditional_t.

78c4125

Merge branch 'individual-dfa-files' into remove-redundant-typenames

3ce0b30

Linter.

33623fa

Merge branch 'individual-dfa-files' into remove-redundant-typenames

bae0557

Linter.

5bbeafc

Change ! to false ==.

0decaf5

Merge branch 'individual-dfa-files' into remove-redundant-typenames

0e2d593

Merge branch 'remove-redundant-typenames' into remove-regex-prefix

cf5980d

LALR1Parser to Lalr1Parser.

6e65a3e

Linter.

9c2ad81

Rename templates to TypedDfaState and TypedNfaState.

c4fc96b

Rename to Utf8*State and Byte*State.

a6bbaef

Remove RegexDFAStateType.hpp.

ff2dac3

Linter.

7a8982d

Linter again.

7b0a86c

Add missing alogrithm header; Update test-NFA.

2632616

Update intersect-test.cpp with new names.

992a2ec

Linter on intersect-test.cpp.

a5419f0

Merge branch 'remove-regex-prefix' into remove-uneeded-this

ae64f64

Switch this->m_lexer to m_lexer by using Parser::m_lexer.

9153a7c

SharafMohamed added 16 commits February 10, 2025 10:30

Use contains in Lexer.hpp.

630749a

Switch find to contains in TaggedTransition.hpp.

ddae224

Merge branch 'restructure-tagging' into fix-tagged-transitions

1b950fd

Merge branch 'fix-tagged-transitions' into raii-accepting-state

54d7272

Merge branch 'raii-accepting-state' into clang-tidy-pre-tdfa

a87907d

Add missing header.

a2d1825

Merge branch 'restructure-tagging' into fix-tagged-transitions

f7db608

Merge branch 'fix-tagged-transitions' into raii-accepting-state

9db7584

Merge branch 'raii-accepting-state' into clang-tidy-pre-tdfa

718eb83

Merge branch 'clang-tidy-pre-tdfa' into rename-vars

da14105

Merge branch 'rename-vars' into two-template-dfa

9d50cf9

Merge branch 'two-template-dfa' into register-handler-cleanup

8eb6c10

Merge branch 'register-handler-cleanup' into add-register-operations

c8b59d2

Merge branch 'add-register-operations' into state-ids

ab05807

Merge branch 'state-ids' into detailed-lexer-debuging

ca58703

Replace find with contains.

c6a926c

coderabbitai bot reviewed Feb 11, 2025

View reviewed changes

SharafMohamed added 13 commits February 11, 2025 18:50

Rename Aliases.hpp to types.hpp.

983ddcb

initialize_lexer now creates and returns the lexer.

303de17

Fix compiler error.

9e99f40

Merge branch 'restructure-tagging' into fix-tagged-transitions

bc81182

Remove unused header.

fb2de27

Merge branch 'fix-tagged-transitions' into raii-accepting-state

6ab9b85

Merge branch 'raii-accepting-state' into clang-tidy-pre-tdfa

b2945c1

Merge branch 'clang-tidy-pre-tdfa' into rename-vars

2a4a842

Merge branch 'rename-vars' into two-template-dfa

9bcdeae

Merge branch 'two-template-dfa' into register-handler-cleanup

f61f96e

Merge branch 'register-handler-cleanup' into add-register-operations

80e8f3c

Merge branch 'add-register-operations' into state-ids

467d3f1

Merge branch 'state-ids' into detailed-lexer-debuging

bde391a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add more debugging to test-lexer.cpp; Add lexer schema to NFA and DFA tests. #92

feat: Add more debugging to test-lexer.cpp; Add lexer schema to NFA and DFA tests. #92

SharafMohamed commented Feb 3, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot left a comment

feat: Add more debugging to test-lexer.cpp; Add lexer schema to NFA and DFA tests. #92

Are you sure you want to change the base?

feat: Add more debugging to test-lexer.cpp; Add lexer schema to NFA and DFA tests. #92

Conversation

SharafMohamed commented Feb 3, 2025 • edited by coderabbitai bot Loading

References

Description

Validation performed

Summary by CodeRabbit

coderabbitai bot left a comment

Choose a reason for hiding this comment

SharafMohamed commented Feb 3, 2025 •

edited by coderabbitai bot

Loading