Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add more debugging to test-lexer.cpp; Add lexer schema to NFA and DFA tests. #92

Open
wants to merge 648 commits into
base: main
Choose a base branch
from

Conversation

SharafMohamed
Copy link
Contributor

@SharafMohamed SharafMohamed commented Feb 3, 2025

References

git fetch upstream pull/90/head:pr-90
git fetch upstream pull/91/head:pr-91
git diff pr-90 pr-91

Description

  • Previously if test-lexer.cpp failed it was not possible to tell where it failed or why it failed.
  • Added more CAPTUREs to test-lexer.cpp.
  • Added lexer schema userId=(?<uID>123) to NFA and DFA tests.

Validation performed

  • Updated test-lexer.cpp.

Summary by CodeRabbit

  • New Features

    • Enhanced text processing with improved lexing and regex capturing.
    • Extended finite automata capabilities with advanced state management, transition handling, and serialization.
    • Introduced a unique identifier generator and refined register operations.
  • Refactor

    • Consolidated legacy tag handling into a clearer capture-based approach.
    • Updated APIs and type interfaces for improved reliability and type safety.
    • Streamlined state transitions and error handling mechanisms.
  • Tests

    • Added new test cases and restructured existing tests to ensure robust validation of core functionality.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
src/log_surgeon/Lexer.tpp (3)

246-247: ⚠️ Potential issue

Add null check before calling is_accepting()

The code calls dest_state->is_accepting() without verifying that dest_state is non-null.

Apply this diff to add the necessary null check:

auto* dest_state{state->get_dest_state(byte)};
-if (false == dest_state->is_accepting()) {
+if (nullptr == dest_state || false == dest_state->is_accepting()) {
    // handle accordingly
}

280-283: ⚠️ Potential issue

Add null check before pushing to unvisited_states

The code pushes dest_state into unvisited_states without verifying it's non-null.

Apply this diff to add the necessary null check:

TypedDfaState const* dest_state{current_state->get_dest_state(byte)};
+if (nullptr == dest_state) {
+    continue;
+}
if (false == visited_states.contains(dest_state)) {
    unvisited_states.push(dest_state);
}

87-98: ⚠️ Potential issue

Add null check before dereferencing dest_state

At line 93, dest_state is dereferenced without verifying it's non-null. This could lead to a crash if get_dest_state returns nullptr.

Apply this diff to add the necessary null check:

 if (m_has_delimiters && !m_match) {
     dest_state = m_dfa->get_root()->get_dest_state(next_char);
+    if (nullptr == dest_state) {
+        continue;
+    }
     m_match = true;
     m_type_ids = &(dest_state->get_matching_variable_ids());
🧹 Nitpick comments (6)
tests/test-lexer.cpp (2)

185-185: Track the TODO comment for register values verification.

The TODO comment indicates that register values verification is pending the implementation of DFA simulation.

Would you like me to create an issue to track this TODO item? I can help implement the register values verification once the DFA simulation is complete.


151-186: Enhance error handling in test_scanning_input.

The function assumes successful token scanning but should handle potential error cases more robustly.

Consider adding test cases for error scenarios:

 auto test_scanning_input(ByteLexer& lexer, std::string_view input, std::string_view rule_name)
         -> void {
     CAPTURE(input);
-    CAPTURE(symbol);
+    CAPTURE(rule_name);
     
     lexer.reset();
     CAPTURE(serialize_id_symbol_map(lexer.m_id_symbol));
 
     log_surgeon::ParserInputBuffer input_buffer;
     string token_string{input};
     input_buffer.set_storage(token_string.data(), token_string.size(), 0, true);
     lexer.prepend_start_of_file_char(input_buffer);
 
     log_surgeon::Token token;
     auto error_code{lexer.scan(input_buffer, token)};
+    CAPTURE(error_code);
     REQUIRE(nullptr != token.m_type_ids_ptr);
     CAPTURE(token.to_string_view());
     CAPTURE(*token.m_type_ids_ptr);
     REQUIRE(log_surgeon::ErrorCode::Success == error_code);
src/log_surgeon/finite_automata/NfaState.hpp (1)

221-222: Address the TODO comment regarding UTF-8 handling.

The TODO comment indicates missing UTF-8 case handling in the BFS traversal. This could lead to incomplete traversal for UTF-8 characters.

Would you like me to help implement the UTF-8 case handling or create an issue to track this task?

src/log_surgeon/finite_automata/Nfa.hpp (1)

119-121: Consider addressing the TODO comment about capture group naming.

The TODO comment highlights a limitation in the current implementation where capture groups must have unique names. This could restrict valid use cases.

Would you like me to help design a solution that allows for non-unique capture group names?

src/log_surgeon/Lexer.tpp (2)

380-408: Add validation for empty capture names

While the changes improve capture handling, consider adding validation to ensure capture names are not empty.

Add this validation before inserting into m_symbol_id:

 for (auto const& rule : m_rules) {
     for (auto const* capture : rule.get_captures()) {
         std::string const capture_name{capture->get_name()};
+        if (capture_name.empty()) {
+            throw std::invalid_argument("Capture names cannot be empty");
+        }
         if (m_symbol_id.contains(capture_name)) {
             throw std::invalid_argument("`m_rules` contains capture names that are not unique.");
         }

406-407: Track DFA's capture handling limitation

The TODO comment indicates that DFA ignores captures, which could lead to functionality loss.

Would you like me to create an issue to track this limitation and propose a solution for handling captures in DFA?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 70c1464 and ca58703.

📒 Files selected for processing (14)
  • CMakeLists.txt (2 hunks)
  • src/log_surgeon/Aliases.hpp (1 hunks)
  • src/log_surgeon/Lexer.hpp (5 hunks)
  • src/log_surgeon/Lexer.tpp (10 hunks)
  • src/log_surgeon/LexicalRule.hpp (3 hunks)
  • src/log_surgeon/UniqueIdGenerator.hpp (1 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (5 hunks)
  • src/log_surgeon/finite_automata/Nfa.hpp (6 hunks)
  • src/log_surgeon/finite_automata/NfaState.hpp (6 hunks)
  • src/log_surgeon/finite_automata/PrefixTree.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (20 hunks)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp (3 hunks)
  • tests/CMakeLists.txt (2 hunks)
  • tests/test-lexer.cpp (7 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/log_surgeon/Aliases.hpp
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/log_surgeon/UniqueIdGenerator.hpp
  • src/log_surgeon/finite_automata/PrefixTree.hpp
  • src/log_surgeon/finite_automata/Dfa.hpp
  • src/log_surgeon/Lexer.hpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: - Prefer `false ==

**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • src/log_surgeon/LexicalRule.hpp
  • tests/test-lexer.cpp
  • src/log_surgeon/finite_automata/TaggedTransition.hpp
  • src/log_surgeon/finite_automata/NfaState.hpp
  • src/log_surgeon/finite_automata/Nfa.hpp
  • src/log_surgeon/finite_automata/RegexAST.hpp
🧠 Learnings (4)
src/log_surgeon/finite_automata/TaggedTransition.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/NfaState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/Nfa.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/RegexAST.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
🔇 Additional comments (16)
tests/test-lexer.cpp (2)

1-1: Replace deprecated UTF conversion utilities.

The code still uses deprecated codecvt_utf8 and wstring_convert. Consider using modern alternatives like the UTF-8 facilities in C++20's <text_encoding> header or a third-party library like ICU.

Also applies to: 108-109


338-396: Well-structured test cases with improved debugging!

The new test cases effectively validate lexer functionality with and without capture groups. The use of CAPTURE macros enhances debugging output, making it easier to identify failure causes.

src/log_surgeon/LexicalRule.hpp (1)

28-30: LGTM! Well-designed getter method.

The new get_captures() method is well-implemented with proper const correctness and efficient return by const reference.

src/log_surgeon/finite_automata/TaggedTransition.hpp (2)

23-24: LGTM! Improved type safety with tag IDs.

The transition from Tag const* to tag_id_t in PositiveTaggedTransition eliminates potential null pointer issues and improves type safety.

Also applies to: 43-43


54-55: LGTM! Enhanced vector handling.

The transition from std::vector<Tag const*> to std::vector<tag_id_t> in NegativeTaggedTransition simplifies memory management and improves type safety. The use of std::move for the vector is efficient.

Also applies to: 74-74

src/log_surgeon/finite_automata/NfaState.hpp (2)

41-41: LGTM! Well-structured state management.

The new member variables and constructor changes provide better state tracking with:

  • Unique ID generation for states and tags
  • Clear mapping of state relationships
  • Proper memory management with smart pointers

Also applies to: 119-124


183-199: Consider adding error handling for tag pair creation.

The new_start_and_end_states_for_capture method assumes tag pair creation will always succeed. Consider adding error handling for potential failure cases.

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Check for error handling in similar methods
rg -A 5 "create.*tag" .

Length of output: 2360


Subject: Review Comment on Error Handling for Tag Pair Creation

The review comment suggests that the new_start_and_end_states_for_capture method should include error handling in case tag pair creation fails. Our search for similar methods (namely, get_or_create_capture_tag_pair in src/log_surgeon/finite_automata/Nfa.hpp) confirms that error handling is not currently implemented there either—indicating that tag generation is assumed always to succeed.

Since both methods rely on m_tag_id_generator.generate_id() without explicit error checks, we acknowledge the concern. Although similar patterns exist elsewhere, this area may benefit from robust error handling to cover potential failure paths (e.g., if ID generation fails or returns an invalid value).

Please consider revising the new_start_and_end_states_for_capture method to include error handling logic to verify that valid tag pairs are created. This might include checking the result of each call to generate_id() and handling any exceptions or error codes accordingly.

src/log_surgeon/finite_automata/Nfa.hpp (2)

136-144: LGTM! Well-implemented capture tag pair management.

The get_or_create_capture_tag_pair method efficiently manages tag pairs with:

  • Proper caching of existing pairs
  • Thread-safe ID generation
  • Clear error conditions

235-252: LGTM! Improved error handling in serialization.

The serialization method now properly handles potential failures by:

  • Using std::optional for error propagation
  • Clear error conditions
  • Efficient string building with fmt::format
src/log_surgeon/Lexer.tpp (2)

361-367: LGTM! Improved type safety with rule_id_t

The change from uint32_t const& id to rule_id_t const rule_id enhances type safety by using a more specific type for rule IDs.


369-378: LGTM! Consistent use of rule_id_t

The change to use rule_id_t in get_rule maintains consistency with the type system changes.

src/log_surgeon/finite_automata/RegexAST.hpp (3)

33-39: LGTM! Clear and accurate documentation

The updated documentation clearly explains the role of captures and their relationship with the AST nodes.


632-634: LGTM! Well-documented invariants

The documentation clearly specifies the non-null requirements for m_capture and m_capture_regex_ast, which helps prevent null pointer issues.


652-657: LGTM! Robust null pointer validation

The constructor properly validates the non-null requirements with clear error messages.

tests/CMakeLists.txt (1)

35-43: LGTM! Improved test organization

The changes improve readability with a better format and add necessary test coverage for new components.

CMakeLists.txt (1)

64-116: LGTM! Well-organized source files

The changes properly integrate new components while maintaining a logical organization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant