feat: separate the concept of captures and tags; lexer now tracks mapping from variables to capture to tags to registers. #72

SharafMohamed · 2025-01-13T10:54:20Z

References

Depends on #80 (fix(gh-workflow): Lock Ubuntu runner to 22.04 as a temporary workaround for #75.) in order to compile on GitHub.

Description

Previously, the term "tag" referred both to a single capture group and to the start and end markers of a capture group's position in the NFA.
- The former is now called a capture.
- The latter is now stored as a unique unsigned integer.
To simplify information tracking and ownership transfer, the lexer is now responsible for keeping track of all the relational information it will need after parsing. This includes:
- A map from each variable ID to the capture IDs of the groups the variable contains.
- A map from each capture to its start and end tags.
- A map from each tag to its final register.
Fix file order in CMake.
Use *_id_t aliases to make the intended purpose of each map clearer.
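
As a rough illustration of the three maps listed above, the following sketch chains a capture to its tag pair and then to registers. The map names `m_capture_id_to_tag_id_pair` and `m_tag_to_register_id` and the accessor name `get_reg_ids_from_capture_id` appear in the review below; everything else here (the `LexerMaps` struct, the `capture_id_t` and `reg_id_t` aliases, the `m_var_id_to_capture_ids` name, the underlying integer types, and the use of `std::unordered_map`) is an assumption made for illustration and may differ from the actual code.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumed underlying types; the PR only states that `*_id_t` aliases are used.
using rule_id_t = std::uint32_t;  // variable (rule) ID
using capture_id_t = std::uint32_t;
using tag_id_t = std::uint32_t;
using reg_id_t = std::uint32_t;

// Hypothetical container for the relational information kept after parsing.
struct LexerMaps {
    std::unordered_map<rule_id_t, std::vector<capture_id_t>> m_var_id_to_capture_ids;
    std::unordered_map<capture_id_t, std::pair<tag_id_t, tag_id_t>> m_capture_id_to_tag_id_pair;
    std::unordered_map<tag_id_t, reg_id_t> m_tag_to_register_id;

    // Follows capture -> (start tag, end tag) -> (start register, end register).
    // Returns std::nullopt if any link in the chain is missing.
    [[nodiscard]] auto get_reg_ids_from_capture_id(capture_id_t capture_id) const
            -> std::optional<std::pair<reg_id_t, reg_id_t>> {
        auto const tags{m_capture_id_to_tag_id_pair.find(capture_id)};
        if (m_capture_id_to_tag_id_pair.end() == tags) {
            return std::nullopt;
        }
        auto const start_reg{m_tag_to_register_id.find(tags->second.first)};
        auto const end_reg{m_tag_to_register_id.find(tags->second.second)};
        if (m_tag_to_register_id.end() == start_reg || m_tag_to_register_id.end() == end_reg) {
            return std::nullopt;
        }
        return std::make_pair(start_reg->second, end_reg->second);
    }
};
```

Keeping the three maps separate means each stage (parsing, NFA construction, determinization) only has to fill in the link it knows about.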

Validation performed

Added a new unit test for the lexer's base functionality.
Added a new unit test for the lexer's capture group functionality. This includes testing the maps that can currently be assigned.

Summary by CodeRabbit

New Features
- Enhanced regular expression processing and capture handling for improved error reporting and consistent token management.
- Upgraded lexical analysis to provide more robust recognition and assignment of tokens, ensuring smoother operation.
- Introduced a mechanism for generating unique identifiers and clear type definitions, bolstering overall system reliability.
- Added new methods to the Lexer class for improved capture and tag management.
- Introduced a new Capture class to replace the previous Tag class, streamlining capture handling.
Tests
- Expanded test coverage to validate the new capture and lexing improvements, ensuring higher quality and stability.
- Introduced unit tests for the new Capture class and enhanced tests for the lexer.
Chores
- Streamlined build configuration and source management for improved development efficiency.
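
The "unique identifiers" mentioned in the summary could be produced by something as small as a monotonically increasing counter. The sketch below is only a guess at the shape of such a mechanism: the class name `UniqueIdGenerator`, the method name `generate_id`, and the 32-bit ID type are all hypothetical and not confirmed by this page.

```cpp
#include <cstdint>

// Hypothetical ID generator; name, interface, and ID width are assumptions.
class UniqueIdGenerator {
public:
    // Returns the current ID and advances the counter, so no value is handed out twice
    // (until the counter wraps around).
    [[nodiscard]] auto generate_id() -> std::uint32_t { return m_next_id++; }

private:
    std::uint32_t m_next_id{0};
};
```

With a generator like this, a capture's start and end tags can be obtained by two consecutive calls, which guarantees they are distinct.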

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

tests/test-lexer.cpp (2)
69-78: Fix typo in documentation.

There's a typo in the documentation: "inptut" should be "input".
- * end of inptut token.
+ * end of input token.
200-200: Fix clang-format violation.

The line length exceeds the formatting rules. Consider breaking the line into multiple lines.
-        auto const* regex_ast_cat_ptr = dynamic_cast<RegexASTCatByte*>(schema_var_ast.m_regex_ptr.get());
+        auto const* regex_ast_cat_ptr
+                = dynamic_cast<RegexASTCatByte*>(schema_var_ast.m_regex_ptr.get());
🧰 Tools

🪛 GitHub Actions: lint

[error] 200-200: code should be clang-formatted [-Wclang-format-violations]
src/log_surgeon/finite_automata/RegexAST.hpp (1)
113-126: Rename method to align with new terminology.

The method name add_to_nfa_with_negative_tags should be renamed to add_to_nfa_with_negative_captures to maintain consistency with the new capture-based terminology.

Apply this diff to rename the method:
-    add_to_nfa_with_negative_tags(Nfa<TypedNfaState>* nfa, TypedNfaState* end_state) const -> void {
+    add_to_nfa_with_negative_captures(Nfa<TypedNfaState>* nfa, TypedNfaState* end_state) const -> void {

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 439895b and 27ef30b.

📒 Files selected for processing (6)

src/log_surgeon/Lexer.tpp (3 hunks)
src/log_surgeon/LexicalRule.hpp (1 hunks)
src/log_surgeon/finite_automata/Dfa.hpp (1 hunks)
src/log_surgeon/finite_automata/RegexAST.hpp (21 hunks)
src/log_surgeon/finite_automata/TaggedTransition.hpp (3 hunks)
tests/test-lexer.cpp (6 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

src/log_surgeon/finite_automata/Dfa.hpp
src/log_surgeon/Lexer.tpp
src/log_surgeon/finite_automata/TaggedTransition.hpp

🧰 Additional context used

📓 Path-based instructions (1)

**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/LexicalRule.hpp
tests/test-lexer.cpp
src/log_surgeon/finite_automata/RegexAST.hpp

🧠 Learnings (1)

src/log_surgeon/finite_automata/RegexAST.hpp (2)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.

🪛 GitHub Actions: lint

tests/test-lexer.cpp

[error] 200-200: code should be clang-formatted [-Wclang-format-violations]

⏰ Context from checks skipped due to timeout of 90000ms (2)

GitHub Check: build (ubuntu-22.04, Release)
GitHub Check: build (ubuntu-22.04, Debug)

🔇 Additional comments (12)

tests/test-lexer.cpp (4)

1-30: LGTM!

The includes and using declarations are well-organized and appropriate for the changes.

138-163: LGTM!

The implementation thoroughly validates token scanning and includes a clear TODO for future work.

306-320: LGTM!

The test case is well-structured and effectively tests basic lexer functionality.

322-360: LGTM!

The test case thoroughly validates capture group functionality and includes a clear TODO for future work.

src/log_surgeon/LexicalRule.hpp (1)

26-28: LGTM!

The implementation is correct and follows best practices:

Uses [[nodiscard]] to prevent ignoring the return value.

Returns a const reference to avoid unnecessary copying.

Correctly forwards the call to get_subtree_positive_captures().
src/log_surgeon/finite_automata/RegexAST.hpp (7)

22-24: LGTM!

The header file changes and documentation updates correctly reflect the transition from tags to captures.

Also applies to: 32-38

86-106: LGTM!

The member functions are correctly updated to handle captures:

get_subtree_positive_captures() returns a const reference to avoid copying.

set_subtree_positive_captures() and add_subtree_positive_captures() correctly manage the capture list.

set_negative_captures() correctly handles negative captures.

134-152: LGTM!

The serialization changes effectively use modern C++ features:

Range-based transformations for clean iteration.

fmt::format for string formatting.

Proper handling of empty captures.

155-156: LGTM!

The member variables are correctly updated to use Capture pointers.

647-664: Move null checks to constructor body.

Throwing exceptions in the member initializer list can lead to issues with object construction and exception safety.

Apply this diff to move the null checks into the constructor body:
-        : m_group_regex_ast{(
-              nullptr == group_regex_ast
-                      ? throw std::invalid_argument("Group regex AST cannot be null")
-                      : std::move(group_regex_ast)
-          )},
-          m_capture{
-              nullptr == capture ? throw std::invalid_argument("Capture cannot be null")
-                                 : std::move(capture)
-          } {
+        {
+            if (nullptr == group_regex_ast) {
+                throw std::invalid_argument("Group regex AST cannot be null");
+            }
+            if (nullptr == capture) {
+                throw std::invalid_argument("Capture cannot be null");
+            }
+            m_group_regex_ast = std::move(group_regex_ast);
+            m_capture = std::move(capture);
712-717: Rename methods to align with new naming convention.

For consistency with the new naming convention:

Apply this diff to rename the methods:
-    [[nodiscard]] auto get_group_name() const -> std::string_view { return m_capture->get_name(); }
+    [[nodiscard]] auto get_capture_name() const -> std::string_view { return m_capture->get_name(); }

-    [[nodiscard]] auto get_group_regex_ast(
+    [[nodiscard]] auto get_capture_regex_ast(
908-939: LGTM!

The NFA construction diagram accurately represents the structure with:

Positive tagged start transition for capture start.

Epsilon transition to nested NFA.

Negative tagged transition with negative captures.

Positive tagged end transition for capture end.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

src/log_surgeon/Lexer.hpp (2)

149-156: Consider using contains and at for map access.

Replace iterator-based map access with contains and at for consistency with other functions.

-    auto const tag_id_pair{m_capture_id_to_tag_id_pair.find(capture_id)};
-    if (m_capture_id_to_tag_id_pair.end() == tag_id_pair) {
-        return std::nullopt;
-    }
-    return tag_id_pair->second;
+    if (m_capture_id_to_tag_id_pair.contains(capture_id)) {
+        return m_capture_id_to_tag_id_pair.at(capture_id);
+    }
+    return std::nullopt;

164-171: Consider using contains and at for map access.

Replace iterator-based map access with contains and at for consistency with other functions.

-    auto const it{m_tag_to_register_id.find(tag_id)};
-    if (m_tag_to_register_id.end() == it) {
-        return std::nullopt;
-    }
-    return it->second;
+    if (m_tag_to_register_id.contains(tag_id)) {
+        return m_tag_to_register_id.at(tag_id);
+    }
+    return std::nullopt;

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 27ef30b and 2d77ee6.

📒 Files selected for processing (2)

src/log_surgeon/Lexer.hpp (5 hunks)
tests/test-lexer.cpp (6 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-lexer.cpp
src/log_surgeon/Lexer.hpp

⏰ Context from checks skipped due to timeout of 90000ms (2)

GitHub Check: build (ubuntu-22.04, Release)
GitHub Check: build (ubuntu-22.04, Debug)

🔇 Additional comments (6)

src/log_surgeon/Lexer.hpp (2)

136-142: LGTM!

The function is well-documented and uses safe map access methods.

178-197: LGTM!

The function is well-documented, follows coding guidelines, and uses modern C++ features effectively.

tests/test-lexer.cpp (4)

99-136: LGTM!

The function is well-documented, follows coding guidelines, and uses safe map operations.

138-163: LGTM! Note the TODO comment about register values.

The function is well-documented and has comprehensive assertions. The TODO comment indicates that additional tests for register values will be needed when simulation is implemented.

Please track the TODO comment at line 162 to ensure it's addressed when simulation is implemented.

307-321: LGTM!

The test case effectively verifies basic lexer functionality with both successful and unsuccessful matches.

323-361: LGTM! Note the TODO comment about TDFA's determinization.

The test case effectively verifies lexer functionality with capture groups. The TODO comment indicates that additional tests for register-related functions will be needed when TDFA's determinization is implemented.

Please track the TODO comment at lines 355-356 to ensure it's addressed when TDFA's determinization is implemented.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (5)

tests/test-lexer.cpp (5)
104-108: Simplify the delimiter collection loop using a range-based for loop.

The current loop can be simplified for better readability.
-    for (uint32_t i{0}; i < log_surgeon::cSizeOfByte; ++i) {
-        if (lexer.is_delimiter(i)) {
-            lexer_delimiters.push_back(i);
-        }
-    }
+    for (uint32_t i : std::views::iota(0U, log_surgeon::cSizeOfByte)) {
+        if (lexer.is_delimiter(i)) {
+            lexer_delimiters.push_back(i);
+        }
+    }
70-71: Fix typo in function documentation.

There's a typo in the documentation: "inptut" should be "input".
- * end of inptut token.
+ * end of input token.
162-162: Enhance TODO comment clarity.

The TODO comment should be more descriptive about what needs to be checked and why.
-    // TODO: add check for register values when simulation is implemented.
+    // TODO: Add verification of register values after implementing the NFA simulation to ensure correct state tracking.
355-356: Enhance TODO comment clarity in capture groups test.

The TODO comment should be more descriptive about what needs to be checked and why.
-    // TODO: Add check for `get_reg_id_from_tag_id` and `get_reg_ids_from_capture_id` when TDFA's
-    // determinization is implemented.
+    // TODO: Add verification of register ID mappings after implementing TDFA determinization to ensure
+    // correct register allocation for capture groups.
62-67: Enhance function documentation.

The documentation could be more detailed about:

The purpose of constant delimiters

The expected state of the lexer after initialization

Any side effects or assumptions
 /**
  * Initializes the lexer with the constant delimiters and the given schema.
+ * The constant delimiters (space and newline) are used to separate tokens in the input.
+ * The lexer's symbol mappings are initialized based on the schema variables.
+ * Assumes the lexer is in a clean state before initialization.
  * @param schema Contains the variables to add to the lexer.
  * @param lexer Returns the initialized parser.
  */

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2d77ee6 and 40cab9d.

📒 Files selected for processing (2)

src/log_surgeon/finite_automata/RegexAST.hpp (21 hunks)
tests/test-lexer.cpp (6 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-lexer.cpp
src/log_surgeon/finite_automata/RegexAST.hpp

🧠 Learnings (1)

src/log_surgeon/finite_automata/RegexAST.hpp (3)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

🪛 GitHub Actions: lint

src/log_surgeon/finite_automata/RegexAST.hpp

[error] 712-712: code should be clang-formatted

⏰ Context from checks skipped due to timeout of 90000ms (2)

GitHub Check: build (ubuntu-22.04, Release)
GitHub Check: build (ubuntu-22.04, Debug)

🔇 Additional comments (8)

tests/test-lexer.cpp (1)

307-321: Consider adding edge cases to basic lexer tests.

The current test case only verifies basic token matching. Consider adding tests for:

Empty input

Input with multiple delimiters

Input with special characters
src/log_surgeon/finite_automata/RegexAST.hpp (7)

23-23: LGTM! Documentation updates align with the new capture-based system.

The header file change and documentation updates accurately reflect the transition from tags to captures.

Also applies to: 33-38

86-102: LGTM! Methods updated consistently to use captures.

The changes maintain the same functionality while transitioning from tags to captures, with consistent naming conventions.

Also applies to: 104-106, 115-117

633-634: LGTM! Documentation clearly states non-null requirements.

The documentation accurately reflects the requirements for both m_capture and m_capture_regex_ast.

648-664: Move exception handling from initializer list to constructor body.

Throwing exceptions in the member initializer list can lead to issues with object construction and exception safety.

Apply this diff to move the null checks into the constructor body:
-            : m_capture_regex_ast{(
-                      nullptr == capture_regex_ast
-                              ? throw std::invalid_argument("Group regex AST cannot be null")
-                              : std::move(capture_regex_ast)
-              )},
-              m_capture{
-                      nullptr == capture ? throw std::invalid_argument("Capture cannot be null")
-                                         : std::move(capture)
-              } {
+            {
+            if (nullptr == capture_regex_ast) {
+                throw std::invalid_argument("Group regex AST cannot be null");
+            }
+            if (nullptr == capture) {
+                throw std::invalid_argument("Capture cannot be null");
+            }
+            m_capture_regex_ast = std::move(capture_regex_ast);
+            m_capture = std::move(capture);

             RegexAST<TypedNfaState>::set_subtree_positive_captures(
                     m_capture_regex_ast->get_subtree_positive_captures()
             );
             RegexAST<TypedNfaState>::add_subtree_positive_captures({m_capture.get()});
         }
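
The constructor-body form described in this suggestion can be sketched in isolation as follows. The types `RegexNode`, `Capture`, and `CaptureNode` are simplified, hypothetical stand-ins for the real `RegexAST` hierarchy; only the null-check pattern and the exception messages are taken from the diff above.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins for the real AST and capture types.
struct RegexNode {};

struct Capture {
    std::string name;
};

class CaptureNode {
public:
    CaptureNode(std::unique_ptr<RegexNode> capture_regex_ast, std::unique_ptr<Capture> capture) {
        // Validate both arguments before taking ownership of either.
        if (nullptr == capture_regex_ast) {
            throw std::invalid_argument("Group regex AST cannot be null");
        }
        if (nullptr == capture) {
            throw std::invalid_argument("Capture cannot be null");
        }
        m_capture_regex_ast = std::move(capture_regex_ast);
        m_capture = std::move(capture);
    }

    [[nodiscard]] auto get_capture_name() const -> std::string const& { return m_capture->name; }

private:
    // Default-initialized to nullptr, then assigned in the constructor body.
    std::unique_ptr<RegexNode> m_capture_regex_ast;
    std::unique_ptr<Capture> m_capture;
};
```

One trade-off of this form is that the members are default-initialized first and assigned afterwards, which is cheap for `std::unique_ptr` but would not work for members that lack a default constructor.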
668-673: LGTM! Copy constructor correctly handles deep copying.

The copy constructor properly clones both the regex AST and capture objects, maintaining the integrity of the capture hierarchy.

914-934: LGTM! NFA construction diagram and code updated consistently.

The diagram and implementation correctly reflect the new capture-based transition system.

Also applies to: 941-941, 947-947

953-960: LGTM! Serialization correctly handles capture names.

The serialization logic properly converts capture names to UTF-32 strings and maintains consistent formatting.

src/log_surgeon/finite_automata/RegexAST.hpp

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (4)

src/log_surgeon/Lexer.tpp (1)

408-408: Address the TODO comment about DFA ignoring captures.

The TODO comment indicates that DFA treats capture groups differently than intended. This could lead to incorrect parsing behavior.

Do you want me to help track this issue by opening a new issue?
src/log_surgeon/finite_automata/TaggedTransition.hpp (1)
17-17: Documentation needs update.

The comment should be updated to reflect that a tag ID (not a tag) has been matched.
- * Represents an NFA transition indicating that a tag has been matched.
+ * Represents an NFA transition indicating that a tag ID has been matched.
src/log_surgeon/finite_automata/RegexAST.hpp (2)
633-634: Documentation needs update.

The note about non-null expectations should be more specific.
- * - `m_capture` is always expected to be non-null.
- * - `m_capture_regex_ast` is always expected to be non-null.
+ * - `m_capture` must be non-null as it represents the capture group being matched.
+ * - `m_capture_regex_ast` must be non-null as it contains the regex pattern for the capture group.
916-916: Update diagram comments.

The diagram comments should reflect the transition from tags to captures.
-                    | `m_capture` start
+                    | `m_capture` start ID
-                    | `m_capture` end
-                    | (positive tagged end transition)
+                    | `m_capture` end ID
+                    | (positive capture end transition)
Also applies to: 935-936

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 40cab9d and ef3bc8e.

📒 Files selected for processing (9)

CMakeLists.txt (2 hunks)
src/log_surgeon/Aliases.hpp (1 hunks)
src/log_surgeon/Lexer.hpp (5 hunks)
src/log_surgeon/Lexer.tpp (3 hunks)
src/log_surgeon/finite_automata/Nfa.hpp (3 hunks)
src/log_surgeon/finite_automata/NfaState.hpp (3 hunks)
src/log_surgeon/finite_automata/RegexAST.hpp (21 hunks)
src/log_surgeon/finite_automata/TaggedTransition.hpp (3 hunks)
tests/CMakeLists.txt (2 hunks)

✅ Files skipped from review due to trivial changes (1)

src/log_surgeon/Aliases.hpp

🚧 Files skipped from review as they are similar to previous changes (3)

tests/CMakeLists.txt
src/log_surgeon/finite_automata/NfaState.hpp
src/log_surgeon/finite_automata/Nfa.hpp

🧰 Additional context used

📓 Path-based instructions (1)

**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/TaggedTransition.hpp
src/log_surgeon/Lexer.hpp
src/log_surgeon/finite_automata/RegexAST.hpp

🧠 Learnings (2)

src/log_surgeon/finite_automata/TaggedTransition.hpp (2)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.

src/log_surgeon/finite_automata/RegexAST.hpp (3)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

🔇 Additional comments (17)

src/log_surgeon/Lexer.hpp (7)

7-7: LGTM!

The added includes are necessary for the new functionality.

Also applies to: 10-10, 13-13

130-136: LGTM!

The function uses safe map access patterns and follows the coding guidelines.

143-150: LGTM!

The function uses safe map access patterns and follows the coding guidelines.

158-165: LGTM!

The function uses safe map access patterns and follows the coding guidelines.

172-191: LGTM!

The function uses safe map access patterns, follows the coding guidelines, and has proper error handling.

193-194: Move public member variables to private scope.

These member variables should be private to maintain encapsulation. This issue is tracked in #93.

219-221: LGTM!

The member variables use appropriate types and follow naming conventions.

src/log_surgeon/Lexer.tpp (3)

7-7: LGTM!

The added include is necessary for throwing std::invalid_argument.

363-364: LGTM!

The method signatures now use the more specific rule_id_t type instead of uint32_t.

Also applies to: 370-371

382-407: LGTM!

The method properly processes captures and their IDs, uses safe map access patterns, follows the coding guidelines, and has proper error handling.

CMakeLists.txt (1)

65-86: LGTM! Source files list updated correctly.

The changes properly reflect the transition from tag-based to capture-based system by:

Adding new files (Aliases.hpp, Capture.hpp) for the new functionality

Removing obsolete files (Tag.hpp)

Maintaining correct file order

Also applies to: 111-111

src/log_surgeon/finite_automata/TaggedTransition.hpp (3)

8-9: LGTM! Header includes updated correctly.

Added necessary headers for the new implementation:

<utility> for std::move

<vector> for std::vector

log_surgeon/Aliases.hpp for tag_id_t

Also applies to: 13-13

23-25: LGTM! PositiveTaggedTransition updated correctly.

The changes properly transition from Tag pointers to tag IDs:

Constructor now takes tag_id_t

Member variable uses tag_id_t

Also applies to: 44-44

55-57: LGTM! NegativeTaggedTransition updated correctly.

The changes properly transition from Tag pointers to tag IDs:

Constructor now takes vector of tag_id_t

Member variable uses vector<tag_id_t>

Correctly uses std::move for efficiency

Also applies to: 76-76
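
To make the shape of these two classes concrete, here is a minimal sketch of transitions that store tag IDs rather than `Tag` pointers. The class names and `tag_id_t` come from the review; the `NfaState` placeholder, the accessor names, the member names, and the 32-bit ID type are assumptions for illustration.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

using tag_id_t = std::uint32_t;  // assumed underlying type

struct NfaState {};  // hypothetical placeholder for the templated NFA state type

// A transition indicating that a single tag has been matched.
class PositiveTaggedTransition {
public:
    PositiveTaggedTransition(tag_id_t tag_id, NfaState const* dest_state)
            : m_tag_id(tag_id),
              m_dest_state(dest_state) {}

    [[nodiscard]] auto get_tag_id() const -> tag_id_t { return m_tag_id; }

    [[nodiscard]] auto get_dest_state() const -> NfaState const* { return m_dest_state; }

private:
    tag_id_t m_tag_id;
    NfaState const* m_dest_state;
};

// A transition indicating that a set of tags has been left unmatched.
class NegativeTaggedTransition {
public:
    // The vector is taken by value and moved into the member to avoid a second copy.
    NegativeTaggedTransition(std::vector<tag_id_t> tag_ids, NfaState const* dest_state)
            : m_tag_ids(std::move(tag_ids)),
              m_dest_state(dest_state) {}

    [[nodiscard]] auto get_tag_ids() const -> std::vector<tag_id_t> const& { return m_tag_ids; }

    [[nodiscard]] auto get_dest_state() const -> NfaState const* { return m_dest_state; }

private:
    std::vector<tag_id_t> m_tag_ids;
    NfaState const* m_dest_state;
};
```

Storing plain integer IDs keeps the transitions trivially copyable where possible and removes any lifetime dependency on the objects that own the captures.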

src/log_surgeon/finite_automata/RegexAST.hpp (3)

23-23: LGTM! Header includes updated correctly.

Replaced Tag.hpp with Capture.hpp to reflect the new implementation.

33-38: LGTM! Documentation updated correctly.

The class documentation properly reflects the transition from tags to captures.

86-101: LGTM! Base class methods updated correctly.

Methods properly handle Capture pointers instead of Tag pointers:

get_subtree_positive_captures

set_subtree_positive_captures

add_subtree_positive_captures

src/log_surgeon/finite_automata/RegexAST.hpp

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

src/log_surgeon/finite_automata/RegexAST.hpp (2)
118-118: Consider using empty() instead of comparing with false.

The condition false == m_negative_captures.empty() can be simplified to !m_negative_captures.empty() for better readability.
-        if (false == m_negative_captures.empty()) {
+        if (!m_negative_captures.empty()) {
910-940: Update diagram comments to reflect capture-based terminology.

The diagram's comments still use tag-based terminology in some places. Consider updating them to consistently use capture-based terminology.
-    // The NFA constructed for a capture group follows the structure below, with tagged transitions
+    // The NFA constructed for a capture group follows the structure below, with capture transitions
-    //                    | (positive tagged start transition)
+    //                    | (positive capture start transition)
-    //                    | (negative tagged transition)
+    //                    | (negative capture transition)
-    //                    | (positive tagged end transition)
+    //                    | (positive capture end transition)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5e2e53 and ddae224.

📒 Files selected for processing (3)

src/log_surgeon/Lexer.hpp (5 hunks)
src/log_surgeon/finite_automata/RegexAST.hpp (21 hunks)
src/log_surgeon/finite_automata/TaggedTransition.hpp (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

src/log_surgeon/finite_automata/TaggedTransition.hpp
src/log_surgeon/Lexer.hpp

🧰 Additional context used

📓 Path-based instructions (1)

**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexAST.hpp

🧠 Learnings (1)

src/log_surgeon/finite_automata/RegexAST.hpp (3)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

⏰ Context from checks skipped due to timeout of 90000ms (2)

GitHub Check: build (ubuntu-22.04, Release)
GitHub Check: build (ubuntu-22.04, Debug)

🔇 Additional comments (2)

src/log_surgeon/finite_automata/RegexAST.hpp (2)

23-23: LGTM! Documentation and includes updated to reflect the new capture-based system.

The changes consistently replace tag-based terminology with capture-based terminology throughout the documentation and includes.

Also applies to: 32-38

728-728: LGTM! Consistent use of capture-based serialization across all derived classes.

The changes to use serialize_negative_captures are applied consistently throughout all derived classes.

Also applies to: 745-745, 776-776, 807-807, 840-840, 903-903, 962-962, 1118-1118

LinZhihao-723

Last few comments and we're close to merge.

tests/test-lexer.cpp

src/log_surgeon/Aliases.hpp

SharafMohamed and others added 30 commits December 2, 2024 19:56

Linter.

de58e08

rename to reg_id.

1426179

Rename to reg_id.

3301f14

Use at().

c9b1369

Remove Register class and use uint32_t instead; Rename vers to xxx_reg_id; Remove error checking in favor of using .at().

e2aee66

Rename to reg_id.

36c1810

Remove unused header.

48df8b0

Change pred index to be optional and nullopt for root.

a8605fc

Add and use node_id_t.

15cb1b6

Add position_t.

6b787d0

Change to id_t.

cd8f4e3

Add is_root().

72da50c

Add missing header.

3fc7ea7

Update PrefixTree docstring.

6443d66

Removing node docstring as its redundant.

63aec4d

Combine private section in PrefixTree.

295f3ee

Add missing header; Remove copy paste error.

1186666

Rename to node_id and parent_node_id.

06ee38e

Update get_reversed_positions' docstring.

e103011

Update get_reversed positions' docstring to clarify exlcusivity of the root.

31b0346

Grammar fix.

4005e41

Add maybe_unusued.

e38940c

Update src/log_surgeon/finite_automata/RegisterHandler.hpp

d71368d

Co-authored-by: Lin Zhihao <[email protected]>

Update test case names to document code names better.

dd4b6e1

Implicitily use auto in vectors.

7322852

Explicitily use position_t for vectors.

dba1a18

Update tests/test-register-handler.cpp

ee6efab

Co-authored-by: Lin Zhihao <[email protected]>

Switch to size_t.

9ba980c

Clang-tidy: Remove magic numbers + Fix headers.

27b324c

Reduce complexity for clang-tidy.

f651a24

coderabbitai bot reviewed Feb 10, 2025

Run linter.

2d77ee6

coderabbitai bot reviewed Feb 10, 2025

Rename group to capture.

40cab9d

coderabbitai bot reviewed Feb 10, 2025

src/log_surgeon/finite_automata/RegexAST.hpp (outdated, resolved)

SharafMohamed added 3 commits February 10, 2025 09:25

Add Aliases.hpp; Lint.

df1c9df

Add aliases.

ef3bc8e

Rename register to reg.

58a948e

coderabbitai bot reviewed Feb 10, 2025

src/log_surgeon/finite_automata/RegexAST.hpp (outdated, resolved)

SharafMohamed added 13 commits February 10, 2025 09:40

Remove erroneous const.

481b997

Use try_emplace.

ea724e6

Make sure declaration matches definition.

b326a92

Fix typo.

0693c1b

Rename add_to_nfa_with_negative_captures.

85724cb

Improve TODO.

5261293

Update initialize lexer docstring.

19856a1

Update docstring.

5a72ad5

Improve docstring.

1217fe6

Update diagram with ID.

d5e2e53

Add nullptr checks to body of constructor.

ebb6abc

Use contains in Lexer.hpp.

630749a

Switch find to contains in TaggedTransition.hpp.

ddae224

SharafMohamed requested a review from LinZhihao-723 February 10, 2025 15:36

coderabbitai bot reviewed Feb 10, 2025

Add missing header.

a2d1825

LinZhihao-723 requested changes Feb 10, 2025

tests/test-lexer.cpp (outdated, resolved)

src/log_surgeon/Aliases.hpp (outdated, resolved)

SharafMohamed added 3 commits February 11, 2025 18:50

Rename Aliases.hpp to types.hpp.

983ddcb

initialize_lexer now creates and returns the lexer.

303de17

Fix compiler error.

9e99f40

SharafMohamed requested a review from LinZhihao-723 February 12, 2025 00:21
