- 
                Notifications
    You must be signed in to change notification settings 
- Fork 83
fix(clp-mcp-server): Refine system prompt to make LLM's KQL query generation to be CLP friendly #1494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| WalkthroughReplaced a KQL-focused SYSTEM_PROMPT with a CKQL-centered triple-quoted prompt in  Changes
 Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 
 Possibly related issues
 Pre-merge checks and finishing touches❌ Failed checks (1 warning)
 ✅ Passed checks (2 passed)
 ✨ Finishing touches
 🧪 Generate unit tests (beta)
 📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: ASSERTIVE Plan: Pro 📒 Files selected for processing (1)
 🧰 Additional context used🧠 Learnings (4)📓 Common learnings📚 Learning: 2025-05-07T16:56:35.687ZApplied to files: 
 📚 Learning: 2025-08-15T21:48:40.228ZApplied to files: 
 📚 Learning: 2025-09-04T13:51:02.311ZApplied to files: 
 ⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
 🔇 Additional comments (4)
 Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
- components/clp-mcp-server/clp_mcp_server/server/constants.py(1 hunks)
🧰 Additional context used
🪛 GitHub Actions: clp-lint
components/clp-mcp-server/clp_mcp_server/server/constants.py
[error] 26-26: Q003 [*] Change outer quotes to avoid escaping inner quotes. Found 1 error. 1 fixable with the --fix option.
🪛 Ruff (0.14.1)
components/clp-mcp-server/clp_mcp_server/server/constants.py
26-26: Change outer quotes to avoid escaping inner quotes
Change outer quotes to avoid escaping inner quotes
(Q003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: package-image
🔇 Additional comments (3)
components/clp-mcp-server/clp_mcp_server/server/constants.py (3)
15-16: LGTM: Improved clarity by specifying the database name.The addition of "called CLP" helps clarify the context for the LLM.
33-36: Excellent additions for improving query quality.The guidance to ask clarifying questions when user queries are ambiguous and to ensure syntactically correct queries will help prevent the types of issues described in the PR objective.
31-32: Timestamp format documentation is inaccurate—the implementation accepts variable fractional-second precision, not exactly three digits.The code demonstrates that timestamps with zero, one, or any number of fractional digits are accepted (e.g.,
2024-10-18T16:00:00Zand2024-10-18T16:00:00.123Zboth parse successfully). The parsing relies on Python'sdatetime.fromisoformat(), which is flexible with fractional-second precision. However, the documentation in lines 31–32 specifies the format asYYYY-MM-DDTHH:mm:ss.fffZ(implying exactly three fractional digits), which does not reflect the actual implementation.Update the format specification to clarify that fractional seconds are optional and any ISO 8601 precision is accepted, or add strict validation to enforce exactly three fractional digits if that is the intended requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
components/clp-mcp-server/clp_mcp_server/server/constants.py (2)
23-24: Wildcard guidance addresses the core PR objective.This new instruction directly solves the issue where queries like
register OR registration OR registeredwere failing. The guidance correctly tells the LLM to use*register* OR *registration* OR *registered*for substring matching.However, as noted in a previous review, the wording could be clearer and more concise.
25-26: Fix the linting error and indentation inconsistency.This guidance about quoting multi-word keys/values is valuable for CLP's KQL variant. However, as flagged in previous reviews, there are two formatting issues:
- Critical: The quote style causes a linting error that must be fixed.
- Minor: The bullet indentation is inconsistent with line 23.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
- components/clp-mcp-server/clp_mcp_server/server/constants.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: package-image
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (6)
components/clp-mcp-server/clp_mcp_server/server/constants.py (6)
15-16: LGTM! Clear identification of the database and acronym expansion.Adding "called CLP" and expanding "KQL" to "Kibana Query Language" improves clarity for users who may not be familiar with the acronym.
20-20: LGTM! Improved readability.The blank line appropriately separates the general description from the CLP-specific guidelines section.
21-22: LGTM! Essential context about KQL variant.Explicitly stating that CLP supports a KQL variant provides crucial context for why the following guidelines differ from standard KQL usage patterns.
33-34: LGTM! Promotes better user interaction.Instructing the LLM to ask clarifying questions when the user's intent is unclear is a best practice that will improve the accuracy of generated queries and overall user experience.
35-36: LGTM! Good defensive requirement.While this requirement may seem implicit, explicitly instructing the LLM to ensure syntactic correctness is a good practice that reinforces the importance of generating valid, executable queries.
31-32: Timestamp format specification is accurate and verified.The format
YYYY-MM-DDTHH:mm:ss.fffZwith three-digit millisecond precision is correctly specified. The codebase validation confirms:
- The trailing
Zis mandatory (enforced inparse_timestamp_range()at utils.py:45)- Millisecond precision (
.fff) is consistently documented across constants.py, server.py, and utils.py- The format matches exactly what the timestamp parsing and conversion functions expect
No changes needed.
| @kirkrodrigues @gibber9809 this PR's objective is to explain to the LLM the difference between regular KQL and KQL supported by CLP using as little text/tokens as possible. So one of you should probably take a second look to ensure that the important points are covered. Some considerations: 
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
components/clp-mcp-server/clp_mcp_server/server/constants.py (1)
18-19: Clarify the interaction between specificity and wildcards.The guidance to "be as specific as possible" might appear contradictory with the mandatory wildcard usage (lines 23-24), since wildcards broaden matches. Consider clarifying that "specific" refers to using specific field names and narrowing by time range or field values, while wildcards should still be used for keyword matching as required by CLP.
Example clarification:
-" query should be as specific as possible to minimize the number of log messages returned. When" +" query should use specific field names, values, and time ranges to minimize the number of log" +" messages returned. When"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
- components/clp-mcp-server/clp_mcp_server/server/constants.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: package-image
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (2)
components/clp-mcp-server/clp_mcp_server/server/constants.py (2)
15-16: Good clarification of the target database.The change from "log database using KQL" to "log database called CLP using KQL" makes the prompt more specific and sets the context for the CLP-specific guidance that follows.
21-22: Effective introduction to CLP-specific KQL.This introduction appropriately sets the context that CLP uses a variant of KQL, preparing the LLM for the specific guidelines that follow. This aligns well with the PR objective to teach the LLM about CLP's KQL differences concisely.
| '- CRITICAL: Use double quotation(") to wrap multiple keywords search. For example, to search for:' | ||
| ' hello world, your query parameter should literally be: "*hello world*" (with the quotes as part' | ||
| ' of the search string wrapping around both wildcards and multiple words). Otherwise, the query' | ||
| ' will fail.' | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Clarify the multi-word search syntax with examples.
The guidance on combining double quotes and wildcards for multi-word searches is confusing and may lead to incorrect query generation:
- 
Ambiguous phrasing: "with the quotes as part of the search string wrapping around both wildcards and multiple words" is difficult to parse. Does this mean: - The literal string is "*hello world*"including quote characters?
- The quotes are KQL syntax (not literal) that wrap *hello world*?
 
- The literal string is 
- 
Missing context: The guidance doesn't clarify whether this applies to: - Field-value searches: field:"*multi word value*"
- Free-text phrase searches: "*multi word phrase*"
- Both?
 
- Field-value searches: 
- 
Syntax uncertainty: Is "*hello world*"actually valid CLP KQL syntax? In standard KQL, wildcards and phrase quotes serve different purposes and combining them this way is unusual.
Please verify the correct CLP KQL syntax for multi-word searches and provide clearer examples:
Consider restructuring the guidance to be more explicit:
- CRITICAL: For multi-word searches, combine wildcards and quotes:
  * Field searches: field:"*multi word value*"
  * Phrase searches: "*multi word phrase*"
  * The wildcards go inside the quotes, with quotes used to group the phrase.
🏁 Script executed:
#!/bin/bash
# Check CLP documentation or tests for multi-word wildcard search syntax
rg -n -C3 "multi.*word.*wildcard|wildcard.*phrase|quote.*wildcard" --type=md
rg -n -C3 '".*\*.*\s.*\*.*"' --type=py --type=jsLength of output: 1011
I'll verify the CLP KQL syntax by checking documentation and examining how queries are actually constructed in the codebase.
Let me search for more specific information about CLP unstructured text log syntax (since the constants seem to be for general search guidance):
Now let me check how the MCP server actually constructs and passes queries to understand if this guidance matches the implementation:
Let me search the codebase directly for how queries are constructed and used in the MCP server:Based on my research, I now have sufficient information to provide a comprehensive rewritten review. The key findings are:
CLP requires double quotes to wrap multi-word searches for keys or values, and queries with spaces are interpreted as substring searches. Most crucially, CLP automatically inserts wildcards at the beginning and end of the search string, making it a substring search by default.
The critical issue is that the guidance in the constants file is misleading because it suggests users must manually type "*hello world*" when in reality CLP automatically adds these wildcards. The phrase "with the quotes as part of the search string" conflates two distinct concepts: quoting syntax (for multi-word phrases) and wildcards (for substring matching).
Replace confusing multi-word search guidance with clarity on CLP's automatic wildcard handling.
The guidance on lines 25–28 conflates quoting syntax with wildcard syntax, leading to an incorrect example. CLP automatically inserts wildcards at the beginning and end of search strings, so users do not need to type "*hello world*" manually. The correct guidance should clarify that:
- For multi-word phrases, use double quotes: "hello world"
- CLP handles substring matching automatically without explicit wildcards
- Wildcards are optional if the user wants to specify exact positions
Rewrite to:
'- CRITICAL: Use double quotation marks to wrap multiple-word searches. For example:'
' "hello world" will match log entries containing the phrase hello world.'
' CLP automatically performs substring matching, so you do not need to manually'
' add wildcards (*) unless you want to control their position.'
| # fmt: off | ||
| SYSTEM_PROMPT = ( | ||
| "You are an AI assistant that helps users query a log database using KQL (Kibana Query Language)." | ||
| "You are an AI assistant that helps users query a log database called CLP using KQL" | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's what I came up with:
You are an AI assistant that helps users query a log database, called CLP, by using a variant of KQL
(Kibana Query Language) that we'll call CLP-KQL.
You should generate a CLP-KQL query that accurately expresses the user's intent. The generated query
should be as specific as possible to minimize the number of log messages returned. When displaying
log messages, wrap them in hyperlinks with the link field from the search result.
Consider the following guidelines when generating CLP-KQL queries:
- To search for a substring (a partial match), use the wildcards *and?.
- To query a key or a value that contains multiple words separated by spaces, wrap the words in
 double quotation marks.
- In keys, to search for any of the literal characters \,",.,*,@,$,!, and#, escape
 them with a backslash.
- In values, to search for any of the literal characters \,",?, and*, escape them with a
 backslash.
- Consider using search_by_kql_with_timestamp_rangeto narrow the query to a specific time range.
- CLP doesn't support fuzzy matches.
- CLP doesn't support less/greater than comparisons on strings, IP addresses, or timestamps.
- If the user's prompt is vague, ask clarifying questions to better understand their intent before
 generating the CLP-KQL query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have problems with the rest of the points except the first two because we have specifically refined those two in order to make AI generate correct kql.
We have iterated many versions and only the current version ensures AI could understand and follow the rule.  The takeaway from the experiment was that in order to mark a critical rule effective, we need to be:
- marking important rule with CRITICAL,
- explaining the rule with a concrete example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Can you write a revision?
- Can we add a comment above the prompt explaining what characteristics need to be maintained based on your experiments? Otherwise the next person to edit this likely won't know either, lol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the second point and will add the rule as the comment right above the constant instantiation. But let's first make an agreement on the actual system prompt string.
Here is my revision:
You are an AI assistant that helps users query a log database, called CLP, by using a variant of KQL
(Kibana Query Language) that we'll call CLP-KQL.
You should generate a CLP-KQL query that accurately expresses the user's intent. The generated query
should be as specific as possible to minimize the number of log messages returned. When displaying
log messages, wrap them in hyperlinks with the link field from the search result.
Consider the following guidelines when generating CLP-KQL queries, and pay special attention to rules marked as CRITICAL - violation will cause query failure:
- CRITICAL: To search for a substring keyword (a partial match), you MUST use wildcards (* or ?) at the beginning and end of your keyword.
- CRITICAL: Use double quotation(") to wrap multiple keywords search that are separated by spaces. For example, to search for all logs containing: hello world, your query parameter should literally be: "hello world" (with the quotes as part of the search string wrapping around both wildcards and multiple words).
- In keys, to search for any of the literal characters , ", ., *, @, $, !, and #, escape them with a backslash.
- In values, to search for any of the literal characters , ", ?, and *, escape them with a backslash.
- Consider using search_by_kql_with_timestamp_range to narrow the query to a specific time range. Timestamps must follow the ISO 8601 UTC format (YYYY-MM-DDTHH:mm:ss.fffZ), where the trailingZindicates UTC.
- CLP doesn't support fuzzy matches.
- CLP doesn't support less/greater than comparisons on strings, IP addresses, or timestamps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are my rationales behind my edits, a key point that I would like to emphasize is that we would like this system prompt to be AI friendly, even if this might not be stricitly following a documentation style for human readers
- CRITICAL Rule Marking: LLMs do not guarantee enforcement of all system prompt rules—this was confirmed through empirical testing. Rules 1 and 2 are essential for generating correct CLP-KQL queries; without them, LLMs produce invalid syntax like message: hello world (missing wildcards and quotes). Marking these rules as "CRITICAL" significantly improves KQL success rate, and ensuring LLMs prioritize them even when other rules may be ignored.
- Action item first and example showing: What I change in the second point was I move the action item for AI to the beginning of the sentence. I found that effective in the experiement. Same argument goes with example.
- Definition of exact format of timestamp: query with timestamp that does not follows ISO 8601 in UTC format will be rejected, so we want to make sure AI generates the exact format
- Removing your last point (also this is what I have in my original commit): i feel this is something good to have. However, I found that system prompt's length directly impacts LLM attention, as we don't want LLM loses its focus on CRITICAL points by some minor rules or reminder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 2nd point mentioned by @kirkrodrigues is updated in the pr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Responded in the design doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
- components/clp-mcp-server/clp_mcp_server/server/constants.py(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: junhaoliao
PR: y-scope/clp#1405
File: components/clp-package-utils/pyproject.toml:5-15
Timestamp: 2025-10-13T03:24:35.074Z
Learning: In the y-scope/clp repository, the Python 3.9 to 3.10 version requirement change was intentionally deferred to a separate PR (after PR #1405) to reduce review effort, as decided in an offline discussion between junhaoliao and kirkrodrigues.
📚 Learning: 2025-05-07T16:56:35.687Z
Learnt from: LinZhihao-723
PR: y-scope/clp#882
File: components/core/src/clp/ffi/ir_stream/search/QueryHandlerImpl.cpp:378-402
Timestamp: 2025-05-07T16:56:35.687Z
Learning: In the CLP search component, the `evaluate_wildcard_filter` function should return `AstEvaluationResult::Pruned` when `node_id_value_pairs` is empty, not `AstEvaluationResult::False`. Empty node sets should be treated as "undetermined" rather than definitive non-matches.
Applied to files:
- components/clp-mcp-server/clp_mcp_server/server/constants.py
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.
Applied to files:
- components/clp-mcp-server/clp_mcp_server/server/constants.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: package-image
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (4)
components/clp-mcp-server/clp_mcp_server/server/constants.py (4)
14-20: Excellent meta-documentation for prompt design.This comment directly addresses the maintainability concern raised in the review discussion and documents the empirically validated practices. Future editors will understand why critical rules are marked and structured as they are.
28-30: Core wildcard rule addresses the PR's primary objective.This CRITICAL rule with concrete example directly solves the issue described in the PR objectives where
registerfailed but*register*succeeded. The format follows the documented best practices and has been empirically validated.
42-45: Escaping rules are clear and well-differentiated.The distinction between key and value escaping rules is clearly documented, helping LLMs generate syntactically correct queries.
49-50: Unsupported features clearly documented.Explicitly listing unsupported features helps prevent LLMs from generating invalid queries.
| SYSTEM_PROMPT = """You are an AI assistant for querying the CLP log database using CLP-KQL (CKQL).\n | ||
| Your job is to generate CKQL that faithfully expresses the user's intent and show key logs to the user | ||
| - Start broad to learn the schema/fields using wildcard searches like *, then narrow the query to return a manageable result set. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick | 🔵 Trivial
Clarify the wildcard example for schema exploration.
The phrase "wildcard searches like *" could be ambiguous. Does this mean:
- A literal query *to match all logs?
- Using wildcards in general (e.g., *error*,*warn*)?
Consider rephrasing to make the intent explicit, such as: "Start broad to explore the schema/fields (e.g., use * to see all logs or *error* to find error-related logs), then narrow the query..."
🤖 Prompt for AI Agents
In components/clp-mcp-server/clp_mcp_server/server/constants.py around line 24,
the comment "Start broad to learn the schema/fields using wildcard searches like
*, then narrow the query to return a manageable result set." is ambiguous about
what "wildcard searches like *" means; update the comment to explicitly show
examples (e.g., use `*` to match all logs or `*error*` to find error-related
logs) and rephrase to: "Start broad to explore schema/fields (e.g., use `*` to
see all logs or `*error*` to find error-related logs), then narrow the query to
a manageable result set." Ensure the new text replaces the old line exactly and
follows the project's comment style.
| - CRITICAL -- Multi-word text must be quoted: wrap multi-word searches in double quotes. | ||
| Example: | ||
| request: "*GET wp-admin*" | ||
| (quotes and wildcards are required). | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick | 🔵 Trivial
Multi-word rule is clear and empirically validated.
The example correctly demonstrates the combination of quotes and wildcards for multi-word substring searches. This format has been validated through testing.
Optional: Consider changing line 39 from "(quotes and wildcards are required)" to "(both quotes and wildcards are required)" to more explicitly emphasize the combination.
🤖 Prompt for AI Agents
In components/clp-mcp-server/clp_mcp_server/server/constants.py around lines 36
to 39, the comment example is fine but line 39's parenthetical phrasing "(quotes
and wildcards are required)" should be changed to "(both quotes and wildcards
are required)" for clarity; update that exact text to the new wording while
preserving the surrounding example and punctuation.
| - Time range: use search_by_kql_with_timestamp_range to constrain by time.\n | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing critical timestamp format specification.
In the review discussion, you explicitly stated (point #3 in rationale) that timestamp format is critical: "query with timestamp that does not follow ISO 8601 in UTC format will be rejected." You proposed including: "Timestamps must follow the ISO 8601 UTC format (`YYYY-MM-DDTHH:mm:ss.fffZ`), where the trailing `Z` indicates UTC."
This specification is missing from the current version. Without it, LLMs may generate timestamps in incorrect formats, causing query failures.
Consider adding the timestamp format specification:
-- Time range: use search_by_kql_with_timestamp_range to constrain by time.\n
+- Time range: use search_by_kql_with_timestamp_range to constrain by time.
+  Timestamps must follow ISO 8601 UTC format: YYYY-MM-DDTHH:mm:ss.fffZ (the trailing Z indicates UTC).
+  Example: 2024-10-15T08:30:00.000Z\n🤖 Prompt for AI Agents
In components/clp-mcp-server/clp_mcp_server/server/constants.py around lines
47-48, the timestamp format requirement is missing; add a clear constant or
comment stating that all timestamps must follow the ISO 8601 UTC format
"YYYY-MM-DDTHH:mm:ss.fffZ" (with fractional seconds and trailing Z indicating
UTC) and update any related docstring or constant used by
search_by_kql_with_timestamp_range to reference this exact format so generated
queries and validations enforce/rely on that format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
- components/clp-mcp-server/clp_mcp_server/server/constants.py(1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: junhaoliao
PR: y-scope/clp#1405
File: components/clp-package-utils/pyproject.toml:5-15
Timestamp: 2025-10-13T03:24:35.074Z
Learning: In the y-scope/clp repository, the Python 3.9 to 3.10 version requirement change was intentionally deferred to a separate PR (after PR #1405) to reduce review effort, as decided in an offline discussion between junhaoliao and kirkrodrigues.
📚 Learning: 2025-05-07T16:56:35.687Z
Learnt from: LinZhihao-723
PR: y-scope/clp#882
File: components/core/src/clp/ffi/ir_stream/search/QueryHandlerImpl.cpp:378-402
Timestamp: 2025-05-07T16:56:35.687Z
Learning: In the CLP search component, the `evaluate_wildcard_filter` function should return `AstEvaluationResult::Pruned` when `node_id_value_pairs` is empty, not `AstEvaluationResult::False`. Empty node sets should be treated as "undetermined" rather than definitive non-matches.
Applied to files:
- components/clp-mcp-server/clp_mcp_server/server/constants.py
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.
Applied to files:
- components/clp-mcp-server/clp_mcp_server/server/constants.py
📚 Learning: 2025-09-04T13:51:02.311Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1283
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:206-210
Timestamp: 2025-09-04T13:51:02.311Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/compress.py, an empty string ("") is a valid value for the --timestamp-key parameter, so validation should not treat empty strings as missing values.
Applied to files:
- components/clp-mcp-server/clp_mcp_server/server/constants.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: package-image
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (2)
components/clp-mcp-server/clp_mcp_server/server/constants.py (2)
37-41: CRITICAL wildcard rule effectively addresses the PR objective.The wildcard guidance is clearly marked as CRITICAL and includes a concrete example. This directly solves the problem described in the PR where queries like
registerfailed but*register*would succeed. The action-first structure and terse language align with the empirical guidelines.
47-52: Multi-word search rule is clear and well-validated.The CRITICAL marking, concrete example, and explicit note that "quotes and wildcards are required" effectively communicate the CLP-specific syntax. This aligns with the empirically validated format from your testing.
| - Time range: use search_by_kql_with_timestamp_range | ||
| to constrain by time.\n | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add missing timestamp format specification.
In the review discussion, you explicitly stated that timestamp format is critical: "query with timestamp that does not follow ISO 8601 in UTC format will be rejected." You proposed including the exact format specification, and this was flagged as a major issue in past reviews. Without this specification, LLMs may generate incorrectly formatted timestamps, causing query failures.
Apply this diff to add the timestamp format specification:
-- Time range: use search_by_kql_with_timestamp_range
- to constrain by time.\n
+- Time range: use search_by_kql_with_timestamp_range to constrain by time.
+  Timestamps must follow ISO 8601 UTC format: YYYY-MM-DDTHH:mm:ss.fffZ
+  (the trailing Z indicates UTC).
+  Example: 2024-10-15T08:30:00.000Z\n🤖 Prompt for AI Agents
In components/clp-mcp-server/clp_mcp_server/server/constants.py around lines
61-62, the comment about time range lacks the required timestamp format; update
the comment to explicitly state that timestamps must be ISO 8601 in UTC and give
the exact expected format and example (e.g. YYYY-MM-DDTHH:MM:SS.sssZ) so callers
and LLMs will produce valid timestamps; modify the string/documentation there to
append this specification and an example, and run a quick search to ensure any
related helper names (search_by_kql_with_timestamp_range) and docs are
consistent with the added format text.
Description
With our current system prompt, LLM would generate kql such as
kql_query:register OR registration OR registered.This would result
{"Error":"No log events found matching the KQL query."}or{"Error":"Query job with ID 15 ended in status FAILED."}pending the exact query submitted.The reason why this happens is that the kql generated by LLM is not CLP-compatible; CLP expects a kql query that looks like this, following the convention specified in here.
kql_query:*register* OR *registration* OR *registered*.This PR refines the system prompt to ensure LLM understands the rule of CLP's variant of KQL, thereby improving the efficiency of the diagnosis.
Checklist
breaking change.
Validation performed
We have carefully designed the system prompt and evaluated on two real-world cases: HDFS-4058 and Hadoop-10508. We have run each case three times, and the AI-generated kql reaches 100% accuracy (that is each query submitted wound't be killed or fail). HDFS-4058 makes 5 kql queries per time, and Hadoop-10508 makes 3 kql queires per time.
Summary by CodeRabbit
New Features
Documentation