Skip to content

Conversation

@20001020ycx
Copy link
Contributor

@20001020ycx 20001020ycx commented Oct 24, 2025

Description

With our current system prompt, LLM would generate kql such as kql_query: register OR registration OR registered.
This would result {"Error":"No log events found matching the KQL query."} or {"Error":"Query job with ID 15 ended in status FAILED."} pending the exact query submitted.

The reason why this happens is that the kql generated by LLM is not CLP-compatible; CLP expects a kql query that looks like this, following the convention specified in here.
kql_query: *register* OR *registration* OR *registered*.

This PR refines the system prompt to ensure LLM understands the rule of CLP's variant of KQL, thereby improving the efficiency of the diagnosis.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

We have carefully designed the system prompt and evaluated on two real-world cases: HDFS-4058 and Hadoop-10508. We have run each case three times, and the AI-generated kql reaches 100% accuracy (that is each query submitted wound't be killed or fail). HDFS-4058 makes 5 kql queries per time, and Hadoop-10508 makes 3 kql queires per time.

Summary by CodeRabbit

  • New Features

    • Query guidance replaced with a CKQL-focused prompt, introducing explicit CRITICAL rules, CKQL terminology, and stricter wildcard/quoting behavior to reduce ambiguous matches.
  • Documentation

    • Clarified time-range usage and ISO 8601 UTC timestamp expectations.
    • Expanded guidance on quoting, escaping and CKQL-specific constraints to improve query reliability and predictability.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 24, 2025

Walkthrough

Replaced a KQL-focused SYSTEM_PROMPT with a CKQL-centered triple-quoted prompt in components/clp-mcp-server/clp_mcp_server/server/constants.py, adding explicit CRITICAL rules for wildcard usage, quoting/escaping guidance, and ISO 8601 UTC Z timestamp guidance.

Changes

Cohort / File(s) Change Summary
System prompt constant
components/clp-mcp-server/clp_mcp_server/server/constants.py
Replaced the previous multi-line KQL guidance string with a consolidated CKQL-focused SYSTEM_PROMPT triple-quoted string. Introduced CKQL terminology, CRITICAL rules (wildcards required at both ends for substring searches), double-quoting for multi-word keys/values, escaping rules, ISO 8601 UTC Z time guidance, and reformatted examples/rule structure.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Review focus:
    • Validate exact prompt wording, punctuation, and intended CRITICAL constraints.
    • Check string quoting/escaping and line breaks do not introduce runtime issues.
    • Confirm no other constants or identifiers were unintentionally modified.

Possibly related issues

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "fix(clp-mcp-server): Refine system prompt to make LLM's KQL query generation to be CLP friendly" directly aligns with the main change in the changeset. The raw summary confirms the system prompt was refined by replacing old KQL guidance with CLP-based rules, and the PR objectives detail how the refined prompt teaches LLMs to generate CLP-compatible KQL queries with proper wildcarding. The title is specific and clear—a developer scanning the history would immediately understand the primary change involves improving the system prompt to ensure LLM-generated queries work correctly with CLP's KQL variant. The title avoids vague terminology and accurately reflects the changeset content without being misleading.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a2c493b and 384ffb7.

📒 Files selected for processing (1)
  • components/clp-mcp-server/clp_mcp_server/server/constants.py (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: junhaoliao
PR: y-scope/clp#1405
File: components/clp-package-utils/pyproject.toml:5-15
Timestamp: 2025-10-13T03:24:35.074Z
Learning: In the y-scope/clp repository, the Python 3.9 to 3.10 version requirement change was intentionally deferred to a separate PR (after PR #1405) to reduce review effort, as decided in an offline discussion between junhaoliao and kirkrodrigues.
📚 Learning: 2025-05-07T16:56:35.687Z
Learnt from: LinZhihao-723
PR: y-scope/clp#882
File: components/core/src/clp/ffi/ir_stream/search/QueryHandlerImpl.cpp:378-402
Timestamp: 2025-05-07T16:56:35.687Z
Learning: In the CLP search component, the `evaluate_wildcard_filter` function should return `AstEvaluationResult::Pruned` when `node_id_value_pairs` is empty, not `AstEvaluationResult::False`. Empty node sets should be treated as "undetermined" rather than definitive non-matches.

Applied to files:

  • components/clp-mcp-server/clp_mcp_server/server/constants.py
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • components/clp-mcp-server/clp_mcp_server/server/constants.py
📚 Learning: 2025-09-04T13:51:02.311Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1283
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:206-210
Timestamp: 2025-09-04T13:51:02.311Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/compress.py, an empty string ("") is a valid value for the --timestamp-key parameter, so validation should not treat empty strings as missing values.

Applied to files:

  • components/clp-mcp-server/clp_mcp_server/server/constants.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (4)
components/clp-mcp-server/clp_mcp_server/server/constants.py (4)

14-27: Guidelines are well-structured and empirically validated.

The numbered guidelines (1-7) are clear, sequential, and based on empirical testing as noted in the PR objectives. These practices provide valuable context for future maintainers editing the system prompt.


27-35: Clear job definition follows documented guidelines.

The introduction and job description effectively communicate the AI's role and follow guideline #4 by placing functional behavior (generating CKQL and showing logs via hyperlinks) early in the prompt.


43-45: Consider adding wildcards to the combining conditions example for consistency.

The combining conditions example shows request: GET AND response: 400 without wildcards, which contrasts with the CRITICAL rule example on line 41 (request: *GET*) that uses wildcards. This inconsistency could confuse the LLM about when wildcards are required.

If substring matching is the typical use case (as suggested by the PR objective), consider updating the example to maintain consistency:

-  request: GET AND response: 400\n
+  request: *GET* AND response: *400*\n

Alternatively, if this example intentionally demonstrates exact matching, add a brief clarification (e.g., "for exact matches" or "without wildcards for exact match").


54-65: Escaping rules and unsupported features are clear.

The escaping rules clearly differentiate between keys and values, and the unsupported features section appropriately sets LLM expectations about CLP-KQL limitations. These rules follow the guideline of using terse language and bullet points.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@20001020ycx 20001020ycx marked this pull request as ready for review October 24, 2025 18:48
@20001020ycx 20001020ycx requested a review from a team as a code owner October 24, 2025 18:48
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eb7f0d5 and f5e8e30.

📒 Files selected for processing (1)
  • components/clp-mcp-server/clp_mcp_server/server/constants.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: clp-lint
components/clp-mcp-server/clp_mcp_server/server/constants.py

[error] 26-26: Q003 [*] Change outer quotes to avoid escaping inner quotes. Found 1 error. 1 fixable with the --fix option.

🪛 Ruff (0.14.1)
components/clp-mcp-server/clp_mcp_server/server/constants.py

26-26: Change outer quotes to avoid escaping inner quotes

Change outer quotes to avoid escaping inner quotes

(Q003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: package-image
🔇 Additional comments (3)
components/clp-mcp-server/clp_mcp_server/server/constants.py (3)

15-16: LGTM: Improved clarity by specifying the database name.

The addition of "called CLP" helps clarify the context for the LLM.


33-36: Excellent additions for improving query quality.

The guidance to ask clarifying questions when user queries are ambiguous and to ensure syntactically correct queries will help prevent the types of issues described in the PR objective.


31-32: Timestamp format documentation is inaccurate—the implementation accepts variable fractional-second precision, not exactly three digits.

The code demonstrates that timestamps with zero, one, or any number of fractional digits are accepted (e.g., 2024-10-18T16:00:00Z and 2024-10-18T16:00:00.123Z both parse successfully). The parsing relies on Python's datetime.fromisoformat(), which is flexible with fractional-second precision. However, the documentation in lines 31–32 specifies the format as YYYY-MM-DDTHH:mm:ss.fffZ (implying exactly three fractional digits), which does not reflect the actual implementation.

Update the format specification to clarify that fractional seconds are optional and any ISO 8601 precision is accepted, or add strict validation to enforce exactly three fractional digits if that is the intended requirement.

@20001020ycx 20001020ycx changed the title fix<clp-mcp-server>: Refine system prompt to improve LLM's KQL query generation to be CLP friendly fix(clp-mcp-server): Refine system prompt to improve LLM's KQL query generation to be CLP friendly Oct 24, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
components/clp-mcp-server/clp_mcp_server/server/constants.py (2)

23-24: Wildcard guidance addresses the core PR objective.

This new instruction directly solves the issue where queries like register OR registration OR registered were failing. The guidance correctly tells the LLM to use *register* OR *registration* OR *registered* for substring matching.

However, as noted in a previous review, the wording could be clearer and more concise.


25-26: Fix the linting error and indentation inconsistency.

This guidance about quoting multi-word keys/values is valuable for CLP's KQL variant. However, as flagged in previous reviews, there are two formatting issues:

  1. Critical: The quote style causes a linting error that must be fixed.
  2. Minor: The bullet indentation is inconsistent with line 23.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f5e8e30 and df3e46e.

📒 Files selected for processing (1)
  • components/clp-mcp-server/clp_mcp_server/server/constants.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (6)
components/clp-mcp-server/clp_mcp_server/server/constants.py (6)

15-16: LGTM! Clear identification of the database and acronym expansion.

Adding "called CLP" and expanding "KQL" to "Kibana Query Language" improves clarity for users who may not be familiar with the acronym.


20-20: LGTM! Improved readability.

The blank line appropriately separates the general description from the CLP-specific guidelines section.


21-22: LGTM! Essential context about KQL variant.

Explicitly stating that CLP supports a KQL variant provides crucial context for why the following guidelines differ from standard KQL usage patterns.


33-34: LGTM! Promotes better user interaction.

Instructing the LLM to ask clarifying questions when the user's intent is unclear is a best practice that will improve the accuracy of generated queries and overall user experience.


35-36: LGTM! Good defensive requirement.

While this requirement may seem implicit, explicitly instructing the LLM to ensure syntactic correctness is a good practice that reinforces the importance of generating valid, executable queries.


31-32: Timestamp format specification is accurate and verified.

The format YYYY-MM-DDTHH:mm:ss.fffZ with three-digit millisecond precision is correctly specified. The codebase validation confirms:

  • The trailing Z is mandatory (enforced in parse_timestamp_range() at utils.py:45)
  • Millisecond precision (.fff) is consistently documented across constants.py, server.py, and utils.py
  • The format matches exactly what the timestamp parsing and conversion functions expect

No changes needed.

@rishikeshdevsot
Copy link
Contributor

@kirkrodrigues @gibber9809 this PR's objective is to explain to the LLM the difference between regular KQL and KQL supported by CLP using as little text/tokens as possible. So one of you should probably take a second look to ensure that the important points are covered.

Some considerations:

  1. We don't want to list all the differences because that's too much text and it will confuse the LLM.
  2. We also can't rely on AI following this link because not all agents will be given internet access and even with internet access, an agent following a link is hit or miss.

@20001020ycx 20001020ycx changed the title fix(clp-mcp-server): Refine system prompt to improve LLM's KQL query generation to be CLP friendly fix(clp-mcp-server): Refine system prompt to make LLM's KQL query generation to be CLP friendly Oct 24, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
components/clp-mcp-server/clp_mcp_server/server/constants.py (1)

18-19: Clarify the interaction between specificity and wildcards.

The guidance to "be as specific as possible" might appear contradictory with the mandatory wildcard usage (lines 23-24), since wildcards broaden matches. Consider clarifying that "specific" refers to using specific field names and narrowing by time range or field values, while wildcards should still be used for keyword matching as required by CLP.

Example clarification:

-" query should be as specific as possible to minimize the number of log messages returned. When"
+" query should use specific field names, values, and time ranges to minimize the number of log"
+" messages returned. When"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f5e8e30 and 5c88582.

📒 Files selected for processing (1)
  • components/clp-mcp-server/clp_mcp_server/server/constants.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (2)
components/clp-mcp-server/clp_mcp_server/server/constants.py (2)

15-16: Good clarification of the target database.

The change from "log database using KQL" to "log database called CLP using KQL" makes the prompt more specific and sets the context for the CLP-specific guidance that follows.


21-22: Effective introduction to CLP-specific KQL.

This introduction appropriately sets the context that CLP uses a variant of KQL, preparing the LLM for the specific guidelines that follow. This aligns well with the PR objective to teach the LLM about CLP's KQL differences concisely.

Comment on lines 25 to 28
'- CRITICAL: Use double quotation(") to wrap multiple keywords search. For example, to search for:'
' hello world, your query parameter should literally be: "*hello world*" (with the quotes as part'
' of the search string wrapping around both wildcards and multiple words). Otherwise, the query'
' will fail.'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Clarify the multi-word search syntax with examples.

The guidance on combining double quotes and wildcards for multi-word searches is confusing and may lead to incorrect query generation:

  1. Ambiguous phrasing: "with the quotes as part of the search string wrapping around both wildcards and multiple words" is difficult to parse. Does this mean:

    • The literal string is "*hello world*" including quote characters?
    • The quotes are KQL syntax (not literal) that wrap *hello world*?
  2. Missing context: The guidance doesn't clarify whether this applies to:

    • Field-value searches: field:"*multi word value*"
    • Free-text phrase searches: "*multi word phrase*"
    • Both?
  3. Syntax uncertainty: Is "*hello world*" actually valid CLP KQL syntax? In standard KQL, wildcards and phrase quotes serve different purposes and combining them this way is unusual.

Please verify the correct CLP KQL syntax for multi-word searches and provide clearer examples:

Consider restructuring the guidance to be more explicit:

- CRITICAL: For multi-word searches, combine wildcards and quotes:
  * Field searches: field:"*multi word value*"
  * Phrase searches: "*multi word phrase*"
  * The wildcards go inside the quotes, with quotes used to group the phrase.

🏁 Script executed:

#!/bin/bash
# Check CLP documentation or tests for multi-word wildcard search syntax
rg -n -C3 "multi.*word.*wildcard|wildcard.*phrase|quote.*wildcard" --type=md
rg -n -C3 '".*\*.*\s.*\*.*"' --type=py --type=js

Length of output: 1011


I'll verify the CLP KQL syntax by checking documentation and examining how queries are actually constructed in the codebase.

Let me search for more specific information about CLP unstructured text log syntax (since the constants seem to be for general search guidance):

Now let me check how the MCP server actually constructs and passes queries to understand if this guidance matches the implementation:

Let me search the codebase directly for how queries are constructed and used in the MCP server:Based on my research, I now have sufficient information to provide a comprehensive rewritten review. The key findings are:

CLP requires double quotes to wrap multi-word searches for keys or values, and queries with spaces are interpreted as substring searches. Most crucially, CLP automatically inserts wildcards at the beginning and end of the search string, making it a substring search by default.

The critical issue is that the guidance in the constants file is misleading because it suggests users must manually type "*hello world*" when in reality CLP automatically adds these wildcards. The phrase "with the quotes as part of the search string" conflates two distinct concepts: quoting syntax (for multi-word phrases) and wildcards (for substring matching).

Replace confusing multi-word search guidance with clarity on CLP's automatic wildcard handling.

The guidance on lines 25–28 conflates quoting syntax with wildcard syntax, leading to an incorrect example. CLP automatically inserts wildcards at the beginning and end of search strings, so users do not need to type "*hello world*" manually. The correct guidance should clarify that:

  • For multi-word phrases, use double quotes: "hello world"
  • CLP handles substring matching automatically without explicit wildcards
  • Wildcards are optional if the user wants to specify exact positions

Rewrite to:

'- CRITICAL: Use double quotation marks to wrap multiple-word searches. For example:'
' "hello world" will match log entries containing the phrase hello world.'
' CLP automatically performs substring matching, so you do not need to manually'
' add wildcards (*) unless you want to control their position.'

# fmt: off
SYSTEM_PROMPT = (
"You are an AI assistant that helps users query a log database using KQL (Kibana Query Language)."
"You are an AI assistant that helps users query a log database called CLP using KQL"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's what I came up with:

You are an AI assistant that helps users query a log database, called CLP, by using a variant of KQL
(Kibana Query Language) that we'll call CLP-KQL.

You should generate a CLP-KQL query that accurately expresses the user's intent. The generated query
should be as specific as possible to minimize the number of log messages returned. When displaying
log messages, wrap them in hyperlinks with the link field from the search result.

Consider the following guidelines when generating CLP-KQL queries:

  • To search for a substring (a partial match), use the wildcards * and ?.
  • To query a key or a value that contains multiple words separated by spaces, wrap the words in
    double quotation marks.
  • In keys, to search for any of the literal characters \, ", ., *, @, $, !, and #, escape
    them with a backslash.
  • In values, to search for any of the literal characters \, ", ?, and *, escape them with a
    backslash.
  • Consider using search_by_kql_with_timestamp_range to narrow the query to a specific time range.
  • CLP doesn't support fuzzy matches.
  • CLP doesn't support less/greater than comparisons on strings, IP addresses, or timestamps.
  • If the user's prompt is vague, ask clarifying questions to better understand their intent before
    generating the CLP-KQL query.

Copy link
Contributor Author

@20001020ycx 20001020ycx Oct 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have problems with the rest of the points except the first two because we have specifically refined those two in order to make AI generate correct kql.
We have iterated many versions and only the current version ensures AI could understand and follow the rule. The takeaway from the experiment was that in order to mark a critical rule effective, we need to be:

  1. marking important rule with CRITICAL,
  2. explaining the rule with a concrete example

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Can you write a revision?
  • Can we add a comment above the prompt explaining what characteristics need to be maintained based on your experiments? Otherwise the next person to edit this likely won't know either, lol.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the second point and will add the rule as the comment right above the constant instantiation. But let's first make an agreement on the actual system prompt string.

Here is my revision:

You are an AI assistant that helps users query a log database, called CLP, by using a variant of KQL
(Kibana Query Language) that we'll call CLP-KQL.

You should generate a CLP-KQL query that accurately expresses the user's intent. The generated query
should be as specific as possible to minimize the number of log messages returned. When displaying
log messages, wrap them in hyperlinks with the link field from the search result.

Consider the following guidelines when generating CLP-KQL queries, and pay special attention to rules marked as CRITICAL - violation will cause query failure:

  • CRITICAL: To search for a substring keyword (a partial match), you MUST use wildcards (* or ?) at the beginning and end of your keyword.
  • CRITICAL: Use double quotation(") to wrap multiple keywords search that are separated by spaces. For example, to search for all logs containing: hello world, your query parameter should literally be: "hello world" (with the quotes as part of the search string wrapping around both wildcards and multiple words).
  • In keys, to search for any of the literal characters , ", ., *, @, $, !, and #, escape them with a backslash.
  • In values, to search for any of the literal characters , ", ?, and *, escape them with a backslash.
  • Consider using search_by_kql_with_timestamp_range to narrow the query to a specific time range. Timestamps must follow the ISO 8601 UTC format (YYYY-MM-DDTHH:mm:ss.fffZ), where the trailing Z indicates UTC.
  • CLP doesn't support fuzzy matches.
  • CLP doesn't support less/greater than comparisons on strings, IP addresses, or timestamps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are my rationales behind my edits, a key point that I would like to emphasize is that we would like this system prompt to be AI friendly, even if this might not be stricitly following a documentation style for human readers

  1. CRITICAL Rule Marking: LLMs do not guarantee enforcement of all system prompt rules—this was confirmed through empirical testing. Rules 1 and 2 are essential for generating correct CLP-KQL queries; without them, LLMs produce invalid syntax like message: hello world (missing wildcards and quotes). Marking these rules as "CRITICAL" significantly improves KQL success rate, and ensuring LLMs prioritize them even when other rules may be ignored.
  2. Action item first and example showing: What I change in the second point was I move the action item for AI to the beginning of the sentence. I found that effective in the experiement. Same argument goes with example.
  3. Definition of exact format of timestamp: query with timestamp that does not follows ISO 8601 in UTC format will be rejected, so we want to make sure AI generates the exact format
  4. Removing your last point (also this is what I have in my original commit): i feel this is something good to have. However, I found that system prompt's length directly impacts LLM attention, as we don't want LLM loses its focus on CRITICAL points by some minor rules or reminder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 2nd point mentioned by @kirkrodrigues is updated in the pr

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responded in the design doc.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a4b8753 and 2f792c9.

📒 Files selected for processing (1)
  • components/clp-mcp-server/clp_mcp_server/server/constants.py (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: junhaoliao
PR: y-scope/clp#1405
File: components/clp-package-utils/pyproject.toml:5-15
Timestamp: 2025-10-13T03:24:35.074Z
Learning: In the y-scope/clp repository, the Python 3.9 to 3.10 version requirement change was intentionally deferred to a separate PR (after PR #1405) to reduce review effort, as decided in an offline discussion between junhaoliao and kirkrodrigues.
📚 Learning: 2025-05-07T16:56:35.687Z
Learnt from: LinZhihao-723
PR: y-scope/clp#882
File: components/core/src/clp/ffi/ir_stream/search/QueryHandlerImpl.cpp:378-402
Timestamp: 2025-05-07T16:56:35.687Z
Learning: In the CLP search component, the `evaluate_wildcard_filter` function should return `AstEvaluationResult::Pruned` when `node_id_value_pairs` is empty, not `AstEvaluationResult::False`. Empty node sets should be treated as "undetermined" rather than definitive non-matches.

Applied to files:

  • components/clp-mcp-server/clp_mcp_server/server/constants.py
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • components/clp-mcp-server/clp_mcp_server/server/constants.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (4)
components/clp-mcp-server/clp_mcp_server/server/constants.py (4)

14-20: Excellent meta-documentation for prompt design.

This comment directly addresses the maintainability concern raised in the review discussion and documents the empirically validated practices. Future editors will understand why critical rules are marked and structured as they are.


28-30: Core wildcard rule addresses the PR's primary objective.

This CRITICAL rule with concrete example directly solves the issue described in the PR objectives where register failed but *register* succeeded. The format follows the documented best practices and has been empirically validated.


42-45: Escaping rules are clear and well-differentiated.

The distinction between key and value escaping rules is clearly documented, helping LLMs generate syntactically correct queries.


49-50: Unsupported features clearly documented.

Explicitly listing unsupported features helps prevent LLMs from generating invalid queries.

SYSTEM_PROMPT = """You are an AI assistant for querying the CLP log database using CLP-KQL (CKQL).\n
Your job is to generate CKQL that faithfully expresses the user's intent and show key logs to the user
- Start broad to learn the schema/fields using wildcard searches like *, then narrow the query to return a manageable result set.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Clarify the wildcard example for schema exploration.

The phrase "wildcard searches like *" could be ambiguous. Does this mean:

  • A literal query * to match all logs?
  • Using wildcards in general (e.g., *error*, *warn*)?

Consider rephrasing to make the intent explicit, such as: "Start broad to explore the schema/fields (e.g., use * to see all logs or *error* to find error-related logs), then narrow the query..."

🤖 Prompt for AI Agents
In components/clp-mcp-server/clp_mcp_server/server/constants.py around line 24,
the comment "Start broad to learn the schema/fields using wildcard searches like
*, then narrow the query to return a manageable result set." is ambiguous about
what "wildcard searches like *" means; update the comment to explicitly show
examples (e.g., use `*` to match all logs or `*error*` to find error-related
logs) and rephrase to: "Start broad to explore schema/fields (e.g., use `*` to
see all logs or `*error*` to find error-related logs), then narrow the query to
a manageable result set." Ensure the new text replaces the old line exactly and
follows the project's comment style.

Comment on lines 36 to 39
- CRITICAL -- Multi-word text must be quoted: wrap multi-word searches in double quotes.
Example:
request: "*GET wp-admin*"
(quotes and wildcards are required).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Multi-word rule is clear and empirically validated.

The example correctly demonstrates the combination of quotes and wildcards for multi-word substring searches. This format has been validated through testing.

Optional: Consider changing line 39 from "(quotes and wildcards are required)" to "(both quotes and wildcards are required)" to more explicitly emphasize the combination.

🤖 Prompt for AI Agents
In components/clp-mcp-server/clp_mcp_server/server/constants.py around lines 36
to 39, the comment example is fine but line 39's parenthetical phrasing "(quotes
and wildcards are required)" should be changed to "(both quotes and wildcards
are required)" for clarity; update that exact text to the new wording while
preserving the surrounding example and punctuation.

Comment on lines 47 to 48
- Time range: use search_by_kql_with_timestamp_range to constrain by time.\n
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Missing critical timestamp format specification.

In the review discussion, you explicitly stated (point #3 in rationale) that timestamp format is critical: "query with timestamp that does not follow ISO 8601 in UTC format will be rejected." You proposed including: "Timestamps must follow the ISO 8601 UTC format (`YYYY-MM-DDTHH:mm:ss.fffZ`), where the trailing `Z` indicates UTC."

This specification is missing from the current version. Without it, LLMs may generate timestamps in incorrect formats, causing query failures.

Consider adding the timestamp format specification:

-- Time range: use search_by_kql_with_timestamp_range to constrain by time.\n
+- Time range: use search_by_kql_with_timestamp_range to constrain by time.
+  Timestamps must follow ISO 8601 UTC format: YYYY-MM-DDTHH:mm:ss.fffZ (the trailing Z indicates UTC).
+  Example: 2024-10-15T08:30:00.000Z\n
🤖 Prompt for AI Agents
In components/clp-mcp-server/clp_mcp_server/server/constants.py around lines
47-48, the timestamp format requirement is missing; add a clear constant or
comment stating that all timestamps must follow the ISO 8601 UTC format
"YYYY-MM-DDTHH:mm:ss.fffZ" (with fractional seconds and trailing Z indicating
UTC) and update any related docstring or constant used by
search_by_kql_with_timestamp_range to reference this exact format so generated
queries and validations enforce/rely on that format.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e57098e and a2c493b.

📒 Files selected for processing (1)
  • components/clp-mcp-server/clp_mcp_server/server/constants.py (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: junhaoliao
PR: y-scope/clp#1405
File: components/clp-package-utils/pyproject.toml:5-15
Timestamp: 2025-10-13T03:24:35.074Z
Learning: In the y-scope/clp repository, the Python 3.9 to 3.10 version requirement change was intentionally deferred to a separate PR (after PR #1405) to reduce review effort, as decided in an offline discussion between junhaoliao and kirkrodrigues.
📚 Learning: 2025-05-07T16:56:35.687Z
Learnt from: LinZhihao-723
PR: y-scope/clp#882
File: components/core/src/clp/ffi/ir_stream/search/QueryHandlerImpl.cpp:378-402
Timestamp: 2025-05-07T16:56:35.687Z
Learning: In the CLP search component, the `evaluate_wildcard_filter` function should return `AstEvaluationResult::Pruned` when `node_id_value_pairs` is empty, not `AstEvaluationResult::False`. Empty node sets should be treated as "undetermined" rather than definitive non-matches.

Applied to files:

  • components/clp-mcp-server/clp_mcp_server/server/constants.py
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • components/clp-mcp-server/clp_mcp_server/server/constants.py
📚 Learning: 2025-09-04T13:51:02.311Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1283
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:206-210
Timestamp: 2025-09-04T13:51:02.311Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/compress.py, an empty string ("") is a valid value for the --timestamp-key parameter, so validation should not treat empty strings as missing values.

Applied to files:

  • components/clp-mcp-server/clp_mcp_server/server/constants.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (2)
components/clp-mcp-server/clp_mcp_server/server/constants.py (2)

37-41: CRITICAL wildcard rule effectively addresses the PR objective.

The wildcard guidance is clearly marked as CRITICAL and includes a concrete example. This directly solves the problem described in the PR where queries like register failed but *register* would succeed. The action-first structure and terse language align with the empirical guidelines.


47-52: Multi-word search rule is clear and well-validated.

The CRITICAL marking, concrete example, and explicit note that "quotes and wildcards are required" effectively communicate the CLP-specific syntax. This aligns with the empirically validated format from your testing.

Comment on lines +61 to +62
- Time range: use search_by_kql_with_timestamp_range
to constrain by time.\n
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add missing timestamp format specification.

In the review discussion, you explicitly stated that timestamp format is critical: "query with timestamp that does not follow ISO 8601 in UTC format will be rejected." You proposed including the exact format specification, and this was flagged as a major issue in past reviews. Without this specification, LLMs may generate incorrectly formatted timestamps, causing query failures.

Apply this diff to add the timestamp format specification:

-- Time range: use search_by_kql_with_timestamp_range
- to constrain by time.\n
+- Time range: use search_by_kql_with_timestamp_range to constrain by time.
+  Timestamps must follow ISO 8601 UTC format: YYYY-MM-DDTHH:mm:ss.fffZ
+  (the trailing Z indicates UTC).
+  Example: 2024-10-15T08:30:00.000Z\n
🤖 Prompt for AI Agents
In components/clp-mcp-server/clp_mcp_server/server/constants.py around lines
61-62, the comment about time range lacks the required timestamp format; update
the comment to explicitly state that timestamps must be ISO 8601 in UTC and give
the exact expected format and example (e.g. YYYY-MM-DDTHH:MM:SS.sssZ) so callers
and LLMs will produce valid timestamps; modify the string/documentation there to
append this specification and an example, and run a quick search to ensure any
related helper names (search_by_kql_with_timestamp_range) and docs are
consistent with the added format text.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants