Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL Extraction Issue with Multi-CTE Queries #751

Open
SynapticSolutionsAI opened this issue Jan 11, 2025 · 0 comments
Open

SQL Extraction Issue with Multi-CTE Queries #751

SynapticSolutionsAI opened this issue Jan 11, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@SynapticSolutionsAI
Copy link

When processing queries with multiple CTEs, Vanna's SQL extraction mechanism appears to start from the middle of the query rather than the beginning, resulting in syntax errors. The LLM generates the correct SQL, but the extraction process truncates it incorrectly.

Here's an anonymized example from our logs demonstrating the issue:
LLM Response (correct, complete SQL):
WITH first_cte AS ( SELECT id, timestamp FROM table1 t1 WHERE condition1 = 'value1' GROUP BY id ), second_cte AS ( SELECT t2.id, MAX(t2.date) AS latest_date FROM table2 t2 JOIN table3 t3 ON t3.id = t2.id WHERE condition2 = 'value2' GROUP BY t2.id ) SELECT t.name, t.date, t.metric FROM table4 t JOIN first_cte fc ON fc.id = t.id JOIN second_cte sc ON sc.id = t.id ORDER BY t.metric DESC LIMIT 50;

Extracted SQL (truncated, starts from middle):

SELECT t2.id, MAX(t2.date) AS latest_date FROM table2 t2 JOIN table3 t3 ON t3.id = t2.id WHERE condition2 = 'value2' GROUP BY t2.id ) SELECT t.name, t.date, t.metric FROM table4 t...

  1. The LLM response contains the complete, correct SQL query
  2. The extraction process consistently:
    • Misses the first WITH clause
    • Starts from a middle SELECT statement
    • Results in unmatched parentheses
    • Sometimes duplicates the truncated parts

Questions

  1. What causes the extraction to start from the middle of the query rather than the beginning and what is the pattern?
  2. Is there a way to ensure the extraction process captures the complete query, starting from the first WITH clause?
  3. Are there any patterns s or best practices to handle complex queries with multiple CTEs?

We've observed this behavior inconsistently with queries containing multiple CTEs, some queries work, while others don't. On this basis, the behavior is consistent, affected queries always fail, unaffected CTE queries always work. The issue appears to be in the extraction process rather than the LLM response, as the complete SQL in the LLM response is correct.

@SynapticSolutionsAI SynapticSolutionsAI added the bug Something isn't working label Jan 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant