Troubleshooting Query Plan Regressions guide #20893
Changes from 6 commits
eeb57a9
c273994
b5b409a
e6a0bc7
8949a08
d21f6a7
1674f4f
2af6ed9
fbe9373
b892e57
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,162 @@ | ||||||
| --- | ||||||
| title: Troubleshoot Query Plan Regressions | ||||||
| summary: Troubleshooting guide for when the cost-based optimizer chooses a new query plan that slows performance. | ||||||
| keywords: query plan, cost-based optimizer, troubleshooting | ||||||
| toc: true | ||||||
| docs_area: manage | ||||||
| --- | ||||||
|
|
||||||
| This page provides guidance on identifying the source of [query plan]({% link {{page.version.version}}/cost-based-optimizer.md %}) regressions. | ||||||
|
|
||||||
| For any given SQL statement, if the [cost-based optimizer]({% link {{page.version.version}}/cost-based-optimizer.md %}) chooses a new query plan that slows performance, you may observe an unexpected increase in query latency. There are several reasons that the optimizer might choose a plan that increases execution time. This guide will help you understand, identify, and diagnose query plan regressions using built-in CockroachDB tools. | ||||||
|
|
||||||
| ## Before you begin | ||||||
|
|
||||||
| - [Understand how the cost-based optimizer chooses query plans]({% link {{page.version.version}}/cost-based-optimizer.md %}) based on table statistics, and how those statistics are refreshed. | ||||||
bsanchez-the-roach marked this conversation as resolved.
|
||||||
|
|
||||||
| ## What to look out for | ||||||
|
||||||
|
|
||||||
| Query plan regressions only increase the execution time of SQL statements that use that plan. This means that the overall service latency of the cluster will only be affected during the execution of statements that are run with the problematic query plan. | ||||||
|
||||||
| Query plan regressions only increase the execution time of SQL statements that use that plan. This means that the overall service latency of the cluster will only be affected during the execution of statements that are run with the problematic query plan. | |
| Query plan regressions increase the execution time only for SQL statements that use the affected plan. This means that the overall service latency of the cluster will only be affected during the execution of statements that are run with the problematic query plan. |
Query plan regressions only increase the execution time of SQL statements that use the affected plan. This means that the overall service latency of the cluster will only be affected during the execution of statements that are run with the problematic query plan.
| This might make those latency spikes harder to identify. For example, if the problematic plan only affects a query that's run on an infrequent, ad-hoc basis, it might be difficult to notice a pattern among the graphs on the [**Metrics** page]({% link {{page.version.version}}/ui-overview.md %}#metrics). | |
| As a result, these latency spikes can be harder to identify. For example, if the problematic plan only affects a query that's run on an infrequent, ad-hoc basis, it might be difficult to notice a pattern among the graphs on the [**Metrics** page]({% link {{page.version.version}}/ui-overview.md %}#metrics). |
As a result, these latency spikes can be hard to identify. For example, if the problematic plan only affects a query that's run on an infrequent, ad-hoc basis, it might be difficult to notice a pattern among the graphs on the [Metrics page]({% link {{page.version.version}}/ui-overview.md %}#metrics).
| If you observe that your application is responding more slowly than usual, and this behavior hasn't been explained by recent changes to table schemas or data, or by changes to cluster workloads, it's worth considering a query plan regression. | |
| If you observe that your application is responding more slowly than usual, and it isn’t explained by recent changes to table schemas, data, or cluster workloads, it's worth considering a query plan regression. |
If you observe that your application is responding more slowly than usual, and this behavior isn’t explained by recent changes to table schemas, data, or cluster workloads, it's worth considering a query plan regression.
| If application performance slows at a particular time of day, note the time interval so that you can isolate SQL statements that tend to run in that interval. | |
| If performance slows at a particular time of day, note the interval to help isolate SQL statements that typically run during that time. |
If performance slows at a particular time of day, note the time interval to help isolate SQL statements that typically run during that time.
| If the newer plan is the same as the older plan (if it has the same Plan Gist), then there was no query plan regression, because the plan hasn't changed. | |
| If the newer plan matches the older plan (i.e., it has the same **Plan Gist**), there was no query plan regression. |
If the newer plan matches the older plan (if it has the same Plan Gist), there was no query plan regression.
This is based purely on the docs style guide, which says not to use Latinisms.
| - Compare the **Average Rows Read** of the two plans. If the value for the newer plan is significantly higher than the value for the older plan (as in: an order of magnitude) it's very possible that this is due to a query plan regression. If the value for the newer plan is only moderately higher than the value for the older plan, it's possible that this is due to a query plan regression, but it's also possible that this is due to normal table growth. An increase in this value could be causing an increase in the average execution time. | |
| - Compare the **Average Rows Read** of the two plans. If the newer plan’s value is significantly higher, such as an order of magnitude greater, it likely indicates a query plan regression. If the value for the newer plan is only moderately higher than the value for the older plan, it's possible that this is due to a query plan regression, but it's also possible that this is due to normal table growth. This increase may contribute to higher average execution time. |
Compare the Average Rows Read of the two plans. If the newer plan’s value is significantly higher (such as an order of magnitude greater), it likely indicates a query plan regression. If the value for the newer plan is only moderately higher than the value for the older plan, it's possible that this is due to a query plan regression, but it's also possible that this is due to normal table growth. This increase may contribute to higher average execution time.
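
The comparison described in this thread could be sketched roughly as follows. This is only an illustration of the heuristic, not a CockroachDB API: the function name, the gist strings, and the exact 10x threshold (standing in for "an order of magnitude") are assumptions, and the inputs are values you would copy by hand from the **Explain Plans** tab.

```python
def assess_plan_change(old_gist, new_gist, old_rows_read, new_rows_read):
    """Classify a plan change using Plan Gist and Average Rows Read.

    Hypothetical helper: all four arguments are values read manually from
    the Explain Plans tab for the older and newer plans.
    """
    # Same Plan Gist means the plan did not change, so no regression.
    if old_gist == new_gist:
        return "no plan change"
    # Without a positive baseline, a ratio cannot be computed.
    if old_rows_read <= 0:
        return "inconclusive"
    ratio = new_rows_read / old_rows_read
    # Assumed threshold: 10x stands in for "an order of magnitude".
    if ratio >= 10:
        return "likely regression"
    # A moderate increase may also be normal table growth.
    if ratio > 1:
        return "possible regression or table growth"
    return "rows read did not increase"
```

For example, `assess_plan_change("gist_a", "gist_b", 1_000, 25_000)` falls in the "likely regression" branch, while a change from 1,000 to 1,400 rows falls in the ambiguous middle case that the text attributes to either a regression or table growth.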
| If you were unable to identify a specific moment in time when the latency increased, you won't have a specific "before" and "after" to compare. If this is the case, it would still be useful to have a vague sense of the time of the increase (using the methods in Step 1), even if that range is many hours long. You can then use the above methods (in Step 3) to compare query plans on a rolling basis by changing the custom time interval to consecutive hour-long intervals. This might help you discover the specific time interval in which a sudden latency increase occurred. | |
| If you couldn’t identify a specific moment when latency increased, you won’t have a clear "before" and "after" to compare. In this case, it’s still helpful to have a general sense of when the increase occurred (using the methods from Step 1) even if the range spans several hours. You can then use the above methods (in Step 3) to compare query plans on a rolling basis by changing the custom time interval to consecutive hour-long intervals. This approach can help identify the specific interval when the latency spike occurred. |
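
As a rough aid for the rolling comparison described here, the consecutive hour-long intervals could be listed ahead of time and then entered one by one as the custom time interval. The sketch below is a hypothetical helper, not part of any CockroachDB tooling; the function name and the example timestamps are assumptions.

```python
from datetime import datetime, timedelta


def hourly_windows(start, end):
    """Split [start, end) into consecutive windows of at most one hour."""
    windows = []
    step = timedelta(hours=1)
    cursor = start
    while cursor < end:
        # The final window is clipped so it never runs past `end`.
        windows.append((cursor, min(cursor + step, end)))
        cursor += step
    return windows


# Example: a vague range from 09:00 to 12:30 becomes four windows.
for window_start, window_end in hourly_windows(
    datetime(2025, 1, 1, 9, 0), datetime(2025, 1, 1, 12, 30)
):
    print(window_start.isoformat(), "to", window_end.isoformat())
```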
| 1. In the **Explain Plans** tab, click on the Plan Gist of the more recent plan to see it in more detail. | |
| 1. In the **Explain Plans** tab, click the **Plan Gist** of the more recent plan to view its details. |
| 2. Click on **All Plans** above to return to the list of plans. | |
| 2. Click on **All Plans** above to return to the full list. |
| 3. Click on the Plan Gist of the previous plan to inspect it in more detail. Compare the two plans to understand what changed. They might be using different indexes. They also might be scanning different portions of the table, or using different join strategies. | |
| 3. Click on the Plan Gist of the previous plan to inspect it in more detail. Compare the two plans to understand what changed. They may use different indexes. They may also scan different parts of the table or use different join strategies. |
| 1. Look at the **Used Indexes** column for the older and the newer query plans. If these aren't the same, it's likely that the creation or deletion of an index resulted in a change to the statement's query plan. | |
| 1. Check the **Used Indexes** column for both the older and newer query plans. If these aren't the same, it's likely that the creation or deletion of an index resulted in a change to the statement's query plan. |
| 2. In the **Explain Plans** tab, click on the Plan Gist of the more recent plan to see it in more detail. Identify the table(s) used in the initial "scan" step of the plan. | |
| 2. In the **Explain Plans** tab, click the **Plan Gist** of the more recent plan to view its details. Identify the table(s) used in the initial "scan" step of the plan. |
| It's possible that the new index is well-chosen but that the schema change triggered a statistics refresh that is the root problem. It's also possible that the new index is not ideal. Think about how and when this table gets queried, to determine if the index should be reconsidered. [Check the **Insights** page for index recommendations]({% link {{ page.version.version }}/ui-insights-page.md %}#suboptimal-plan), and read more about [secondary index best practices]({% link {{ page.version.version }}/schema-design-indexes.md %}#best-practices). | |
| The new index may be well-chosen, but the schema change could have triggered a statistics refresh that caused the issue. It's also possible that the new index is not ideal. Consider how and when the table is queried to determine whether the index should be reconsidered. [Check the **Insights** page for index recommendations]({% link {{ page.version.version }}/ui-insights-page.md %}#suboptimal-plan), and read more about [secondary index best practices]({% link {{ page.version.version }}/schema-design-indexes.md %}#best-practices). |
| 1. In the **Explain Plans** tab, click on the Plan Gist of the more recent plan to see it in more detail. Identify the table used in the initial "scan" step of the plan. | |
| 1. In the **Explain Plans** tab, click the Plan Gist of the more recent plan to view its details. Identify the table used in the initial "scan" step of the plan. |
| If you suspect that the query plan change is the cause of the latency increase, and you suspect that the query plan changed due to stale statistics, you may want to [manually refresh the statistics for the table]({% link {{ page.version.version }}/create-statistics.md %}#examples). | |
| If you suspect that stale statistics caused the plan change and resulting latency increase, consider [manually refreshing the table’s statistics]({% link {{ page.version.version }}/create-statistics.md %}#examples). |
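
For readers scripting the manual refresh, a minimal sketch of building the statement is shown below. It assumes the `CREATE STATISTICS <name> FROM <table>` form covered on the linked page; the helper name, the default statistics name `manual_refresh`, and the table names are hypothetical, and the linked examples remain the reference for exact syntax and options.

```python
def refresh_statistics_sql(table_name, stats_name="manual_refresh"):
    """Build a CREATE STATISTICS statement for one table.

    Hypothetical helper: both names are illustrative. Only plain identifiers
    (optionally dot-qualified for the table) are accepted, so the statement
    is never assembled from unvalidated input.
    """
    if not all(part.isidentifier() for part in table_name.split(".")):
        raise ValueError(f"unexpected table name: {table_name!r}")
    if not stats_name.isidentifier():
        raise ValueError(f"unexpected statistics name: {stats_name!r}")
    return f"CREATE STATISTICS {stats_name} FROM {table_name};"


# Example with an assumed table named "orders".
print(refresh_statistics_sql("orders"))
```

The resulting string would still need to be run through a SQL client against the cluster; this sketch only assembles the text.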
| If the SQL statement fingerprint contains placeholder values ("_"), it's possible that a change in that literal is responsible for a query plan regression. This is also worth considering in the case of [multiple valid query plans](#multiple-valid-query-plans), if a change in the distribution of plans has led to a higher average execution time. | |
| If the SQL statement fingerprint contains placeholder values ("_"), a change in a literal may be responsible for a query plan regression. This is also worth considering in the case of [multiple valid query plans](#multiple-valid-query-plans), if a change in the distribution of plans has led to a higher average execution time. |
| Inspect your application to see if the literals being used within the query executions are changing. | |
| Inspect your application to determine whether the query literals are changing between executions. |
| If you suspect that the query plan change is the cause of the latency increase, and you suspect that the query plan changed due to a changed query literal, it's possible that the table statistics don't accurately reflect how the literal values are represented in the data. You may want to [manually refresh the statistics for the table]({% link {{ page.version.version }}/create-statistics.md %}#examples). It's also possible that the table indexes are not helpful for queries with the newer literal value, in which case you may want to [check the **Insights** page for index recommendations]({% link {{ page.version.version }}/ui-insights-page.md %}#suboptimal-plan). | |
| If you suspect the plan change caused the latency increase and was triggered by a changed query literal, table statistics may not accurately reflect how those values appear in the data. You may want to [manually refresh the statistics for the table]({% link {{ page.version.version }}/create-statistics.md %}#examples). It’s also possible that the current indexes aren’t effective for queries with the new literal value. In that case, [check the **Insights** page for index recommendations]({% link {{ page.version.version }}/ui-insights-page.md %}#suboptimal-plan). |
| If this does not fix the issue, a more drastic redesign of the schema or application may be needed. | |
| If the issue persists, a more substantial redesign of the schema or application may be required. |
| 2. Go the [**Events** panel]({% link {{page.version.version}}/ui-runtime-dashboard.md %}#events-panel) on the right. Scroll to the bottom, and click **View All Events**. | |
| 2. Go to the [**Events** panel]({% link {{page.version.version}}/ui-runtime-dashboard.md %}#events-panel) on the right. Scroll to the bottom, and click **View All Events**. |
| See if any events occured around that time that may have contributed to a query plan regression. These might include schema changes that affect tables involved in the suspect SQL queries, [changed cluster settings]({% link {{ page.version.version }}/set-cluster-setting.md %}), created or dropped indexes, and more. | |
| Check for any events around that time that may have contributed to a query plan regression. These may include schema changes affecting tables in suspect SQL queries, [modified cluster settings]({% link {{ page.version.version }}/set-cluster-setting.md %}), created or dropped indexes, and more. |
| A consequential event around the time of the latency increase may have affected the way that the optimizer chose query plans. Inspect changed cluster settings, or [determine if the table indexes changed](#determine-if-the-table-indexes-changed). | |
| An event around the time of the latency increase may have influenced how the optimizer selected query plans. Inspect changed cluster settings, or [determine if the table indexes changed](#determine-if-the-table-indexes-changed). |
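
If the event list has been copied out of the **Events** panel, narrowing it to the time of the latency increase could look like the sketch below. The record shape (dicts with `timestamp` and `description` keys), the function name, and the one-hour default window are all assumptions for illustration; this is not an export format or API that the DB Console provides.

```python
from datetime import datetime, timedelta


def events_near(events, latency_increase_at, window=timedelta(hours=1)):
    """Return events whose timestamp is within `window` of the increase.

    `events` is a hypothetical list of dicts with "timestamp" (datetime)
    and "description" (str) keys, transcribed by hand from the Events panel.
    """
    return [
        event
        for event in events
        if abs(event["timestamp"] - latency_increase_at) <= window
    ]


# Example with made-up events around a latency increase at 14:00.
sample_events = [
    {"timestamp": datetime(2025, 1, 1, 9, 15), "description": "cluster setting changed"},
    {"timestamp": datetime(2025, 1, 1, 13, 40), "description": "index created"},
    {"timestamp": datetime(2025, 1, 1, 18, 5), "description": "index dropped"},
]
for event in events_near(sample_events, datetime(2025, 1, 1, 14, 0)):
    print(event["timestamp"].isoformat(), event["description"])
```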