Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect schema change detection in spanner cdc to bq template #2183

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ShuranZhang
Copy link
Contributor

@ShuranZhang ShuranZhang commented Feb 12, 2025

We should only query information schema if the recorded tracked columns is less than the # of columns from a mod for NEW_VALUES and OLD_AND_NEW_VALUES. For these two val capture types, it is expected to have less columns in new values json compare to all tracking columns and this doesn't necessarily means there are schema updates.

Basically detectDiffColumnInMod should return false if keySetOfNewValuesJsonObject is a subset of spannerTable.getNonPkColumns().

Before the fix the information schema columns query would trigger more often than needed, whenever there are less columns in new values json compare to all tracking columns for NEW_VALUES and OLD_AND_NEW_VALUES. This is more likely to occur if user has just add a new column and only try to only update that new column(where mod will only contain the new column).

Copy link

codecov bot commented Feb 13, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 46.97%. Comparing base (b93cad8) to head (c5f89b7).

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2183   +/-   ##
=========================================
  Coverage     46.97%   46.97%           
- Complexity     4049     4050    +1     
=========================================
  Files           876      876           
  Lines         52193    52193           
  Branches       5502     5502           
=========================================
+ Hits          24517    24519    +2     
+ Misses        25920    25919    -1     
+ Partials       1756     1755    -1     
Components Coverage Δ
spanner-templates 68.91% <ø> (-0.01%) ⬇️
spanner-import-export 65.72% <ø> (-0.02%) ⬇️
spanner-live-forward-migration 76.50% <ø> (ø)
spanner-live-reverse-replication 78.77% <ø> (ø)
spanner-bulk-migration 87.94% <ø> (ø)
Files with missing lines Coverage Δ
...reamstobigquery/schemautils/SchemaUpdateUtils.java 65.45% <100.00%> (+5.45%) ⬆️

... and 1 file with indirect coverage changes

@@ -42,8 +42,8 @@ public static boolean detectDiffColumnInMod(
mod.getNewValuesJson() == ""
? new JSONObject("{}").keySet()
: new JSONObject(mod.getNewValuesJson()).keySet();
// At this mod's spannerCommitTimestamp, one column is added/dropped.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the user does end up dropping the column, but we don't update the stored schema?

Will we still do a read to Spanner for the dropped column? Will that cause NOT_FOUND errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are absolutely right. Seems querying info schema columns is unavoidable even if there is no schema changes but this mod did not modify all tracking columns. This feels pretty inefficient.

What do you think about making this configurable? ie: Disable support schema update handling through a config. If disabled, we don't perform these info schema queries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants