Avoid whole table scanning on incremental load #7016
Closed
luislema79
started this conversation in
Ideas
Replies: 1 comment
-
Glad you were able to find the answer you were looking for! Docs on |
-
dbt's incremental strategies currently support delta loads keyed on a unique key. However, when dbt merges into the destination table, the whole destination table is scanned for matching unique keys: if the key is not there, a new record is inserted, and if it is there, the existing record is updated. This is very slow when the destination table has billions of rows, because every run scans the entire table looking for those keys. Adding a clustering key on the unique key doesn't help, because a unique column has very high cardinality.

I'm proposing adding a date window to the incremental strategy configuration, so that the load only checks the past month or year of data instead of every record. Most of the time, duplicates arrive within a few months of the original record, so this would speed up the process and give users flexibility over how far back they want to check for duplicates. This change should apply to both the merge and delete+insert strategies.
UPDATE: This is already addressed here: #5702
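
To make the proposed date window concrete, here is a minimal sketch of a windowed delete+insert, written in Python against an in-memory SQLite table. It is only an illustration of the idea, not dbt's implementation: the table `dest`, its columns, and the function `windowed_delete_insert` are all hypothetical names, and SQLite does not prune partitions the way a warehouse might.

```python
import sqlite3
from datetime import date, timedelta


def windowed_delete_insert(conn, batch, today, lookback_days):
    """Replace matching rows, but only look for matches inside the date window.

    `batch` is a list of (id, event_date, value) tuples; all names here are
    hypothetical and exist only for this illustration.
    """
    cutoff = (today - timedelta(days=lookback_days)).isoformat()
    with conn:  # commit both statements together
        # The extra event_date filter is the "date window": rows older than
        # the cutoff are never considered, so an engine that can prune on
        # event_date would not need to scan them.
        conn.executemany(
            "DELETE FROM dest WHERE id = ? AND event_date >= ?",
            [(row_id, cutoff) for row_id, _, _ in batch],
        )
        conn.executemany(
            "INSERT INTO dest (id, event_date, value) VALUES (?, ?, ?)",
            batch,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dest (id INTEGER, event_date TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO dest VALUES (?, ?, ?)",
    [(1, "2020-01-01", "old"), (2, "2023-05-20", "recent")],
)

windowed_delete_insert(
    conn,
    batch=[(1, "2023-06-01", "a"), (2, "2023-06-01", "b"), (3, "2023-06-01", "c")],
    today=date(2023, 6, 1),
    lookback_days=30,
)

rows = conn.execute(
    "SELECT id, event_date, value FROM dest ORDER BY id, event_date"
).fetchall()
```

With a 30-day window, key 2 is replaced and key 3 is inserted, but key 1 ends up with two rows because its original record is older than the cutoff. That is the trade-off of the proposal: the window has to be wide enough to cover how late duplicates can arrive.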