@quocnguyendinh quocnguyendinh commented Nov 7, 2025

https://kaligo.atlassian.net/browse/DATA-15789

Background

As mentioned in this thread, diffa needs a mechanism to improve the accuracy of its diff checks.

Design

  • Add an option for users to supply diff dimensions (--diff-dimensions):
diffa data-diff --source-database 'guardhouse_staging' --source-table 'users' --target-schema 'guardhouse' --target-table 'users' --diff-dimensions status
  • When diff-dimensions are specified, diffa groups by these columns in addition to the check date. In the SELECT, it casts those columns to text so that the type is consistent, for simplicity:
    SELECT
        created_at::DATE AS check_date,
        COUNT(*) AS cnt,
        status::text
    FROM public.users
    WHERE (
        created_at::DATE > '2025-11-05'
        AND created_at::DATE <= CURRENT_DATE - INTERVAL '2 DAY'
    )
    GROUP BY created_at::DATE, status
    ORDER BY created_at::DATE ASC
  • In addition, users now get detailed check stats per dimension value for each check_date:
            - 2025-11-03:
                summary: 
                    ✅ No Diff MergedCountCheck(source_count=116091, target_count=116091, check_date=datetime.date(2025, 11, 3), is_valid=True)
                detailed: 
                    ["✅ No Diff MergedCountCheck(source_count=26482, target_count=26482, check_date=datetime.date(2025, 11, 3), status='closed', is_valid=True)", "✅ No Diff MergedCountCheck(source_count=62840, target_count=62840, check_date=datetime.date(2025, 11, 3), status='active', is_valid=True)", "✅ No Diff MergedCountCheck(source_count=19, target_count=19, check_date=datetime.date(2025, 11, 3), status='admin', is_valid=True)", "✅ No Diff MergedCountCheck(source_count=26750, target_count=26750, check_date=datetime.date(2025, 11, 3), status='blocked', is_valid=True)"]
  • A full-diff mode is supported, allowing a table to be re-checked from the very beginning.
    • If the table already has check data in the DB, diffa will NOT use that data for the diff checks. It always re-checks the table as if it were the first run.
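
For reviewers, the query construction described above can be sketched roughly as follows. This is a minimal illustration, not diffa's actual code: build_count_query and effective_last_check are hypothetical names, the created_at column and the 2-day lag are copied from the example query, and treating a first run as having no lower date bound is an assumption.

```python
from datetime import date
from typing import Optional, Sequence


def effective_last_check(stored: Optional[date], full_diff: bool) -> Optional[date]:
    """Hypothetical helper: in full-diff mode, ignore any stored check data."""
    # Returning None makes the run behave like a first run (assumption).
    return None if full_diff else stored


def build_count_query(
    table: str, dimensions: Sequence[str], start: Optional[date]
) -> str:
    """Build a per-day count query, optionally grouped by dimension columns.

    Identifiers are interpolated unquoted for illustration only; real code
    should quote or validate them.
    """
    # Cast every dimension to text so source and target compare on one type.
    dim_select = "".join(f",\n    {col}::text AS {col}" for col in dimensions)
    dim_group = "".join(f", {col}" for col in dimensions)

    conditions = ["created_at::DATE <= CURRENT_DATE - INTERVAL '2 DAY'"]
    if start is not None:
        # Only check dates after the last stored check.
        conditions.insert(0, f"created_at::DATE > '{start.isoformat()}'")
    where = "\n  AND ".join(conditions)

    return (
        "SELECT\n"
        "    created_at::DATE AS check_date,\n"
        "    COUNT(*) AS cnt"
        f"{dim_select}\n"
        f"FROM {table}\n"
        f"WHERE {where}\n"
        f"GROUP BY created_at::DATE{dim_group}\n"
        "ORDER BY created_at::DATE ASC"
    )


# Incremental run: resume after the stored check date.
start = effective_last_check(date(2025, 11, 5), full_diff=False)
incremental_sql = build_count_query("public.users", ["status"], start)

# Full-diff run: the stored check date is ignored.
full_sql = build_count_query(
    "public.users", ["status"], effective_last_check(date(2025, 11, 5), full_diff=True)
)
```

With no dimensions the sketch falls back to a plain per-day count, which corresponds to the summary line in the output above.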

Impact

  • Users can now provide diff-dimensions, and diffa will group by them. The more dimensions are specified, the more accurate the diff, at the cost of performance.
  • Users can enable the full-diff mode to re-check a particular table.
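
To illustrate the accuracy point, here is a small sketch in which the daily totals match while the per-status counts do not. It is not diffa's implementation: diff_counts and the sample rows are made up for this example, and rows are assumed to be dicts with a check_date key.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Mapping, Sequence, Tuple


def diff_counts(
    source_rows: Iterable[Mapping[str, object]],
    target_rows: Iterable[Mapping[str, object]],
    dimensions: Sequence[str],
) -> Dict[tuple, Tuple[int, int]]:
    """Return {(check_date, *dims): (source_count, target_count)} for mismatches."""

    def tally(rows: Iterable[Mapping[str, object]]) -> Counter:
        # Dimension values are stringified, mirroring the ::text cast in the query.
        return Counter(
            (row["check_date"], *(str(row[dim]) for dim in dimensions))
            for row in rows
        )

    source, target = tally(source_rows), tally(target_rows)
    return {
        key: (source[key], target[key])
        for key in source.keys() | target.keys()
        if source[key] != target[key]
    }


day = date(2025, 11, 3)
source_rows = [
    {"check_date": day, "status": "active"},
    {"check_date": day, "status": "active"},
    {"check_date": day, "status": "closed"},
]
target_rows = [
    {"check_date": day, "status": "active"},
    {"check_date": day, "status": "closed"},
    {"check_date": day, "status": "closed"},
]

# Without dimensions both sides count 3 rows for the day, so nothing is flagged.
no_dims = diff_counts(source_rows, target_rows, [])

# Grouping by status exposes the mismatch hidden behind the equal totals.
by_status = diff_counts(source_rows, target_rows, ["status"])
```

Each extra dimension multiplies the number of groups to count and compare, which is where the performance cost comes from.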

Caveats

n/a

Testing

Works well in my local environment.
Screenshot 2025-11-07 at 16 42 04
Screenshot 2025-11-07 at 16 41 40

  • full-diff mode
    Uploading Screenshot 2025-11-10 at 10.02.19.png…

Docs

n/a

Author Checklist

For author: please complete the checklist before marking the PR as ready to review.

Github Bot Commands

  • Type 'ferment n days' in a comment to apply the '🍞 fermenting' label and receive a notification in n days about it. n must be a positive integer. The bot will react with 👀 if it understood your request.
  • If you apply the 'auto_develop_merging' label to a PR, the bot will try to merge it to staging. If there is a conflict or a build failure, the merge will fail and you need to merge manually.

@quocnguyendinh quocnguyendinh self-assigned this Nov 7, 2025

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

@yvonnekong28 yvonnekong28 left a comment

New dimensions and full diff functionality looks good 💯

  • Low risk - no impact to elt-run. Good to test it out.
