Skip to content

fix(sql): Add fallback to source_defined_primary_key in CatalogProvider #627

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jul 4, 2025

Conversation

devin-ai-integration[bot]
Copy link
Contributor

@devin-ai-integration devin-ai-integration bot commented Jun 27, 2025

Related:

Fix primary key fallback in CatalogProvider for SQL destinations

Summary

This PR fixes a critical bug in the Airbyte Python CDK where the CatalogProvider.get_primary_keys() method ignores source-defined primary keys when configured primary keys are empty or None. This affects all destinations using the CDK's SQL processor base classes, not just MotherDuck.

Root Cause: The CDK's CatalogProvider.get_primary_keys() method only uses ConfiguredAirbyteStream.primary_key and returns an empty list when it's falsy, without falling back to stream.source_defined_primary_key as specified in the Airbyte protocol.

Fix: Added fallback logic to use stream.source_defined_primary_key when primary_key is empty/None, with comprehensive unit test coverage.

Review & Testing Checklist for Human

⚠️ HIGH RISK CHANGE - This modifies core CDK logic affecting all SQL destinations

  • Verify protocol compliance: Confirm that the Airbyte protocol specification actually requires fallback to source_defined_primary_key when primary_key is empty/None
  • End-to-end testing: Test with actual destination connectors (especially MotherDuck, DuckDB, PostgreSQL) to verify the fix works correctly
  • Check for existing workarounds: Review other SQL destinations to see if any have implemented workarounds for this bug that might break with this change
  • Regression testing: Verify that destinations with explicitly set primary_key still work correctly (configured primary key should take precedence)
  • Performance impact: Confirm the additional configured_stream.stream.source_defined_primary_key lookup doesn't cause performance issues

Recommended Test Plan

  1. Create test cases with catalogs having empty primary_key but non-empty source_defined_primary_key
  2. Test sync operations with deduplication modes that rely on primary keys
  3. Verify SQL generation includes correct primary key columns from source-defined keys
  4. Test with multiple destinations to ensure no regressions

Diagram

graph TD
    A[airbyte_cdk/sql/shared/catalog_providers.py]:::major-edit
    B[unit_tests/sql/shared/test_catalog_providers.py]:::major-edit
    C[ConfiguredAirbyteStream.primary_key]:::context
    D[AirbyteStream.source_defined_primary_key]:::context
    E[SQL Destinations]:::context
    F[MotherDuck Destination]:::context
    G[DuckDB Destination]:::context
    H[Other SQL Destinations]:::context

    A --> C
    A --> D
    A --> E
    E --> F
    E --> G  
    E --> H
    B --> A

    subgraph Legend
        L1[Major Edit]:::major-edit
        L2[Minor Edit]:::minor-edit
        L3[Context/No Edit]:::context
    end

    classDef major-edit fill:#90EE90
    classDef minor-edit fill:#87CEEB
    classDef context fill:#FFFFFF
Loading

Notes

  • Documentation References: According to the Airbyte protocol, primary_key in ConfiguredAirbyteStream overrides source-defined primary keys when set, and source_defined_primary_key provides the default from the source
  • CI Test Failure: The MotherDuck destination test failure in CI is unrelated to this change (destinations don't support 'discover' operation)
  • Backward Compatibility: This change should be backward compatible as it only adds fallback behavior when no configured primary key exists
  • Test Coverage: Added 7 comprehensive unit tests covering all scenarios including configured precedence, fallback behavior, composite keys, and case normalization

Important

Auto-merge enabled.

This PR is set to merge automatically when all requirements are met.

- Fix CatalogProvider.get_primary_keys() to fall back to stream.source_defined_primary_key when primary_key is empty/None
- Add comprehensive unit tests covering all fallback scenarios
- Resolves bug where destinations ignore source-defined primary keys when configured primary key is not set
- Affects all destinations using CDK SQL processor base classes

Co-Authored-By: AJ Steers <[email protected]>
Copy link
Contributor Author

Original prompt from AJ Steers:

Received message in Slack channel #ask-devin-ai:

@Devin - we've identified a bug in the MotherDuck destination. Specifically, it looks like that implementation does not consider overridden primary key for streams described in the configured catalog. Instead, it seems to always use the source defined primary key. Can you look into this and (1) confirm, and (2) if confirmed propose a fix?

Copy link
Contributor Author

devin-ai-integration bot commented Jun 27, 2025

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions github-actions bot added the bug Something isn't working label Jun 27, 2025
Copy link

github-actions bot commented Jun 27, 2025

PyTest Results (Fast)

3 693 tests  +17   3 682 ✅ +17   6m 21s ⏱️ +5s
    1 suites ± 0      11 💤 ± 0 
    1 files   ± 0       0 ❌ ± 0 

Results for commit 10f39d2. ± Comparison against base commit 5824a5e.

♻️ This comment has been updated with latest results.

Copy link

github-actions bot commented Jun 27, 2025

PyTest Results (Full)

3 696 tests  +17   3 685 ✅ +17   18m 0s ⏱️ ±0s
    1 suites ± 0      11 💤 ± 0 
    1 files   ± 0       0 ❌ ± 0 

Results for commit 10f39d2. ± Comparison against base commit 5824a5e.

♻️ This comment has been updated with latest results.

@aaronsteers aaronsteers self-requested a review June 27, 2025 23:10
@aaronsteers
Copy link
Contributor

Looks like our new test suite assumes all Python connectors are sources. Since this is a destination and doesn't support 'discover', the discover test fails.

Nothing we need to solve here in this PR and not related to our change.

E           AssertionError: Expected no errors but got 1: 
E           
E           AirbyteErrorTraceMessage(message='Something went wrong in the connector. See the logs for more details.', internal_message="'DestinationMotherDuck' object has no attribute 'discover'", stack_trace='Traceback (most recent call last):
E             File "/home/runner/.cache/pypoetry/virtualenvs/airbyte-destination-motherduck-BV7hPlm3-py3.11/lib/python3.11/site-packages/airbyte_cdk/test/entrypoint_wrapper.py", line 325, in _run_command
E               for message in source_entrypoint.run(parsed_args):
E             File "/home/runner/.cache/pypoetry/virtualenvs/airbyte-destination-motherduck-BV7hPlm3-py3.11/lib/python3.11/site-packages/airbyte_cdk/entrypoint.py", line 199, in run
E               yield from map(
E             File "/home/runner/.cache/pypoetry/virtualenvs/airbyte-destination-motherduck-BV7hPlm3-py3.11/lib/python3.11/site-packages/airbyte_cdk/entrypoint.py", line 266, in discover
E               catalog = self.source.discover(self.logger, config)
E                         ^^^^^^^^^^^^^^^^^^^^
E           AttributeError: \'DestinationMotherDuck\' object has no attribute \'discover\'
E           ', failure_type=<FailureType.system_error: 'system_error'>, stream_descriptor=None)

/home/runner/.cache/pypoetry/virtualenvs/airbyte-destination-motherduck-BV7hPlm3-py3.11/lib/python3.11/site-packages/airbyte_cdk/test/standard_tests/_job_runner.py:128: AssertionError
______________ TestSuite.test_basic_read['config' Test Scenario] _______________

@aaronsteers
Copy link
Contributor

/bump-version

devin-ai-integration bot added a commit to airbytehq/airbyte that referenced this pull request Jun 28, 2025
…y fix

- Update airbyte-cdk dependency to use dev branch devin/1751064114-fix-primary-key-fallback
- This branch contains the fix for primary key fallback logic in CatalogProvider
- Allows end-to-end testing of the primary key fix with MotherDuck destination
- References CDK PR: airbytehq/airbyte-python-cdk#627

Co-Authored-By: AJ Steers <[email protected]>
devin-ai-integration bot and others added 2 commits June 28, 2025 05:22
Address @aaronsteers feedback to use more concise approach:
pks = configured_stream.primary_key or configured_stream.stream.source_defined_primary_key or []

This maintains exact same functionality while being more readable.

Co-Authored-By: AJ Steers <[email protected]>
- Consolidate 7 individual test methods into 1 parametrized test with 6 scenarios
- Use list[str] format for parameters with wrapping logic in test
- Remove case normalization test since CatalogProvider doesn't handle normalization
- Reduce code from ~150 lines to ~50 lines while maintaining full coverage

Co-Authored-By: AJ Steers <[email protected]>
devin-ai-integration bot and others added 3 commits June 28, 2025 22:54
…behavior

- Add explanation that method uses primary_key if set explicitly in configured catalog
- Otherwise falls back to source_defined_primary_key if set
- Addresses GitHub comment from @aaronsteers

Co-Authored-By: AJ Steers <[email protected]>
- Fix Ruff Format Check CI failure
- Apply proper formatting to expanded docstring in get_primary_keys method

Co-Authored-By: AJ Steers <[email protected]>
… unset

- Reverse priority order to favor source_defined_primary_key over primary_key
- Return None when neither primary key field is set
- Update all callers to handle None gracefully with 'pks or []' coalescing
- Update unit tests to reflect new behavior and priority order
- Addresses @aaronsteers feedback on primary key fallback logic

Co-Authored-By: AJ Steers <[email protected]>
…perations

- Replace 'or []' coalescing with explicit guard statements in merge methods
- Raise AirbyteInternalError when no primary keys available for merge operations
- Addresses @aaronsteers feedback on code clarity and error handling
- Ensures merge operations fail fast when primary keys are missing

Co-Authored-By: AJ Steers <[email protected]>
- Split long docstring line in get_primary_keys method across multiple lines
- Addresses @aaronsteers feedback on line length issue in PR comment
- Maintains readability while complying with formatting standards

Co-Authored-By: AJ Steers <[email protected]>
@aaronsteers aaronsteers changed the title fix(sql): Add fallback to source_defined_primary_key in CatalogProvider fix(sql): Add fallback to source_defined_primary_key in CatalogProvider Jul 3, 2025
…lues

- Updated docstring to specify 'primary key column names' instead of just 'primary keys'
- Added Returns section explaining the method returns column names or None
- Addresses @dbgold17's clarifying question about terminology in GitHub comment

Co-Authored-By: AJ Steers <[email protected]>
Copy link
Contributor Author

Hi @dbgold17! Great question about the terminology. I've just updated the docstring to clarify this.

The method returns column names (not actual key values). Specifically:

  • Return type: list[str] | None - these strings are column names
  • Usage: The returned strings are used for SQL column references and identifier quoting
  • Updated docstring: Now says "primary key column names" and includes a Returns section

The confusion was understandable since the original docstring just said "primary keys" which could be ambiguous. Thanks for pointing this out!

Copy link
Contributor

@dbgold17 dbgold17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed on a huddle. LGTM minus some variable name confusion

@aaronsteers aaronsteers enabled auto-merge (squash) July 4, 2025 00:53
@aaronsteers aaronsteers merged commit acb92dd into main Jul 4, 2025
25 of 26 checks passed
@aaronsteers aaronsteers deleted the devin/1751064114-fix-primary-key-fallback branch July 4, 2025 01:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants