Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Source Salesforce: Bulk stream uses async CDK components #45678

Merged
merged 29 commits into from
Oct 3, 2024

Conversation

maxi297
Copy link
Contributor

@maxi297 maxi297 commented Sep 19, 2024

What

Following the CDK release of the improve async job components, we will be able to release Salesforce that relies on those CDK components in order to sync bulk streams.

How

See #45673 for more details.

We are currently rolling out this change using dev version 2.5.32-dev.0a58b0968c. The goal for the progressive rollout is roughly:

  • 2024-09-19: internal workspaces
  • 2024-09-20: 5 external workspaces (at least one including parent streams) (see update here)
  • 2024-09-23: 20 external workspaces
  • 2024-09-24: Full release if no issues discovered

Review guide

See #45673 for more details.

User Impact

There should be no user impact as the goal is to make the maintenance easier for us.

There is one case we willingly changed the behavior and it his here where before, the connector would retry a whole job given we could not download the result. It has been removed because:

  • It is harder to port back to the new version of the CDK because the job creation and the download are now in two separate code paths (stream_slices and read_records instead of just read_records)
  • The value is very unclear as this was done as an attempted patch without being able to clearly monitor the value of the fix. (We see the warning log in the last 7 days). All of these happened on the same connection, 8bb4614b-cc29-41b0-bc7c-86d4f52bab62. As there were 5 logs for Downloading data failed after 0 retries. Retrying the whole job..., there were only two for Downloading data failed even after 1 retries. Stopping retry and raising exception which seems to indicate that retrying helped in that specific case or the stream stopped before. As the logs are a bit screwed up, I can't determine which one is true here. In any case, I would assume that we can push back if it's only for one customer and retrying would work on the next attempt/job.

Given that we were to see that case in prod, the fix would be to have AsyncRetriever.read_records create a new factory using it's factory and having just one slice.

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

Copy link

vercel bot commented Sep 19, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Oct 2, 2024 2:16pm

@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Sep 20, 2024
@maxi297 maxi297 changed the title Salesforce release without CDK release ✨ Source Salesforce: Bulk stream uses async CDK components Sep 20, 2024
@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Sep 22, 2024
@maxi297 maxi297 force-pushed the async-job-salesforce/salesforce-release branch from 884dfde to 41de741 Compare September 22, 2024 18:22
@octavia-squidington-iii octavia-squidington-iii removed the CDK Connector Development Kit label Sep 22, 2024
@maxi297
Copy link
Contributor Author

maxi297 commented Sep 22, 2024

We are getting a seg fault on some of the workspaces we've progressively rolled out to. We have two pre-build that might help us debug:

  • 2.6.0-dev.884dfdee72: Better handling on breaking errors during polling
  • 2.6.0-dev.41de741141: Better logging on seg fault crash

@maxi297 maxi297 force-pushed the async-job-salesforce/salesforce-release branch from 41de741 to 8d02497 Compare September 23, 2024 00:45
@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Sep 23, 2024
codeflash-ai bot added a commit that referenced this pull request Sep 23, 2024
…ob-salesforce/salesforce-release`)

Here are a few optimizations to enhance the performance of the program.

1. Remove multiple `import logging` statements.
2. Use lazy logging directly instead of creating a separate `lazy_log` function.
3. Simplify the job replacement logic to avoid unnecessary operations.

Here is the optimized version.



### Improvements
1. **Logging**.
    - Removed the separate `lazy_log` function and inlined the `isEnabledFor` check.
    - Doing this reduces function calls and enables directly logging only when necessary.
   
2. **Synchronization**.
    - Kept the lock usage the same to ensure thread safety, which is necessary for modifying the `_jobs` set.

These optimizations reduce the overhead and simplify the logic while ensuring the thread safety and functionality remain intact.
Copy link

codeflash-ai bot commented Sep 23, 2024

⚡️ Codeflash found optimizations for this PR

📄 JobTracker.add_job() in airbyte-cdk/python/airbyte_cdk/sources/declarative/async_job/job_tracker.py

📈 Performance improved by 24% (0.24x faster)

⏱️ Runtime went down from 38.5 microseconds to 31.0 microseconds

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch async-job-salesforce/salesforce-release).

@octavia-squidington-iii octavia-squidington-iii removed the CDK Connector Development Kit label Sep 25, 2024
"200k records",
],
)
def test_memory_download_data(stream_config, stream_api, n_records, first_size, first_peak):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since memory test is removed, should we add another integration test with @pytest.mark.limit_memory(" MB"), example:
https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/unit_tests/sources/declarative/decoders/test_json_decoder.py#L55-L56

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was indeed a memory issue. I created this PR to address the issue directly in the CDK. Once you approve it, I'll resolve this issue

Base automatically changed from async-job-salesforce/cdk-release to master October 1, 2024 12:48
Copy link
Collaborator

@artem1205 artem1205 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@maxi297
Copy link
Contributor Author

maxi297 commented Oct 2, 2024

/approve-regression-tests The check in the regression testing always seems to fail. Here is an example for this PR which only update dependencies

Check job output.

✅ Approving regression tests

@maxi297 maxi297 merged commit 07365b0 into master Oct 3, 2024
34 checks passed
@maxi297 maxi297 deleted the async-job-salesforce/salesforce-release branch October 3, 2024 12:42
@maxi297 maxi297 mentioned this pull request Nov 4, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/salesforce
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants