✨ Source Salesforce: Bulk stream uses async CDK components #45678
Conversation
We are getting a seg fault on some of the workspaces we've progressively rolled out to. We have two pre-builds that might help us debug:
⚡️ Codeflash found optimizations for this PR 📄
…ob-salesforce/salesforce-release`)
Here are a few optimizations to enhance the performance of the program:
1. Remove the duplicate `import logging` statements.
2. Use lazy logging directly instead of creating a separate `lazy_log` function.
3. Simplify the job replacement logic to avoid unnecessary operations.
Here is the optimized version.
### Improvements
1. **Logging**
   - Removed the separate `lazy_log` function and inlined the `isEnabledFor` check.
   - This reduces function calls and only builds log messages when necessary.
2. **Synchronization**
   - Kept the lock usage the same to ensure thread safety, which is necessary for modifying the `_jobs` set.
These optimizations reduce overhead and simplify the logic while keeping thread safety and functionality intact.
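As a rough illustration of the pattern the bot describes (inlining the `isEnabledFor` check and keeping the lock around mutations of `_jobs`), here is a minimal sketch; the class and method names are assumptions, not the connector's actual code:

```python
import logging
import threading

logger = logging.getLogger("airbyte")


class JobTracker:
    """Hypothetical job tracker, used only to illustrate the pattern above."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._jobs: set[str] = set()

    def replace_job(self, old_job_id: str, new_job_id: str) -> None:
        # The lock is kept so concurrent mutations of the shared _jobs set stay thread-safe.
        with self._lock:
            self._jobs.discard(old_job_id)
            self._jobs.add(new_job_id)
            # Inlined lazy logging: the f-string is only built when DEBUG is enabled,
            # which replaces a separate lazy_log helper function.
            if logger.isEnabledFor(logging.DEBUG):
                logger.debug(f"Replaced job {old_job_id} with {new_job_id}; tracking {len(self._jobs)} jobs")
```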
"200k records", | ||
], | ||
) | ||
def test_memory_download_data(stream_config, stream_api, n_records, first_size, first_peak): |
Since the memory test is removed, should we add another integration test with `@pytest.mark.limit_memory(" MB")`? For example:
https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/python/unit_tests/sources/declarative/decoders/test_json_decoder.py#L55-L56
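A minimal sketch of what such a test could look like, assuming the pytest-memray plugin (which provides the `limit_memory` marker); the memory threshold and the `generate_stream` / `read_full_refresh` helpers are assumptions borrowed from the existing test style, not a definitive implementation:

```python
import pytest


@pytest.mark.limit_memory("20 MB")  # threshold is an illustrative assumption
def test_bulk_stream_memory_usage(stream_config, stream_api):
    # Hypothetical integration test: stream all records of a bulk stream and let
    # pytest-memray fail the test if peak memory allocation exceeds the limit.
    stream = generate_stream("Account", stream_config, stream_api)  # helper assumed from the existing unit tests
    for _ in read_full_refresh(stream):  # helper assumed from the existing unit tests
        pass  # records are consumed and discarded; only memory usage matters here
```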
There was indeed a memory issue. I created this PR to address the issue directly in the CDK. Once you approve it, I'll resolve this issue
LGTM!
What
Following the CDK release of the improved async job components, we will be able to release a version of Salesforce that relies on those CDK components in order to sync bulk streams.
How
See #45673 for more details.
We are currently rolling out this change using dev version 2.5.32-dev.0a58b0968c. The goal for the progressive rollout is roughly:

Review guide
See #45673 for more details.
User Impact
There should be no user impact, as the goal is to make maintenance easier for us.
There is one case where we willingly changed the behavior: before, the connector would retry a whole job when we could not download the result. This retry has been removed because, looking at 8bb4614b-cc29-41b0-bc7c-86d4f52bab62, there were 5 logs for `Downloading data failed after 0 retries. Retrying the whole job...` but only two for `Downloading data failed even after 1 retries. Stopping retry and raising exception`, which seems to indicate that either retrying helped in that specific case or the stream stopped before. As the logs are a bit screwed up, I can't determine which one is true here. In any case, I would assume that we can push back if it's only for one customer and retrying would work on the next attempt/job.

If we were to see that case in prod, the fix would be to have `AsyncRetriever.read_records` create a new factory using its factory and having just one slice.
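For illustration, a minimal sketch of the old versus new download-failure handling described above; the function names, parameters, and retry counter are assumptions, not the connector's actual code:

```python
class DownloadError(Exception):
    """Raised when the bulk job results cannot be downloaded."""


def fetch_job_results(job_id, download, restart_job, max_job_retries=0):
    """With max_job_retries=1 this mimics the old behavior (re-run the whole bulk
    job once when the download fails); the default of 0 mimics the new behavior,
    where the exception is raised and the next sync attempt starts a fresh job."""
    attempts = 0
    while True:
        try:
            return download(job_id)
        except DownloadError:
            if attempts >= max_job_retries:
                # New behavior: stop retrying and raise the exception.
                raise
            attempts += 1
            # Old behavior: re-submit the whole bulk job and retry the download.
            job_id = restart_job(job_id)
```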
Can this PR be safely reverted and rolled back?