Optimize join tables from different databases: executor #10146

Merged
merged 6 commits into from
Nov 14, 2024

@ea-rus ea-rus commented Nov 11, 2024

Description

Updates:

A subselect step is used to get the values from the previous step's data:

select distinct <column> from <step data1>

The next step fetches data using these values as a filter:

select * from db2.table2 where <column> in (<ids from previous step>)
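
The two steps above can be sketched roughly as follows. This is only an illustration using two in-memory SQLite databases from the standard library, not the MindsDB executor code; the function `fetch_with_subselect` and the table names `table1` and `table2` are hypothetical, and the column name is assumed to be a trusted identifier.

```python
import sqlite3


def fetch_with_subselect(conn1, conn2, column):
    """Hypothetical helper: filter table2 by the distinct keys found in table1."""
    # Step 1 (subselect): collect the distinct values from the previous step's data.
    keys = [row[0] for row in conn1.execute(f"SELECT DISTINCT {column} FROM table1")]
    if not keys:
        # Nothing to match, so skip the second query entirely.
        return []
    # Step 2: fetch only the rows whose column value is in the collected keys.
    placeholders = ", ".join("?" for _ in keys)
    query = f"SELECT * FROM table2 WHERE {column} IN ({placeholders})"
    return conn2.execute(query, keys).fetchall()


# Two separate databases stand in for db1 and db2.
db1 = sqlite3.connect(":memory:")
db1.execute("CREATE TABLE table1 (id INTEGER, name TEXT)")
db1.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, "a"), (2, "b"), (2, "c")])

db2 = sqlite3.connect(":memory:")
db2.execute("CREATE TABLE table2 (id INTEGER, value TEXT)")
db2.executemany("INSERT INTO table2 VALUES (?, ?)", [(1, "x"), (2, "y"), (3, "z")])

rows = fetch_with_subselect(db1, db2, "id")
```

The point of the pattern is that only the rows of the second table that can actually match are transferred, instead of the whole table being pulled before the join.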

Side fix:
If a join has no condition:

  • add a 0=0 filter (multiplies rows)
  • but apply a limit: if the expected number of rows exceeds the limit, raise an exception
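
A minimal sketch of this guard, assuming rows are plain tuples held in lists. The names `cross_join`, `JoinTooLargeError`, and `MAX_CROSS_JOIN_ROWS` are hypothetical; the threshold mirrors the `10 ** 7` comparison in the diff in this PR.

```python
from itertools import product

# Assumed default, mirroring the 10 ** 7 comparison in the PR diff.
MAX_CROSS_JOIN_ROWS = 10 ** 7


class JoinTooLargeError(Exception):
    """Hypothetical exception for a join whose result would be too large."""


def cross_join(left_rows, right_rows, max_rows=MAX_CROSS_JOIN_ROWS):
    """Join every left row with every right row, refusing oversized results."""
    # The result size is known before any rows are produced.
    expected = len(left_rows) * len(right_rows)
    if expected >= max_rows:
        raise JoinTooLargeError(
            f"Join without condition would produce {expected} rows "
            f"(limit is {max_rows})"
        )
    # Equivalent to joining on an always-true condition such as 0=0.
    return [left + right for left, right in product(left_rows, right_rows)]
```

Checking the product of the two lengths first means the exception is raised before any memory is spent on the joined rows.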

Depends on mindsdb/mindsdb_sql#412

Fixes #issue_number

Type of change

  • ⚡ New feature (non-breaking change which adds functionality)

Verification Process

To ensure the changes are working as expected:

  • Test Location: Specify the URL or path for testing.
  • Verification Steps: Outline the steps or queries needed to validate the change. Include any data, configurations, or actions required to reproduce or see the new functionality.

Additional Media:

  • I have attached a brief loom video or screenshots showcasing the new functionality or change.

Checklist:

  • [x] My code follows the style guidelines (PEP 8) of MindsDB.
  • I have appropriately commented on my code, especially in complex areas.
  • Necessary documentation updates are either made or tracked in issues.
  • Relevant unit and integration tests are updated or added.

@ea-rus ea-rus requested a review from StpMax November 11, 2024 15:09
@@ -74,7 +74,11 @@ def adapt_condition(node, **kwargs):
return Identifier(parts=['table_b', col_name])

if step.query.condition is None:
raise NotSupportedYet('Unable to join table without condition')
# prevent memory overflow
if len(left_data) * len(right_data) < 10 ** 7:
Contributor

Are left_data and right_data dataframes? If so, it may be better to get the real size (df.memory_usage(index=True, deep=True).sum()) and compare it with free memory.

Contributor Author

They are ResultSets

@ZoranPandovski ZoranPandovski merged commit f71ba0c into main Nov 14, 2024
14 checks passed
@ZoranPandovski ZoranPandovski deleted the cte-support branch November 14, 2024 13:28
@mindsdb mindsdb locked and limited conversation to collaborators Nov 14, 2024
3 participants