-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Add Datasource Type Exclusion from Schema Refresh #7572
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Datasource Type Exclusion from Schema Refresh #7572
Conversation
- Add SCHEMAS_REFRESH_EXCLUDED_TYPES setting with default 'results,python' - Add type-based exclusion check in refresh_schemas() - Prevents unnecessary errors for datasources without schema support
|
Thank you for your PR with the detailed description ! Just a question.
May I hear what kind of performance impact you are worrying ? |
|
Thank you for the question! You're right - the performance impact of exception catching would be minimal in this case. The concern was more about the implementation approach rather than actual performance. The exception catching approach would look like: try:
ds.query_runner.get_schema(get_stats=False)
refresh_schema.delay(ds.id)
except NotSupported:
logger.info("skip: no schema support")However, this approach has a conceptual issue: we'd be calling
Additionally, when there are many datasources:
With type-based exclusion:
That said, the performance difference is likely negligible in practice. The main benefit is code clarity and avoiding unnecessary method calls. If the maintainers prefer the exception catching approach for better automatic detection, I'm happy to change it. What do you think? |
|
Thank you very much for you detailed explanation !
Yes, but I'm just feeling, if we just ignore the exception, it is not "checking" but just "ignoring".
I think, at least for query_results / python data sources you described, calling get_schema() does not initialize the connections.
I'm just feeling that at the point we ignore the error, the original issue was already solved.
Yes, it might be meaningless. (Although, performance impact will be minimum.)
I think the impact is very little. (Probably less than 0.1sec ?) What I'm feeling is that the attributes(ex: schema listing is supported or not.) for each Data Source is nice to be encapsulated inside each Data Source class, is not defined at the global variables, if possible. How about ? |
|
Thank you for the thoughtful feedback! 1. I understand your concern about encapsulationYou're absolutely right - ideally, each DataSource class should know its own capabilities rather than relying on global configuration. Adding 2. The practical challengeHowever, implementing this properly would require:
While this is architecturally ideal, it introduces additional complexity for the community. At this point in time, requiring every data source developer to be aware of and implement this method doesn't seem like the best approach for the ecosystem. 3. Proposed approach: Empty default valueTo address your encapsulation concern, I propose changing the default to empty string SCHEMAS_REFRESH_EXCLUDED_TYPES = set_from_string(
os.environ.get("REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES", "")
)This way:
This keeps the door open for future improvements (like To be clear about the goal of this PR: I want to add a mechanism to exclude data sources from the periodic
This prevents unnecessary error logs from being generated every 30 minutes for data sources that will never support schema refresh. What do you think about this approach? |
- Remove hardcoded 'results,python' default - Allow operators to opt-in to exclusion via environment variable - Code makes no assumptions about which types should be excluded
f5515e2 to
f173962
Compare
|
Thank you for your detailed comment ! And, may we make clear the purpose of the this PR ?
I can understand the purpose 1 . For about 2, I think this PR does not solve the problem (if the "endpoint" means API endpoint.) And, if the the problem this PR need to solve only 1, I feel we can just remove the error logs and we don't need to introduce extra global variable. How about ? |
|
Thank you for pointing this out! You're absolutely right. After investigation, I found: About Purpose 2 (Preventing endpoint access)I verified the actual code and confirmed that no API access occurs before the NotSupported exception is raised. Example: google_spreadsheets def get_schema(self, get_stats=False):
raise NotSupported() # Exception raised immediately
def _get_spreadsheet_service(self):
# API connection happens here, but NOT called from get_schema()
# Only called from run_query()Other unsupported datasources (json, url, python, results, etc.) also raise NotSupported immediately in get_schema() without any external access. About Purpose 3 (Performance improvement)I measured the exception handling overhead and found it negligible (approximately 0.01 seconds per datasource). ConclusionThe actual problem to solve is only Purpose 1 (cleaning up error logs). As you suggested, the exception handling approach is simpler and more appropriate. I've created a new PR #7573 which:
I'll close this PR in favor of the new approach. Thank you for the valuable feedback! |
What type of PR is this?
Description
Added functionality to exclude datasource types that don't need or can't perform schema refresh from the schema refresh process.
Background
Some datasource types (
results,python, etc.) don't implement theget_schemamethod, causingNotSupportedexceptions during schema refresh, which generates error logs and metrics.Error logs before fix:
These datasources don't have the concept of schema, so they should be excluded from the beginning.
Changes
Flow Diagram
Before Fix:
flowchart TD Start[refresh_schemas start] --> Loop{Each datasource} Loop --> Paused{paused?} Paused -->|Yes| SkipPaused[Skip: paused] Paused -->|No| Blacklist{blacklist?} Blacklist -->|Yes| SkipBlacklist[Skip: blacklist] Blacklist -->|No| OrgDisabled{org.is_disabled?} OrgDisabled -->|Yes| SkipOrg[Skip: org_disabled] OrgDisabled -->|No| Execute[Execute refresh_schema] Execute --> Error{NotSupported exception} Error -->|results/python| ErrorLog[❌ Error logs] Error -->|pg/mysql etc| Success[✅ Success] SkipPaused --> Loop SkipBlacklist --> Loop SkipOrg --> Loop ErrorLog --> Loop Success --> Loop Loop --> End[Complete]After Fix:
flowchart TD Start[refresh_schemas start] --> Loop{Each datasource} Loop --> Paused{paused?} Paused -->|Yes| SkipPaused[Skip: paused] Paused -->|No| Blacklist{blacklist?} Blacklist -->|Yes| SkipBlacklist[Skip: blacklist] Blacklist -->|No| TypeExcluded{type in EXCLUDED_TYPES?} TypeExcluded -->|Yes| SkipType[✅ Skip: type_excluded] TypeExcluded -->|No| OrgDisabled{org.is_disabled?} OrgDisabled -->|Yes| SkipOrg[Skip: org_disabled] OrgDisabled -->|No| Execute[Execute refresh_schema] Execute --> Success[✅ Success] SkipPaused --> Loop SkipBlacklist --> Loop SkipType --> Loop SkipOrg --> Loop Success --> Loop Loop --> End[Complete]Implementation Details
New Setting
SCHEMAS_REFRESH_EXCLUDED_TYPES: Set of datasource types to excludeREDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES"results,python"(two types that definitely cause errors)Schema Refresh Logic Update
refresh_schemas()functionreason=type_excludedBenefits
Usage
Default Behavior
Without setting environment variable,
resultsandpythonare automatically excluded.Exclude Additional Types (.env file)
How is this tested?
Unit Tests
New test:
test_skips_excluded_datasource_types: Verifies excluded types are correctly skippedExisting test compatibility:
test_calls_refresh_of_all_data_sources: PASSEDtest_skips_paused_data_sources: PASSEDTest Execution Results:
Manual Testing (Verification)
Test Steps:
resultsandpythondatasourcesrefresh_schemas()Execution Command:
Execution Logs:
Verification Results:
resultsandpythoncorrectly skipped (no errors)pg(PostgreSQL) executes normally (not appearing in logs is normal)Related Tickets & Documents
Fixes #7571
Mobile & Desktop Screenshots/Recordings (if there are UI changes)
N/A (backend-only changes)
Additional Information
Implementation Approach
Initially attempted to automatically detect the presence of
get_schemamethod, but abandoned due to:hasattr()cannot detect becauseget_schemaexists inBaseQueryRunnerTherefore, adopted explicit type name specification approach. This approach:
ENABLED_QUERY_RUNNERS)Datasource Types That Don't Need Schema Refresh
The following types don't implement
get_schemamethod and are candidates for exclusion:results- Query Results (references other query results)python- Python executionBackward Compatibility
resultsandpythonin existing environments