Add Datasource Type Exclusion from Schema Refresh #7572

snickerjp · 2025-11-14T15:47:36Z

What type of PR is this?

Feature

Description

Adds a setting that excludes datasource types from the schema refresh process when they don't need, or can't perform, a schema refresh.

Background

Some datasource types (results, python, etc.) don't implement the get_schema method, causing NotSupported exceptions during schema refresh, which generates error logs and metrics.

Error logs before fix:

[WARNING] Failed refreshing schema for the data source: Query Results
Traceback (most recent call last):
  File "/app/redash/tasks/queries/maintenance.py", line 166, in refresh_schema
    ds.get_schema(refresh=True)
  File "/app/redash/query_runner/__init__.py", line 232, in get_schema
    raise NotSupported()
redash.query_runner.NotSupported
[INFO] task=refresh_schema state=failed ds_id=1 runtime=0.00

[WARNING] Failed refreshing schema for the data source: python
Traceback (most recent call last):
  ...
redash.query_runner.NotSupported
[INFO] task=refresh_schema state=failed ds_id=2 runtime=0.00

These datasources have no concept of a schema, so they should be excluded up front.

Changes

Flow Diagram

Before Fix:

flowchart TD
    Start[refresh_schemas start] --> Loop{Each datasource}
    Loop --> Paused{paused?}
    Paused -->|Yes| SkipPaused[Skip: paused]
    Paused -->|No| Blacklist{blacklist?}
    Blacklist -->|Yes| SkipBlacklist[Skip: blacklist]
    Blacklist -->|No| OrgDisabled{org.is_disabled?}
    OrgDisabled -->|Yes| SkipOrg[Skip: org_disabled]
    OrgDisabled -->|No| Execute[Execute refresh_schema]
    Execute --> Error{NotSupported exception}
    Error -->|results/python| ErrorLog[❌ Error logs]
    Error -->|pg/mysql etc| Success[✅ Success]
    SkipPaused --> Loop
    SkipBlacklist --> Loop
    SkipOrg --> Loop
    ErrorLog --> Loop
    Success --> Loop
    Loop --> End[Complete]

After Fix:

flowchart TD
    Start[refresh_schemas start] --> Loop{Each datasource}
    Loop --> Paused{paused?}
    Paused -->|Yes| SkipPaused[Skip: paused]
    Paused -->|No| Blacklist{blacklist?}
    Blacklist -->|Yes| SkipBlacklist[Skip: blacklist]
    Blacklist -->|No| TypeExcluded{type in EXCLUDED_TYPES?}
    TypeExcluded -->|Yes| SkipType[✅ Skip: type_excluded]
    TypeExcluded -->|No| OrgDisabled{org.is_disabled?}
    OrgDisabled -->|Yes| SkipOrg[Skip: org_disabled]
    OrgDisabled -->|No| Execute[Execute refresh_schema]
    Execute --> Success[✅ Success]
    SkipPaused --> Loop
    SkipBlacklist --> Loop
    SkipType --> Loop
    SkipOrg --> Loop
    Success --> Loop
    Loop --> End[Complete]

Implementation Details

New Setting

- SCHEMAS_REFRESH_EXCLUDED_TYPES: Set of datasource types to exclude
- Environment variable: REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES
- Default value: "results,python" (two types that definitely cause errors)

Schema Refresh Logic Update

- Added type exclusion check in refresh_schemas() function
- Excluded types are logged with reason=type_excluded
- Maintains consistency with existing exclusion mechanisms (blacklist, paused, org.is_disabled)
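
A minimal sketch of how the type exclusion check could sit between the existing skip conditions, in the order shown in the "After Fix" diagram. The `DataSource` dataclass, the `skip_reason` helper, and the module-level `EXCLUDED_TYPES` and `BLACKLIST` sets are hypothetical stand-ins for illustration; they are not Redash's models or the code in this PR.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for the settings described above.
EXCLUDED_TYPES = {"results", "python"}
BLACKLIST = set()  # IDs excluded by the existing blacklist mechanism


@dataclass
class DataSource:
    """Simplified, illustrative data source record."""
    id: int
    type: str
    paused: bool = False
    org_disabled: bool = False


def skip_reason(ds: DataSource) -> Optional[str]:
    """Return why a data source is skipped, or None if it should be refreshed."""
    if ds.paused:
        return "paused"
    if ds.id in BLACKLIST:
        return "blacklist"
    if ds.type in EXCLUDED_TYPES:  # the new check
        return "type_excluded"
    if ds.org_disabled:
        return "org_disabled"
    return None


sources = [DataSource(1, "results"), DataSource(2, "python"), DataSource(3, "pg")]
reasons = {ds.id: skip_reason(ds) for ds in sources}
for ds_id, reason in reasons.items():
    if reason is not None:
        print(f"task=refresh_schema state=skip ds_id={ds_id} reason={reason}")
```

Because the check only reads `ds.type`, it runs before any query runner is touched.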

Benefits

- Reduces unnecessary error logs and metrics
- Prevents wasteful endpoint access
- Improves schema refresh process efficiency

Usage

Default Behavior

If the environment variable is not set, results and python are excluded automatically.

Exclude Additional Types (.env file)

REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES=results,python,json,url
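
As a rough illustration of how a comma-separated value like the one above could become a set, here is a small sketch. The `load_excluded_types` helper is hypothetical (it is not the Redash settings helper), and the "results,python" fallback reflects the default as stated in this description.

```python
from typing import Mapping, Set

ENV_NAME = "REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES"


def load_excluded_types(environ: Mapping[str, str],
                        default: str = "results,python") -> Set[str]:
    """Parse a comma-separated list of type names, ignoring blanks."""
    raw = environ.get(ENV_NAME, default)
    return {item.strip() for item in raw.split(",") if item.strip()}


# Variable not set: the default applies.
unset = load_excluded_types({})
# Variable set to a longer list, as in the .env example.
extended = load_excluded_types({ENV_NAME: "results,python,json,url"})
# Variable set to an empty string: nothing is excluded.
disabled = load_excluded_types({ENV_NAME: ""})
print(sorted(unset), sorted(extended), sorted(disabled))
```

In a real process the mapping passed in would be `os.environ`; a plain dict is used here so the example does not depend on the surrounding environment.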

How is this tested?

Unit tests (pytest)
Manually

Unit Tests

New test:

test_skips_excluded_datasource_types: Verifies excluded types are correctly skipped

Existing test compatibility:

test_calls_refresh_of_all_data_sources: PASSED
test_skips_paused_data_sources: PASSED

Test Execution Results:

3 passed, 21 warnings in 9.07s
✅ test_calls_refresh_of_all_data_sources PASSED
✅ test_skips_excluded_datasource_types PASSED
✅ test_skips_paused_data_sources PASSED

Manual Testing (Verification)

Test Steps:

1. Create results and python datasources
2. Execute refresh_schemas()
3. Check logs

Execution Command:

docker compose exec worker python -c "
from redash import create_app
from redash.tasks.queries.maintenance import refresh_schemas
from redash import models

app = create_app()
with app.app_context():
    print('=== Data sources ===')
    for ds in models.DataSource.query:
        print(f'ID={ds.id} Name={ds.name} Type={ds.type}')
    print()
    print('=== Running refresh_schemas ===')
    refresh_schemas()
"

Execution Logs:

=== Data sources ===
ID=1 Name=Query Results Type=results
ID=2 Name=python Type=python
ID=3 Name=redash Type=pg

=== Running refresh_schemas ===
[INFO] task=refresh_schemas state=start
[INFO] task=refresh_schema state=skip ds_id=1 reason=type_excluded
[INFO] task=refresh_schema state=skip ds_id=2 reason=type_excluded
[INFO] task=refresh_schemas state=finish total_runtime=0.01

Verification Results:

✅ results and python correctly skipped (no errors)
✅ pg (PostgreSQL) executes normally (not appearing in logs is normal)
✅ Error logs and stack traces completely eliminated

Related Tickets & Documents

Fixes #7571

Mobile & Desktop Screenshots/Recordings (if there are UI changes)

N/A (backend-only changes)

Additional Information

Implementation Approach

Initially I tried to detect automatically whether a query runner implements get_schema, but abandoned that because:

- hasattr() cannot tell the difference, because get_schema is defined on BaseQueryRunner
- Checking whether the method is overridden is complex and hard to maintain
- The exception catching approach has a performance impact

I therefore adopted explicit type names. This approach is:

- Simple and easy to understand
- Reliable
- Flexibly controlled via an environment variable
- Consistent with other Redash settings (like ENABLED_QUERY_RUNNERS)
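
The detection problem can be illustrated with a small sketch. All class names here are simplified stand-ins rather than Redash's real runners, and `overrides_get_schema` is a hypothetical helper. It shows that `hasattr()` is true for every runner, and that an override check is misled by a runner that overrides get_schema only to raise NotSupported.

```python
class NotSupported(Exception):
    """Stand-in for redash.query_runner.NotSupported."""


class BaseQueryRunner:
    def get_schema(self, get_stats=False):
        raise NotSupported()


class PostgresLikeRunner(BaseQueryRunner):
    def get_schema(self, get_stats=False):
        return [{"name": "users", "columns": ["id"]}]


class PythonLikeRunner(BaseQueryRunner):
    pass  # inherits the base get_schema, which raises


class SheetsLikeRunner(BaseQueryRunner):
    def get_schema(self, get_stats=False):
        raise NotSupported()  # overrides, but is still unsupported


def overrides_get_schema(runner_cls) -> bool:
    """True if the class (or a parent other than the base) redefines get_schema."""
    return runner_cls.get_schema is not BaseQueryRunner.get_schema


for cls in (PostgresLikeRunner, PythonLikeRunner, SheetsLikeRunner):
    print(cls.__name__, hasattr(cls, "get_schema"), overrides_get_schema(cls))
```

The last class is reported as overriding get_schema even though calling it still fails, which is one way an override check can give a misleading answer.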

Datasource Types That Don't Need Schema Refresh

The following types don't implement the get_schema method and are candidates for exclusion:

- results - Query Results (references other query results)
- python - Python execution
- And potentially many others

Backward Compatibility

- The default value automatically excludes results and python in existing environments
- The previous behavior (attempt all datasources) can be restored by setting the environment variable to an empty string
- Existing exclusion mechanisms (blacklist, paused, org.is_disabled) are not affected


yoshiokatsuneo · 2025-11-14T16:16:06Z

Thank you for your PR and the detailed description!

Just a question.

"Exception catching approach has performance impact"

May I ask what kind of performance impact you are worried about?
I thought there is also the option of simply ignoring the NotSupported exception.

snickerjp · 2025-11-14T17:15:35Z

Thank you for the question!

You're right - the performance impact of exception catching would be minimal in this case. The concern was more about the implementation approach rather than actual performance.

The exception catching approach would look like:

try:
    ds.query_runner.get_schema(get_stats=False)
    refresh_schema.delay(ds.id)
except NotSupported:
    logger.info("skip: no schema support")

However, this approach has a conceptual issue: we'd be calling get_schema() just to check if it's supported, which feels wrong because:

- get_schema() is meant to actually retrieve schema, not to check capability
- Even with get_stats=False, it might still initialize connections or perform setup
- It's semantically unclear - the code looks like it's trying to get schema, but it's actually just checking support

Additionally, when there are many datasources:

- Exception catching would call get_schema() for every datasource during refresh_schemas() execution (every 30 minutes by default)
- Some query runners might initialize connections when accessing the query_runner property
- Python exception handling has overhead (stack unwinding, traceback creation)

With type-based exclusion:

- Skip check happens before any query runner instantiation
- O(1) set lookup: ds.type in EXCLUDED_TYPES
- No method calls, no exceptions, no overhead

That said, the performance difference is likely negligible in practice. The main benefit is code clarity and avoiding unnecessary method calls.

If the maintainers prefer the exception catching approach for better automatic detection, I'm happy to change it. What do you think?

yoshiokatsuneo · 2025-11-15T07:23:24Z

@snickerjp

Thank you very much for your detailed explanation!

"However, this approach has a conceptual issue: we'd be calling get_schema() just to check if it's supported, which feels wrong because: get_schema() is meant to actually retrieve schema, not to check capability"

Yes, but my feeling is that if we just ignore the exception, we are not "checking" but just "ignoring".

"Even with get_stats=False, it might still initialize connections or perform setup"

I think that, at least for the query_results / python data sources you described, calling get_schema() does not initialize any connections.

"It's semantically unclear - the code looks like it's trying to get schema, but it's actually just checking support"

My feeling is that once we ignore the error, the original issue is already solved.

"Exception catching would call get_schema() for every datasource during refresh_schemas() execution (every 30 minutes by default)"

Yes, that call might be pointless, although the performance impact would be minimal.

"Python exception handling has overhead (stack unwinding, traceback creation)"

I think the impact is very small (probably less than 0.1 sec?).

My feeling is that the attributes of each data source (e.g. whether schema listing is supported) are better encapsulated inside each data source class than defined in global variables, if possible.
If we need to detect whether each data source supports get_schema, I think we could add a method (e.g. "is_get_method_supported"?) to each data source class. (Although that may make the change bigger, and I'm not sure it is worth doing when the main issue, the error logging, is already solved.)

What do you think?

snickerjp · 2025-11-16T17:42:08Z

@yoshiokatsuneo

Thank you for the thoughtful feedback!

1. I understand your concern about encapsulation

You're absolutely right - ideally, each DataSource class should know its own capabilities rather than relying on global configuration. Adding is_get_schema_supported() method to each DataSource class would be the cleanest architectural solution.

2. The practical challenge

However, implementing this properly would require:

- Modifying 69 DataSource files (all files in redash/query_runner/)
- Adding the is_get_schema_supported() method to each query runner class
- Comprehensive testing across all data sources
- All community developers who create new data sources would need to understand and implement this method

While this is architecturally ideal, it introduces additional complexity for the community. At this point in time, requiring every data source developer to be aware of and implement this method doesn't seem like the best approach for the ecosystem.
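
For reference, one possible shape of the per-class capability method discussed above is sketched here. The classes are simplified stand-ins, and `is_get_schema_supported` as written (a classmethod with a default on the base class) is an assumption for illustration; nothing in this thread defines its actual signature or how it would be rolled out.

```python
class NotSupported(Exception):
    """Stand-in for redash.query_runner.NotSupported."""


class BaseQueryRunner:
    @classmethod
    def is_get_schema_supported(cls) -> bool:
        # Hypothetical hook; mirrors the base get_schema, which raises.
        return False

    def get_schema(self, get_stats=False):
        raise NotSupported()


class PostgresLikeRunner(BaseQueryRunner):
    @classmethod
    def is_get_schema_supported(cls) -> bool:
        return True

    def get_schema(self, get_stats=False):
        return [{"name": "users", "columns": ["id", "email"]}]


class ResultsLikeRunner(BaseQueryRunner):
    pass  # keeps the base default: schema listing not supported


runners = {"pg": PostgresLikeRunner, "results": ResultsLikeRunner}
refreshable = [name for name, cls in runners.items()
               if cls.is_get_schema_supported()]
print(refreshable)
```

The capability lives on the class, so the refresh task can ask each runner instead of consulting a global list of type names.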

3. Proposed approach: Empty default value

To address your encapsulation concern, I propose changing the default to empty string "":

SCHEMAS_REFRESH_EXCLUDED_TYPES = set_from_string(
    os.environ.get("REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES", "")
)

This way:

- By default, nothing is excluded - the code makes no assumptions about which types should be excluded
- Operators who encounter errors can explicitly set REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES="results,python"
- The implementation becomes a pure "opt-in exclusion mechanism"
- Fully controlled by environment variable, not hardcoded

This keeps the door open for future improvements (like is_get_schema_supported()) while solving the immediate error logging issue for operators who need it.

To be clear about the goal of this PR:

I want to add a mechanism to exclude data sources from the periodic refresh_schemas() task that runs in the worker. Specifically:

Data sources that don't support get_schema()
Data sources that operators explicitly want to exclude

This prevents unnecessary error logs from being generated every 30 minutes for data sources that will never support schema refresh.

What do you think about this approach?


yoshiokatsuneo · 2025-11-17T14:54:30Z

@snickerjp

Thank you for your detailed comment!
I agree that editing the 69 files is a bit of a hassle.

Also, may we clarify the purpose of this PR?
The PR description lists three purposes:

1. Reduces unnecessary error logs and metrics
2. Prevents wasteful endpoint access
3. Improves schema refresh process efficiency

I can understand purpose 1.

As for 2, I think this PR does not solve that problem (if "endpoint" means an API endpoint).
As for 3, I don't feel there is any performance problem.

And if the only problem this PR needs to solve is 1, I feel we can just remove the error logs without introducing an extra global variable.

What do you think?

snickerjp · 2025-11-17T17:33:15Z

@yoshiokatsuneo

Thank you for pointing this out! You're absolutely right.

After investigation, I found:

About Purpose 2 (Preventing endpoint access)

I verified the actual code and confirmed that no API access occurs before the NotSupported exception is raised.

Example: google_spreadsheets

def get_schema(self, get_stats=False):
    raise NotSupported()  # Exception raised immediately

def _get_spreadsheet_service(self):
    # API connection happens here, but it is NOT called from get_schema();
    # it is only called from run_query().
    ...

Other unsupported datasources (json, url, python, results, etc.) also raise NotSupported immediately in get_schema() without any external access.

About Purpose 3 (Performance improvement)

I measured the exception handling overhead and found it negligible (approximately 0.01 seconds per datasource).
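
A rough way to gauge the cost of raising and catching an exception in isolation is sketched below with `timeit`. This is not the measurement described above: the function names are hypothetical, and it times only the bare raise-and-catch, not task dispatch or logging.

```python
import timeit


class NotSupported(Exception):
    """Stand-in for redash.query_runner.NotSupported."""


def get_schema():
    # Mimics a runner whose get_schema raises immediately.
    raise NotSupported()


def refresh_ignoring_unsupported() -> bool:
    """Return True if a schema was fetched, False if the runner has none."""
    try:
        get_schema()
    except NotSupported:
        return False
    return True


calls = 100_000
total_seconds = timeit.timeit(refresh_ignoring_unsupported, number=calls)
per_call = total_seconds / calls
print(f"{per_call * 1e6:.3f} microseconds per raise-and-catch")
```

The printed figure varies by machine and Python version, so no expected value is given here.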

Conclusion

The actual problem to solve is only Purpose 1 (cleaning up error logs).

As you suggested, the exception handling approach is simpler and more appropriate. I've created a new PR #7573 which:

- Requires no global configuration
- Catches NotSupported exception in refresh_schema() and logs at DEBUG level
- Only 3 lines of changes
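
The idea of catching NotSupported and logging at DEBUG level can be sketched as follows. This is an illustration only, not the code in #7573: `FakeDataSource`, the return values, the logger name, and the `reason=not_supported` log text are all assumptions.

```python
import logging

logger = logging.getLogger("refresh_schema_sketch")


class NotSupported(Exception):
    """Stand-in for redash.query_runner.NotSupported."""


class FakeDataSource:
    """Illustrative data source; real ones come from redash.models."""

    def __init__(self, id, supports_schema):
        self.id = id
        self.supports_schema = supports_schema

    def get_schema(self, refresh=False):
        if not self.supports_schema:
            raise NotSupported()
        return []


def refresh_schema(ds) -> str:
    try:
        ds.get_schema(refresh=True)
    except NotSupported:
        # Expected for runners without schemas: log quietly, no traceback.
        logger.debug("task=refresh_schema state=skip ds_id=%s reason=not_supported", ds.id)
        return "skipped"
    except Exception:
        # Any other failure is still reported with a traceback.
        logger.warning("Failed refreshing schema for the data source: %s", ds.id, exc_info=True)
        return "failed"
    return "success"


outcomes = [refresh_schema(FakeDataSource(1, False)), refresh_schema(FakeDataSource(3, True))]
print(outcomes)
```

With the default logging configuration the DEBUG message is not emitted, so unsupported runners no longer produce warnings or stack traces.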

I'll close this PR in favor of the new approach. Thank you for the valuable feedback!

snickerjp added 3 commits November 14, 2025 14:49

Add datasource type exclusion from schema refresh

32cef44

- Add SCHEMAS_REFRESH_EXCLUDED_TYPES setting with default 'results,python'
- Add type-based exclusion check in refresh_schemas()
- Prevents unnecessary errors for datasources without schema support

Add test for datasource type exclusion

6be5135

Fix ruff W293: Remove whitespace from blank line

b2ef5e7

Change default SCHEMAS_REFRESH_EXCLUDED_TYPES to empty string

f173962

- Remove hardcoded 'results,python' default
- Allow operators to opt-in to exclusion via environment variable
- Code makes no assumptions about which types should be excluded

snickerjp force-pushed the feature/exclude-datasource-types-from-schema-refresh branch from f5515e2 to f173962 Compare November 16, 2025 17:46

snickerjp mentioned this pull request Nov 17, 2025

Feature/catch notsupported exception #7573

Open

4 tasks

snickerjp closed this Nov 17, 2025
