
Conversation

@snickerjp
Member

What type of PR is this?

  • Feature

Description

Added functionality to exclude datasource types that don't need, or can't perform, a schema refresh from the periodic schema refresh process.

Background

Some datasource types (results, python, etc.) don't implement the get_schema method, so schema refresh raises NotSupported exceptions, which generate error logs and metrics.

Error logs before fix:

[WARNING] Failed refreshing schema for the data source: Query Results
Traceback (most recent call last):
  File "/app/redash/tasks/queries/maintenance.py", line 166, in refresh_schema
    ds.get_schema(refresh=True)
  File "/app/redash/query_runner/__init__.py", line 232, in get_schema
    raise NotSupported()
redash.query_runner.NotSupported
[INFO] task=refresh_schema state=failed ds_id=1 runtime=0.00

[WARNING] Failed refreshing schema for the data source: python
Traceback (most recent call last):
  ...
redash.query_runner.NotSupported
[INFO] task=refresh_schema state=failed ds_id=2 runtime=0.00

These datasources have no concept of a schema, so they should be excluded from the start.

Changes

Flow Diagram

Before Fix:

flowchart TD
    Start[refresh_schemas start] --> Loop{Each datasource}
    Loop --> Paused{paused?}
    Paused -->|Yes| SkipPaused[Skip: paused]
    Paused -->|No| Blacklist{blacklist?}
    Blacklist -->|Yes| SkipBlacklist[Skip: blacklist]
    Blacklist -->|No| OrgDisabled{org.is_disabled?}
    OrgDisabled -->|Yes| SkipOrg[Skip: org_disabled]
    OrgDisabled -->|No| Execute[Execute refresh_schema]
    Execute --> Error{NotSupported exception}
    Error -->|results/python| ErrorLog[❌ Error logs]
    Error -->|pg/mysql etc| Success[✅ Success]
    SkipPaused --> Loop
    SkipBlacklist --> Loop
    SkipOrg --> Loop
    ErrorLog --> Loop
    Success --> Loop
    Loop --> End[Complete]

After Fix:

flowchart TD
    Start[refresh_schemas start] --> Loop{Each datasource}
    Loop --> Paused{paused?}
    Paused -->|Yes| SkipPaused[Skip: paused]
    Paused -->|No| Blacklist{blacklist?}
    Blacklist -->|Yes| SkipBlacklist[Skip: blacklist]
    Blacklist -->|No| TypeExcluded{type in EXCLUDED_TYPES?}
    TypeExcluded -->|Yes| SkipType[✅ Skip: type_excluded]
    TypeExcluded -->|No| OrgDisabled{org.is_disabled?}
    OrgDisabled -->|Yes| SkipOrg[Skip: org_disabled]
    OrgDisabled -->|No| Execute[Execute refresh_schema]
    Execute --> Success[✅ Success]
    SkipPaused --> Loop
    SkipBlacklist --> Loop
    SkipType --> Loop
    SkipOrg --> Loop
    Success --> Loop
    Loop --> End[Complete]

Implementation Details

  1. New Setting

    • SCHEMAS_REFRESH_EXCLUDED_TYPES: Set of datasource types to exclude
    • Environment variable: REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES
    • Default value: "results,python" (two types that definitely cause errors)
  2. Schema Refresh Logic Update

    • Added a type-exclusion check to the refresh_schemas() function (see the sketch after this list)
    • Excluded types are logged with reason=type_excluded
    • Maintains consistency with existing exclusion mechanisms (blacklist, paused, org.is_disabled)
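
A minimal sketch of both changes, assuming redash's set_from_string settings helper and the skip-logging pattern already visible in the execution logs below; the committed diff may differ:

# Sketch only -- mirrors the "After Fix" flow above, not the actual diff.
# redash/settings/__init__.py -- the new setting:
import os

from .helpers import set_from_string  # assumed location of the helper

SCHEMAS_REFRESH_EXCLUDED_TYPES = set_from_string(
    os.environ.get("REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES", "results,python")
)

# redash/tasks/queries/maintenance.py -- inside refresh_schemas(), next to
# the existing checks (models, logger, blacklist, and refresh_schema come
# from the surrounding module; blacklist is loaded from Redis in the real task):
for ds in models.DataSource.query:
    if ds.paused:
        logger.info("task=refresh_schema state=skip ds_id=%s reason=paused", ds.id)
    elif ds.id in blacklist:
        logger.info("task=refresh_schema state=skip ds_id=%s reason=blacklist", ds.id)
    elif ds.type in settings.SCHEMAS_REFRESH_EXCLUDED_TYPES:  # the new check
        logger.info("task=refresh_schema state=skip ds_id=%s reason=type_excluded", ds.id)
    elif ds.org.is_disabled:
        logger.info("task=refresh_schema state=skip ds_id=%s reason=org_disabled", ds.id)
    else:
        refresh_schema.delay(ds.id)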

Benefits

  • Reduces unnecessary error logs and metrics
  • Prevents wasteful endpoint access
  • Improves schema refresh process efficiency

Usage

Default Behavior

Without setting the environment variable, results and python are automatically excluded.

Exclude Additional Types (.env file)

REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES=results,python,json,url
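
For reference, the helper presumably splits the comma-separated value into a set (illustrative; the real implementation is assumed to live in redash/settings/helpers.py like redash's other parsing helpers):

from redash.settings.helpers import set_from_string

set_from_string("results,python,json,url")
# -> {"results", "python", "json", "url"}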

How is this tested?

  • Unit tests (pytest)
  • Manually

Unit Tests

New test:

  • test_skips_excluded_datasource_types: verifies that excluded types are correctly skipped (sketched below)
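
A hypothetical sketch of the new test, modeled on redash's existing test conventions; BaseTestCase, the data-source factory, and the patch targets are assumptions about the test harness, not the committed test:

from unittest import mock

from redash import settings
from redash.tasks.queries.maintenance import refresh_schemas
from tests import BaseTestCase


class TestRefreshSchemas(BaseTestCase):
    def test_skips_excluded_datasource_types(self):
        ds = self.factory.create_data_source()
        # Treat this data source's type as excluded, then verify the
        # per-datasource refresh task is never enqueued for it.
        with mock.patch.object(settings, "SCHEMAS_REFRESH_EXCLUDED_TYPES", {ds.type}), mock.patch(
            "redash.tasks.queries.maintenance.refresh_schema"
        ) as refresh:
            refresh_schemas()
            refresh.delay.assert_not_called()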

Existing test compatibility:

  • test_calls_refresh_of_all_data_sources: PASSED
  • test_skips_paused_data_sources: PASSED

Test Execution Results:

3 passed, 21 warnings in 9.07s
✅ test_calls_refresh_of_all_data_sources PASSED
✅ test_skips_excluded_datasource_types PASSED
✅ test_skips_paused_data_sources PASSED

Manual Testing (Verification)

Test Steps:

  1. Create results and python datasources
  2. Execute refresh_schemas()
  3. Check logs

Execution Command:

docker compose exec worker python -c "
from redash import create_app
from redash.tasks.queries.maintenance import refresh_schemas
from redash import models

app = create_app()
with app.app_context():
    print('=== Data sources ===')
    for ds in models.DataSource.query:
        print(f'ID={ds.id} Name={ds.name} Type={ds.type}')
    print()
    print('=== Running refresh_schemas ===')
    refresh_schemas()
"

Execution Logs:

=== Data sources ===
ID=1 Name=Query Results Type=results
ID=2 Name=python Type=python
ID=3 Name=redash Type=pg

=== Running refresh_schemas ===
[INFO] task=refresh_schemas state=start
[INFO] task=refresh_schema state=skip ds_id=1 reason=type_excluded
[INFO] task=refresh_schema state=skip ds_id=2 reason=type_excluded
[INFO] task=refresh_schemas state=finish total_runtime=0.01

Verification Results:

  • results and python correctly skipped (no errors)
  • pg (PostgreSQL) is refreshed normally (its absence from the skip logs is expected)
  • ✅ Error logs and stack traces completely eliminated

Related Tickets & Documents

Fixes #7571

Mobile & Desktop Screenshots/Recordings (if there are UI changes)

N/A (backend-only changes)


Additional Information

Implementation Approach

Initially I attempted to detect the presence of the get_schema method automatically, but abandoned this because:

  • hasattr() cannot detect support, because get_schema exists on BaseQueryRunner itself (see the sketch below)
  • Checking for a method override is complex and hard to maintain
  • Exception catching approach has performance impact
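
For illustration, here is why hasattr() is useless here and what an override check would involve; a sketch, where only BaseQueryRunner is a real redash name:

from redash.query_runner import BaseQueryRunner


def overrides_get_schema(runner) -> bool:
    # hasattr() is always True: every runner inherits get_schema() from
    # BaseQueryRunner, where it simply raises NotSupported.
    assert hasattr(runner, "get_schema")
    # Comparing against the base implementation detects an override, but
    # it is brittle across mixins and intermediate subclasses.
    return type(runner).get_schema is not BaseQueryRunner.get_schema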

Therefore, I adopted an explicit type-name approach. This approach:

  • Simple and easy to understand
  • Works reliably
  • Flexible control via an environment variable
  • Consistent with other Redash settings (like ENABLED_QUERY_RUNNERS)

Datasource Types That Don't Need Schema Refresh

The following types don't support the get_schema method and are candidates for exclusion:

  • results - Query Results (references other query results)
  • python - Python execution
  • And potentially many others

Backward Compatibility

  • Default value automatically excludes results and python in existing environments
  • Can revert to the previous behavior (attempt all datasources) by setting the environment variable to an empty string (example below)
  • Does not affect existing exclusion mechanisms (blacklist, paused, org.is_disabled)
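
For example, reverting in the .env file:

REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES=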

- Add SCHEMAS_REFRESH_EXCLUDED_TYPES setting with default 'results,python'
- Add type-based exclusion check in refresh_schemas()
- Prevents unnecessary errors for datasources without schema support
@yoshiokatsuneo
Contributor

yoshiokatsuneo commented Nov 14, 2025

Thank you for your PR with the detailed description!

Just a question.

Exception catching approach has performance impact

May I ask what kind of performance impact you are worried about?
I just thought there is also the option of ignoring the NotSupported exception.

@snickerjp
Member Author

snickerjp commented Nov 14, 2025

Thank you for the question!

You're right - the performance impact of exception catching would be minimal in this case. The concern was more about the implementation approach rather than actual performance.

The exception catching approach would look like:

try:
    # Probe whether this runner supports schema retrieval at all
    ds.query_runner.get_schema(get_stats=False)
    refresh_schema.delay(ds.id)
except NotSupported:
    logger.info("skip: no schema support")

However, this approach has a conceptual issue: we'd be calling get_schema() just to check if it's supported, which feels wrong because:

  1. get_schema() is meant to actually retrieve schema, not to check capability
  2. Even with get_stats=False, it might still initialize connections or perform setup
  3. It's semantically unclear - the code looks like it's trying to get schema, but it's actually just checking support

Additionally, when there are many datasources:

  • Exception catching would call get_schema() for every datasource during refresh_schemas() execution (every 30 minutes by default)
  • Some query runners might initialize connections when accessing the query_runner property
  • Python exception handling has overhead (stack unwinding, traceback creation)

With type-based exclusion:

  • Skip check happens before any query runner instantiation
  • O(1) set lookup: ds.type in EXCLUDED_TYPES
  • No method calls, no exceptions, no overhead

That said, the performance difference is likely negligible in practice. The main benefit is code clarity and avoiding unnecessary method calls.

If the maintainers prefer the exception catching approach for better automatic detection, I'm happy to change it. What do you think?

@yoshiokatsuneo
Contributor

yoshiokatsuneo commented Nov 15, 2025

@snickerjp

Thank you very much for your detailed explanation!

However, this approach has a conceptual issue: we'd be calling get_schema() just to check if it's supported, which feels wrong because:
get_schema() is meant to actually retrieve schema, not to check capability

Yes, but my feeling is that if we simply ignore the exception, it is not "checking" but just "ignoring".

Even with get_stats=False, it might still initialize connections or perform setup

I think that, at least for the query_results / python data sources you described, calling get_schema() does not initialize any connections.

It's semantically unclear - the code looks like it's trying to get schema, but it's actually just checking support

My feeling is that once we ignore the error, the original issue is already solved.

Exception catching would call get_schema() for every datasource during refresh_schemas() execution (every 30 minutes by default)

Yes, it might be meaningless. (Although the performance impact would be minimal.)

Python exception handling has overhead (stack unwinding, traceback creation)

I think the impact is very small. (Probably less than 0.1 sec?)

What I feel is that attributes of each Data Source (e.g., whether schema listing is supported) are best encapsulated inside each Data Source class rather than defined in global variables, if possible.
If we need to detect whether each Data Source supports get_schema or not, I think we could add a method (e.g., "is_get_schema_supported"?) to each Data Source class. (Although that would make the change bigger, and I'm not sure whether it is worth doing when the main issue (error logging) is already solved.)
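
A rough sketch of that suggestion; illustrative only, since no such method exists in redash today:

class BaseQueryRunner:  # stands in for redash's real base class
    def is_get_schema_supported(self) -> bool:
        return True  # default: assume schema listing works


class Python(BaseQueryRunner):  # e.g. the python data source
    def is_get_schema_supported(self) -> bool:
        return False  # no schema concept, opt out explicitly

# refresh_schemas() could then ask the runner itself, with no global setting:
#     if not ds.query_runner.is_get_schema_supported():
#         continue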

What do you think?

@snickerjp
Member Author

@yoshiokatsuneo

Thank you for the thoughtful feedback!

1. I understand your concern about encapsulation

You're absolutely right - ideally, each DataSource class should know its own capabilities rather than relying on global configuration. Adding an is_get_schema_supported() method to each DataSource class would be the cleanest architectural solution.

2. The practical challenge

However, implementing this properly would require:

  • Modifying 69 DataSource files (all files in redash/query_runner/)
  • Adding the is_get_schema_supported() method to each query runner class
  • Comprehensive testing across all data sources
  • All community developers who create new data sources would need to understand and implement this method

While this is architecturally ideal, it introduces additional complexity for the community. At this point in time, requiring every data source developer to be aware of and implement this method doesn't seem like the best approach for the ecosystem.

3. Proposed approach: Empty default value

To address your encapsulation concern, I propose changing the default to an empty string (""):

SCHEMAS_REFRESH_EXCLUDED_TYPES = set_from_string(
    os.environ.get("REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES", "")
)

This way:

  • By default, nothing is excluded - the code makes no assumptions about which types should be excluded
  • Operators who encounter errors can explicitly set REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES="results,python"
  • The implementation becomes a pure "opt-in exclusion mechanism"
  • Fully controlled by environment variable, not hardcoded

This keeps the door open for future improvements (like is_get_schema_supported()) while solving the immediate error logging issue for operators who need it.


To be clear about the goal of this PR:

I want to add a mechanism to exclude data sources from the periodic refresh_schemas() task that runs in the worker. Specifically:

  • Data sources that don't support get_schema()
  • Data sources that operators explicitly want to exclude

This prevents unnecessary error logs from being generated every 30 minutes for data sources that will never support schema refresh.

What do you think about this approach?

- Remove hardcoded 'results,python' default
- Allow operators to opt-in to exclusion via environment variable
- Code makes no assumptions about which types should be excluded
@snickerjp snickerjp force-pushed the feature/exclude-datasource-types-from-schema-refresh branch from f5515e2 to f173962 on November 16, 2025 at 17:46
@yoshiokatsuneo
Contributor

yoshiokatsuneo commented Nov 17, 2025

@snickerjp

Thank you for your detailed comment!
I agree that editing the 69 files is a bit of a hassle.

And may we clarify the purpose of this PR?
The PR description says there are three purposes to this PR.

  1. Reduces unnecessary error logs and metrics
  2. Prevents wasteful endpoint access
  3. Improves schema refresh process efficiency

I can understand purpose 1.

As for 2, I think this PR does not solve that problem (if "endpoint" means an API endpoint).
As for 3, I don't feel there is any performance problem.

And if the only problem this PR needs to solve is 1, I feel we can just remove the error logs, and we don't need to introduce an extra global variable.

What do you think?

@snickerjp
Member Author

@yoshiokatsuneo

Thank you for pointing this out! You're absolutely right.

After investigation, I found:

About Purpose 2 (Preventing endpoint access)

I verified the actual code and confirmed that no API access occurs before the NotSupported exception is raised.

Example: google_spreadsheets

def get_schema(self, get_stats=False):
    raise NotSupported()  # Exception raised immediately -- no API access

def _get_spreadsheet_service(self):
    # The API connection happens here, but this is NOT called from
    # get_schema() -- only from run_query()
    ...

Other unsupported datasources (json, url, python, results, etc.) also raise NotSupported immediately in get_schema() without any external access.

About Purpose 3 (Performance improvement)

I measured the exception handling overhead and found it negligible (approximately 0.01 seconds per datasource).

Conclusion

The actual problem to solve is only Purpose 1 (cleaning up error logs).

As you suggested, the exception handling approach is simpler and more appropriate. I've created a new PR #7573 which:

  • Requires no global configuration
  • Catches the NotSupported exception in refresh_schema() and logs at DEBUG level (sketched below)
  • Only 3 lines of changes
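
Presumably the change looks roughly like this inside refresh_schema(); a sketch, since the authoritative diff is in PR #7573:

from redash.query_runner import NotSupported

try:
    ds.get_schema(refresh=True)
except NotSupported:
    # Expected for runners without a schema concept; log quietly.
    logger.debug("task=refresh_schema ds_id=%s skipped: schema not supported", ds.id)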

I'll close this PR in favor of the new approach. Thank you for the valuable feedback!

@snickerjp snickerjp closed this Nov 17, 2025
2 participants