feat: add Databricks integration #14
Open
dgokeeffe wants to merge 16 commits into opennem:main from dgokeeffe:main
Conversation
- Add PySpark data source integration with automatic schema detection
- Implement `to_pyspark()` methods for all response types with graceful fallbacks
- Add comprehensive PySpark test suite covering facilities, market, and network data
- Include Databricks integration examples and ETL workflows
- Add performance optimization utilities and error handling
- Support both local PySpark and Databricks environments
- Keep PySpark completely optional: the SDK works without it installed
- Add examples demonstrating PySpark functionality and fallbacks
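A minimal sketch of the graceful-fallback pattern this commit describes, assuming a lazy PySpark import inside `to_pyspark()`; the class name and sample record are illustrative, not the PR's exact code:

```python
# Sketch of the optional-dependency pattern: PySpark is imported lazily
# inside to_pyspark() so the SDK still works when it is not installed.
from typing import Any


class TimeseriesResponse:
    def to_records(self) -> list[dict[str, Any]]:
        # Flatten the API response into plain row dicts (details omitted).
        return [{"interval": "2024-01-01T00:00:00", "value": 123.4}]

    def to_pyspark(self):
        try:
            from pyspark.sql import SparkSession
        except ImportError:
            # Graceful fallback: PySpark is not installed, return None.
            return None
        spark = SparkSession.builder.getOrCreate()
        # Spark infers the schema automatically from the flattened records.
        return spark.createDataFrame(self.to_records())
```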
- Add test_facilities_data.py: comprehensive testing of facility data parsing and validation with real API responses
- Add test_market_metrics.py: test market metrics functionality and API response handling
- Add test_sync_client.py: complete test suite for the synchronous OEClient implementation, including error handling, session management, and API methods
- Add test_timezone_handling.py: test timezone handling in PySpark DataFrame conversions
- Add tests/conftest.py: centralized pytest fixtures for API keys, clients, and test configuration
- Add tests/README.md: comprehensive documentation for test suite setup, running, and fixture usage
- Update pyproject.toml: register custom pytest markers (slow, integration) to eliminate warnings

The test suite includes:
- Unit tests for client initialization and configuration
- Integration tests for API endpoints (facilities, market, and network data)
- PySpark DataFrame conversion tests with timezone handling
- Error handling and edge case testing
- Proper fixture management with graceful skipping when dependencies are unavailable
- Comprehensive documentation for test setup and execution

All tests pass, with proper skipping for missing API keys or dependencies.
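A sketch of the graceful-skip fixture style described for tests/conftest.py; the environment variable name is an assumption, not confirmed from the PR:

```python
# Session-scoped fixture that skips integration tests when no API key is set,
# rather than failing them.
import os

import pytest


@pytest.fixture(scope="session")
def api_key() -> str:
    key = os.environ.get("OPENELECTRICITY_API_KEY")
    if not key:
        pytest.skip("OPENELECTRICITY_API_KEY not set; skipping integration tests")
    return key
```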
dgokeeffe (Author): Any chance we can merge this in, @nc9?
dgokeeffe (Author): I can resolve the merge conflicts, but after that, can we proceed with the merge, please?
- Remove pydantic-settings dependency
- Simplify conftest.py fixtures
- Add new examples and type exports
- Update to version 0.9.3
- Remove settings_schema.py module
- Fixed _build_url method that was duplicating /v4 in endpoint URLs
- Resolves 404 errors when calling market API endpoints
- Added diagnostic logging to databricks_etl.py
- Corrected demand_energy unit label from GWh to MWh
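An illustrative reconstruction of the /v4 duplication fix; the real method lives in openelectricity/client.py and the base URL shown is for illustration:

```python
# Strip a duplicate version segment from the endpoint before joining, since
# the base URL already carries the /v4 prefix.
class OEClient:
    base_url = "https://api.openelectricity.org.au/v4"

    def _build_url(self, endpoint: str) -> str:
        endpoint = endpoint.lstrip("/")
        if endpoint.startswith("v4/"):
            endpoint = endpoint[len("v4/"):]
        return f"{self.base_url.rstrip('/')}/{endpoint}"
```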
Extract location data from the nested location object in facilities API responses to enable geospatial analysis.

Changes:
- Update to_records() and to_pyspark() to extract latitude/longitude
- Handle missing location data gracefully (None values)
- Add 9 tests for location extraction functionality
- Update existing tests to expect the new columns

DataFrame output now includes latitude and longitude columns.
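A sketch of the nested-location extraction this commit describes; the field names inside the location object are assumptions:

```python
# Flatten a facility dict, promoting nested location fields to top-level
# latitude/longitude columns so every record has the same shape.
from typing import Any


def flatten_facility(facility: dict[str, Any]) -> dict[str, Any]:
    location = facility.get("location") or {}
    record = {k: v for k, v in facility.items() if k != "location"}
    # Missing location data falls back to None so the DataFrame columns
    # are always present.
    record["latitude"] = location.get("lat")
    record["longitude"] = location.get("lng")
    return record
```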
Add `upload-databricks` target that builds the wheel and uploads it to a Unity Catalog volume using the upload_wheel_to_volume.py script.

Usage: `make upload-databricks`
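A hypothetical sketch of what upload_wheel_to_volume.py might do, using the databricks-sdk Files API; the volume path is a placeholder, and authentication is assumed to come from the environment or ~/.databrickscfg:

```python
# Upload the built wheel from dist/ to a Unity Catalog volume.
from pathlib import Path

from databricks.sdk import WorkspaceClient

VOLUME_PATH = "/Volumes/main/default/wheels"  # placeholder volume path


def upload_wheel() -> None:
    wheel = max(Path("dist").glob("*.whl"))  # last wheel in dist/ (lexicographic)
    client = WorkspaceClient()  # credentials resolved from the environment
    with wheel.open("rb") as fh:
        client.files.upload(f"{VOLUME_PATH}/{wheel.name}", fh, overwrite=True)


if __name__ == "__main__":
    upload_wheel()
```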
Add Databricks integration
Adds PySpark support for processing OpenElectricity data in Databricks environments.
What this enables
- Converting facility, market, and network responses to Spark DataFrames with automatic schema detection
- Running ETL workflows over OpenElectricity data in Databricks
- Keeping PySpark entirely optional: the SDK works without it installed
Usage in Databricks
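A hedged usage sketch assembled from this PR's description: `OEClient` and `to_pyspark()` are named in the changes, but the exact method signatures shown here are assumptions.

```python
# In a Databricks notebook, convert an SDK response to a Spark DataFrame.
from openelectricity import OEClient

client = OEClient()  # API key read from the environment

facilities = client.get_facilities()
df = facilities.to_pyspark()  # Spark DataFrame, or None if PySpark is absent

if df is not None:
    df.select("latitude", "longitude").show(5)
```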
Files added
- `openelectricity/pyspark_datasource.py` - Core PySpark integration
- `openelectricity/spark_utils.py` - Spark utilities
- `examples/databricks/` - Databricks examples and ETL workflows
- `examples/pyspark_simple.py` - Basic PySpark usage

Modified
- `openelectricity/client.py` - Added PySpark support
- `openelectricity/models/timeseries.py` - Added `to_pyspark()` methods
- `openelectricity/models/facilities.py` - Added PySpark integration
- `pyproject.toml` - Added PySpark as optional dependency

Notes
PySpark is completely optional; the SDK works without it installed.