@dgokeeffe

Add Databricks integration

Adds PySpark support for processing OpenElectricity data in Databricks environments.

What this enables

  • Convert API responses to PySpark DataFrames for Databricks processing
  • Run ETL workflows on electricity data in Databricks
  • Process large-scale electricity datasets using Spark

Usage in Databricks

from openelectricity import OEClient

client = OEClient()
facilities = client.get_facilities()

# Convert to PySpark DataFrame for Databricks
df = facilities.to_pyspark()
df.write.mode("overwrite").saveAsTable("facilities_data")
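
When PySpark isn't installed, the conversion methods should fail with a clear error rather than at import time. A minimal sketch of the guard pattern, assuming a module layout and error message that are hypothetical rather than the PR's actual code:

# Hypothetical sketch of the optional-import guard; the real module may differ.
try:
    from pyspark.sql import SparkSession
    HAS_PYSPARK = True
except ImportError:
    HAS_PYSPARK = False

def to_pyspark(records: list[dict]):
    # Raise a descriptive error when the optional dependency is absent.
    if not HAS_PYSPARK:
        raise ImportError("PySpark is required for to_pyspark(); install it as an optional extra.")
    spark = SparkSession.builder.getOrCreate()
    return spark.createDataFrame(records)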

Files added

  • openelectricity/pyspark_datasource.py - Core PySpark integration
  • openelectricity/spark_utils.py - Spark utilities
  • examples/databricks/ - Databricks examples and ETL workflows
  • examples/pyspark_simple.py - Basic PySpark usage
  • tests/ - PySpark test files

Modified

  • openelectricity/client.py - Added PySpark support
  • openelectricity/models/timeseries.py - Added to_pyspark() methods
  • openelectricity/models/facilities.py - Added PySpark integration
  • pyproject.toml - Added PySpark as optional dependency
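
Since PySpark ships as an optional dependency, it would be installed via an extra. A sketch, assuming the extra is named pyspark (the exact extra name isn't shown in this PR):

pip install "openelectricity[pyspark]"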

Notes

  • PySpark is optional - the SDK works without it installed
  • Fully backward compatible
  • Includes Databricks-specific examples and ETL workflows

- Add PySpark data source integration with automatic schema detection (see the schema sketch after this list)
- Implement to_pyspark() methods for all response types with graceful fallbacks
- Add comprehensive PySpark test suite covering facilities, market, and network data
- Include Databricks integration examples and ETL workflows
- Add performance optimization utilities and error handling
- Support for both local PySpark and Databricks environments
- PySpark is completely optional - SDK works without it installed
- Add examples demonstrating PySpark functionality and fallbacks
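
A minimal sketch of what automatic schema detection can look like, mapping Python value types to Spark SQL types; the type map and function name here are assumptions, not the PR's actual implementation:

# Hypothetical sketch of schema inference from a sample record.
from datetime import datetime
from pyspark.sql.types import (
    DoubleType, LongType, StringType, StructField, StructType, TimestampType
)

_TYPE_MAP = {
    str: StringType(),
    int: LongType(),
    float: DoubleType(),
    datetime: TimestampType(),
}

def infer_schema(record: dict) -> StructType:
    # Fall back to StringType for anything the map doesn't cover.
    return StructType([
        StructField(name, _TYPE_MAP.get(type(value), StringType()), nullable=True)
        for name, value in record.items()
    ])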
dgokeeffe changed the title from "feat: add PySpark integration for large-scale data processing" to "feat: add Databricks integration" on Sep 6, 2025
dgokeeffe and others added 10 commits September 6, 2025 14:18
- Add test_facilities_data.py: Comprehensive testing of facility data parsing and validation with real API responses
- Add test_market_metrics.py: Test market metrics functionality and API response handling
- Add test_sync_client.py: Complete test suite for synchronous OEClient implementation including error handling, session management, and API methods
- Add test_timezone_handling.py: Test timezone handling in PySpark DataFrame conversions
- Add tests/conftest.py: Centralized pytest fixtures for API keys, clients, and test configuration
- Add tests/README.md: Comprehensive documentation for test suite setup, running, and fixture usage
- Update pyproject.toml: Register custom pytest markers (slow, integration) to eliminate warnings

The test suite includes:
- Unit tests for client initialization and configuration
- Integration tests for API endpoints (facilities, market, network data)
- PySpark DataFrame conversion tests with timezone handling
- Error handling and edge case testing
- Proper fixture management with graceful skipping when dependencies unavailable
- Comprehensive documentation for test setup and execution

All tests pass with proper skipping for missing API keys or dependencies.
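
A sketch of the graceful-skipping fixtures described above; the environment variable and fixture names are assumptions rather than the PR's exact conftest.py:

import os
import pytest

@pytest.fixture
def api_key():
    # Skip API-dependent tests instead of failing when no key is configured.
    key = os.environ.get("OPENELECTRICITY_API_KEY")
    if not key:
        pytest.skip("OPENELECTRICITY_API_KEY not set")
    return key

@pytest.fixture
def spark():
    # Skip PySpark tests when the optional dependency isn't installed.
    pytest.importorskip("pyspark")
    from pyspark.sql import SparkSession
    return SparkSession.builder.master("local[1]").getOrCreate()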
@dgokeeffe
Author

Any chance we can merge this in, @nc9?

@dgokeeffe
Author

@nc9

I can resolve the merge conflicts, but after that, can we proceed with the merge, please?

- Remove pydantic-settings dependency
- Simplify conftest.py fixtures
- Add new examples and type exports
- Update to version 0.9.3
- Remove settings_schema.py module
- Fixed the _build_url method that was duplicating /v4 in endpoint URLs (see the sketch after this list)
- Resolves 404 errors when calling market API endpoints
- Added diagnostic logging to databricks_etl.py
- Corrected demand_energy unit label from GWh to MWh
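
A sketch of the kind of fix the _build_url commit describes: avoid emitting /v4 twice when the base URL already carries it. The base URL and implementation here are assumptions:

from urllib.parse import urljoin

BASE_URL = "https://api.openelectricity.org.au/v4/"

def _build_url(endpoint: str) -> str:
    # Normalise the endpoint so urljoin keeps the /v4 base path
    # and /v4 never appears twice in the final URL.
    endpoint = endpoint.lstrip("/")
    if endpoint.startswith("v4/"):
        endpoint = endpoint[len("v4/"):]
    return urljoin(BASE_URL, endpoint)
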
Extract location data from nested location object in facilities API
response to enable geospatial analysis.

Changes:
- Update to_records() and to_pyspark() to extract latitude/longitude
- Handle missing location data gracefully (None values)
- Add 9 tests for location extraction functionality
- Update existing tests to expect new columns

DataFrame output now includes latitude and longitude columns.
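
A sketch of the nested-location extraction with graceful None handling; the lat/lng key names inside the location object are assumed, not confirmed by this PR:

def flatten_location(facility: dict) -> dict:
    # A missing or null location yields None latitude/longitude rather than an error.
    location = facility.get("location") or {}
    flat = {k: v for k, v in facility.items() if k != "location"}
    flat["latitude"] = location.get("lat")
    flat["longitude"] = location.get("lng")
    return flat
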
Add `upload-databricks` target that builds and uploads the wheel to a
Unity Catalog volume using the upload_wheel_to_volume.py script.

Usage:
  make upload-databricks
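
A sketch of what the upload step could look like with the Databricks SDK; the volume path handling and script internals are assumptions, not the actual upload_wheel_to_volume.py:

from pathlib import Path
from databricks.sdk import WorkspaceClient

def upload_wheel(volume_path: str) -> None:
    # Pick the most recently built wheel from dist/ and push it to the volume.
    wheel = max(Path("dist").glob("*.whl"), key=lambda p: p.stat().st_mtime)
    client = WorkspaceClient()  # credentials resolved from the environment
    with wheel.open("rb") as f:
        client.files.upload(f"{volume_path}/{wheel.name}", f, overwrite=True)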