Update wide hypertable DDL #31
base: main
Conversation
Walkthrough
This change transitions the data ingestion and feature extraction pipeline from using the JSONB-based `preprocessed_features` hypertable to a new wide-format `preprocessed_wide` hypertable, where each feature has its own column.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant PreprocessScript
    participant Database
    participant FeatureExtractionScript
    User->>PreprocessScript: Run preprocessing for an era
    PreprocessScript->>Database: Create/initialize preprocessed_wide table (via SQL script)
    PreprocessScript->>Database: Insert wide-format features (save_wide_to_timescaledb)
    User->>FeatureExtractionScript: Run feature extraction
    FeatureExtractionScript->>Database: Query all columns from preprocessed_wide
    FeatureExtractionScript->>User: Output extracted features
```
Possibly related PRs
Poem
✨ Finishing Touches
Actionable comments posted: 2
🧹 Nitpick comments (3)
DataIngestion/feature_extraction/pre_process/report_for_preprocess/create_preprocessed_wide.sql (1)
16-21: Hypertable creation parameters look good
Using `if_not_exists => TRUE` avoids failures if the hypertable has already been created, and a 7-day `chunk_time_interval` is a reasonable default. You might consider adding a retention policy later (e.g., via `add_retention_policy`) if old data needs automated cleanup.
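If automated cleanup is wanted later, a minimal sketch of adding such a policy from Python might look like the following; the connection URL and the one-year cutoff are assumptions for illustration, not part of this PR.

```python
from sqlalchemy import create_engine, text

# Hypothetical connection URL; the real pipeline builds it from config values.
engine = create_engine("postgresql://user:password@localhost:5432/greenhouse")

with engine.begin() as conn:
    # TimescaleDB's add_retention_policy drops chunks older than the given interval.
    conn.execute(text(
        "SELECT add_retention_policy('preprocessed_wide', INTERVAL '365 days', if_not_exists => TRUE);"
    ))
```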
DataIngestion/feature_extraction/pre_process/prepare_era_data.py (2)
229-233: Verify engine creation error handling. The engine creation looks correct, but consider adding error handling around the engine creation since database connection failures could occur here.
```diff
-    era_to_process = config.get("common_settings", {}).get("default_era_to_process_for_script", "Era1")
-    conn_details = get_db_connection_details(config)
-    engine = create_engine(
-        f"postgresql://{conn_details['user']}:{conn_details['password']}@{conn_details['host']}:{conn_details['port']}/{conn_details['dbname']}"
-    )
+    try:
+        era_to_process = config.get("common_settings", {}).get("default_era_to_process_for_script", "Era1")
+        conn_details = get_db_connection_details(config)
+        engine = create_engine(
+            f"postgresql://{conn_details['user']}:{conn_details['password']}@{conn_details['host']}:{conn_details['port']}/{conn_details['dbname']}"
+        )
+    except Exception as e:
+        print(f"Error creating database engine: {e}")
+        return
```
294-295: Improve error handling for engine disposal. The finally block correctly ensures engine disposal, but should handle potential disposal errors and ensure the engine variable exists.
```diff
     finally:
-        engine.dispose()
+        if 'engine' in locals() and engine:
+            try:
+                engine.dispose()
+            except Exception as e:
+                print(f"Error disposing database engine: {e}")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- DataIngestion/docker-compose.yml (1 hunks)
- DataIngestion/feature_extraction/era_detection_rust/src/main.rs (1 hunks)
- DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py (1 hunks)
- DataIngestion/feature_extraction/feature-gpu/feature_classes/test.py (2 hunks)
- DataIngestion/feature_extraction/feature/extract_features.py (1 hunks)
- DataIngestion/feature_extraction/pre_process/create_preprocessed_wide.sql (1 hunks)
- DataIngestion/feature_extraction/pre_process/database_operations.py (1 hunks)
- DataIngestion/feature_extraction/pre_process/prepare_era_data.py (4 hunks)
- DataIngestion/feature_extraction/pre_process/preprocess.dockerfile (1 hunks)
- DataIngestion/feature_extraction/pre_process/preprocess.py (4 hunks)
- DataIngestion/feature_extraction/pre_process/report_for_preprocess/create_preprocessed_wide.sql (1 hunks)
- DataIngestion/feature_extraction/pre_process/requirements.txt (0 hunks)
💤 Files with no reviewable changes (1)
- DataIngestion/feature_extraction/pre_process/requirements.txt
🧰 Additional context used
🧬 Code Graph Analysis (2)
DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py (1)
DataIngestion/feature_extraction/feature-gpu/feature_classes/test.py (1)
fetch_data_to_pandas(47-56)
DataIngestion/feature_extraction/pre_process/preprocess.py (1)
DataIngestion/feature_extraction/pre_process/database_operations.py (2)
`verify_table_exists` (117-136)
`save_wide_to_timescaledb` (138-158)
🔇 Additional comments (21)
DataIngestion/feature_extraction/pre_process/report_for_preprocess/create_preprocessed_wide.sql (3)
5-7: Transaction and cleanup are correctly scoped
Wrapping the DROP and CREATE statements in a transaction ensures atomic schema updates and avoids partial states. Retaining `DROP TABLE IF EXISTS` simplifies iterative development and local experimentation.
9-14: Table definition has appropriate primary key
Defining `time TIMESTAMPTZ NOT NULL` and `era_identifier TEXT NOT NULL` with a composite primary key is exactly what's needed for uniqueness. The comment placeholder for sensor & engineered columns is clear for the Python script to fill in additional columns.
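As an illustration of how the pipeline could fill in those additional columns idempotently, a rough sketch follows; the column names, types, and connection URL are hypothetical, not taken from this PR.

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost:5432/greenhouse")  # assumed URL

# Hypothetical sensor/engineered columns discovered from the preprocessed DataFrame.
feature_columns = {"air_temp_c": "DOUBLE PRECISION", "co2_ppm": "DOUBLE PRECISION"}

with engine.begin() as conn:
    for name, sql_type in feature_columns.items():
        # IF NOT EXISTS keeps repeated runs from failing on already-added columns.
        conn.execute(text(
            f'ALTER TABLE preprocessed_wide ADD COLUMN IF NOT EXISTS "{name}" {sql_type};'
        ))
```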
23-23: Commit completes the DDL block
`COMMIT;` finalizes the transaction. Everything here aligns with best practices for TimescaleDB DDL scripts.
DataIngestion/feature_extraction/pre_process/preprocess.dockerfile (2)
33-33: New SQL script is correctly copied into the image
`COPY create_preprocessed_wide.sql /app/create_preprocessed_wide.sql` ensures the DDL for the wide table is available at runtime.
37-38: Updated container startup to initialize new hypertable
The `CMD` now runs `psql -f /app/create_preprocessed_wide.sql` before calling `python preprocess.py`, aligning the Docker startup with the new schema.
43-44: Default table name updated to new hypertable
Changing `default_value = "preprocessed_wide"` for `--db-table` aligns the Rust CLI's default with the new TimescaleDB schema. Remember to update any documentation or examples that reference the old table name.
57-57: Mount new wide-table init script in DB container
Mapping `create_preprocessed_wide.sql` into `/docker-entrypoint-initdb.d/02_create_preprocessed_wide.sql` ensures the database is seeded with the correct DDL on startup. The filename prefix `02_` guarantees the init order follows the existing scripts.
7-8: LGTM! Clean imports for the new database functionality. The imports are correctly added to support SQLAlchemy engine creation and the new wide table saving functionality.
281-282: LGTM! Correct integration of the new wide table saving. The call to `save_wide_to_timescaledb` correctly passes the processed DataFrame, era identifier, and engine, and appropriately logs the number of inserted rows.
64-64: LGTM! Updated docstring reflects the new table. The docstring correctly references the new `preprocessed_wide` table.
69-75: LGTM! Query updated for wide table format. The SQL query correctly targets the new `preprocessed_wide` table and uses `SELECT *`, which is appropriate for the wide format where each feature is a separate column.
🧰 Tools
🪛 Ruff (0.11.9)
69-75: f-string without any placeholders
Remove extraneous `f` prefix (F541)
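For context, F541 flags `f`-strings that contain no `{}` placeholders; the fix is simply to drop the prefix. The query text below is illustrative, not the actual lines 69-75.

```python
# Flagged by Ruff F541: the f-prefix does nothing without placeholders.
query = f"SELECT * FROM preprocessed_wide ORDER BY time"

# Equivalent, lint-clean version.
query = "SELECT * FROM preprocessed_wide ORDER BY time"
```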
86-86: LGTM! Simplified data processing for wide format. The direct conversion from pandas to cuDF is correct since JSON normalization is no longer needed with the wide table format, where features are already in separate columns.
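A minimal sketch of that direct hand-off, assuming a RAPIDS environment with `cudf` installed and illustrative column names:

```python
import cudf
import pandas as pd

# Wide-format rows as they might come back from preprocessed_wide (illustrative data).
pdf = pd.DataFrame({
    "time": pd.to_datetime(["2014-01-01 00:00", "2014-01-01 00:05"]),
    "era_identifier": ["Era1", "Era1"],
    "air_temp_c": [18.2, 18.4],
})

# No JSON normalization step is needed any more: each feature is already its own column.
gdf = cudf.DataFrame.from_pandas(pdf)
print(gdf.shape)
```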
DataIngestion/feature_extraction/pre_process/create_preprocessed_wide.sql (1)
5-23: LGTM! Well-structured SQL script for wide table creation. The script correctly:
- Uses transaction boundaries for atomicity
- Drops existing table to ensure clean recreation
- Creates appropriate schema with time and era_identifier columns
- Sets up TimescaleDB hypertable with reasonable 7-day chunk interval
- Uses composite primary key that makes sense for time-series data partitioned by era
The comment indicates this mirrors the previous JSONB hypertable behavior while supporting direct tsfresh querying, which aligns with the PR objectives.
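For reference, a minimal sketch with that structure is shown below, executed from Python for illustration; the exact statement text and connection URL are assumptions, not a copy of the file under review.

```python
from sqlalchemy import create_engine, text

# Statement-by-statement version of a DDL script with the structure described above.
# Column names beyond time/era_identifier are placeholders added later by the pipeline.
DDL_STATEMENTS = [
    "DROP TABLE IF EXISTS preprocessed_wide;",
    """
    CREATE TABLE preprocessed_wide (
        time            TIMESTAMPTZ NOT NULL,
        era_identifier  TEXT        NOT NULL,
        PRIMARY KEY (time, era_identifier)
    );
    """,
    """
    SELECT create_hypertable(
        'preprocessed_wide', 'time',
        chunk_time_interval => INTERVAL '7 days',
        if_not_exists       => TRUE
    );
    """,
]

engine = create_engine("postgresql://user:password@localhost:5432/greenhouse")  # assumed URL
with engine.begin() as conn:  # one transaction stands in for the script's BEGIN/COMMIT
    for stmt in DDL_STATEMENTS:
        conn.execute(text(stmt))
```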
DataIngestion/feature_extraction/pre_process/preprocess.py (4)
21-22: LGTM! Imports updated for new wide table functionality. The imports correctly add the new functions needed for table verification and wide table data saving.
69-69: LGTM! SQL script path updated for wide table creation. The path correctly references the new `create_preprocessed_wide.sql` script that creates the wide-format table.
137-138: LGTM! Table verification updated consistently. The verification correctly checks for the existence of the new `preprocessed_wide` table instead of the old `preprocessed_features` table.
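The project's `verify_table_exists` helper isn't reproduced here, but as a generic illustration of the same kind of check, SQLAlchemy's inspector can be used:

```python
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://user:password@localhost:5432/greenhouse")  # assumed URL

# True only if the wide table is already present in the connected database.
if inspect(engine).has_table("preprocessed_wide"):
    print("preprocessed_wide exists; safe to insert wide-format rows")
else:
    print("preprocessed_wide missing; run create_preprocessed_wide.sql first")
```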
503-505: LGTM! Proper integration of wide table data saving. The code correctly:
- Checks for table existence before attempting to save
- Uses the new `save_wide_to_timescaledb` function with appropriate parameters
- Logs the number of rows inserted for monitoring purposes
This maintains the same conditional saving pattern while using the new wide table format.
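A rough sketch of that conditional pattern is shown below; the helper signatures are assumed from how this review describes them (DataFrame, era identifier, and engine in; inserted row count out) and may not match the repository exactly.

```python
# Hypothetical wiring of the pattern described above; function signatures are
# assumed from the review text, not copied from the repository.
from database_operations import save_wide_to_timescaledb, verify_table_exists


def persist_wide_features(wide_df, era_identifier, engine, logger):
    # Skip the save entirely if the hypertable hasn't been created yet.
    if not verify_table_exists(engine, "preprocessed_wide"):
        logger.warning("preprocessed_wide not found; skipping database save")
        return 0
    rows_inserted = save_wide_to_timescaledb(wide_df, era_identifier, engine)
    logger.info("Inserted %d rows into preprocessed_wide for %s", rows_inserted, era_identifier)
    return rows_inserted
```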
DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py (3)
45-45: LGTM: Query updated for wide table format. The query change from selecting specific columns to `SELECT * FROM preprocessed_wide` is consistent with the migration to the new wide table format, eliminating the need for JSONB unnesting.
56-56: LGTM: Clear logging of data dimensions. The logging statement provides useful information about the loaded dataset dimensions.
58-58: LGTM: Pandas to cuDF conversion. The direct conversion from pandas DataFrame to cuDF is appropriate and simplified now that the data comes pre-structured from the wide table.
DataIngestion/feature_extraction/pre_process/database_operations.py (1)
138-159: LGTM: Well-implemented wide table saving function. The new `save_wide_to_timescaledb` function is well-designed for the migration to wide table format:
- Improved performance: Larger chunk size (10,000) should provide better insertion performance
- Better return value: Returns actual row count instead of boolean, providing more useful information
- Proper error handling: Gracefully handles empty DataFrames and exceptions
- Clean implementation: Directly saves wide format data without complex transformations
The function correctly adds the `era_identifier` column and uses appropriate `to_sql` parameters for TimescaleDB integration.
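Based only on the behaviour described above (adds `era_identifier`, uses `DataFrame.to_sql` with a 10,000-row chunk size, returns the inserted row count), one plausible shape for such a function is sketched here; the actual implementation in `database_operations.py` may differ.

```python
import pandas as pd
from sqlalchemy.engine import Engine


def save_wide_to_timescaledb_sketch(df: pd.DataFrame, era_identifier: str, engine: Engine) -> int:
    """Illustrative approximation of the behaviour the review describes."""
    if df is None or df.empty:
        return 0
    try:
        out = df.copy()
        out["era_identifier"] = era_identifier   # tag every row with its era
        out.to_sql(
            "preprocessed_wide",
            engine,
            if_exists="append",                  # table itself is created by the DDL script
            index=False,
            chunksize=10_000,                    # larger chunks for faster inserts
            method="multi",                      # multi-row INSERT statements
        )
        return len(out)
    except Exception as exc:
        print(f"Error saving wide data to TimescaleDB: {exc}")
        return 0
```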
In DataIngestion/feature_extraction/feature/extract_features.py (excerpt from the updated docstring):

```
This script connects to the TimescaleDB instance, reads the `preprocessed_wide` hypertable
(written by the preprocessing pipeline) where each feature already has its own column,
converts it to the long format expected by *tsfresh*, and finally extracts a rich
```
Docstring describes DB ingestion but code reads from JSON
The updated docstring states that this script “connects to the TimescaleDB instance” and “reads the `preprocessed_wide` hypertable,” yet the implementation still reads from `config.CONSOLIDATED_DATA_FILE_PATH` (a JSON file). This mismatch could confuse users and indicates the code wasn’t updated to actually query the new table.
To align behavior with the docstring, consider replacing the JSON read with a SQL query, for example:
```diff
- consolidated_df = original_pandas.read_json(
-     consolidated_data_file_path, lines=True, orient="records"
- )
+ # Establish a DB connection and fetch from preprocessed_wide
+ connector = SQLAlchemyPostgresConnector(
+     user=DB_USER, password=DB_PASSWORD,
+     host=DB_HOST, port=DB_PORT, db_name=DB_NAME,
+ )
+ consolidated_df = original_pandas.read_sql_table(
+     "preprocessed_wide", connector.engine
+ )
```
Please update the code or adjust the docstring to ensure consistency.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In DataIngestion/feature_extraction/feature/extract_features.py around lines 3
to 5, the docstring claims the script connects to TimescaleDB and reads from the
preprocessed_wide hypertable, but the code still reads from a JSON file at
config.CONSOLIDATED_DATA_FILE_PATH. To fix this inconsistency, either update the
code to replace the JSON file reading with a SQL query that fetches data from
the TimescaleDB preprocessed_wide hypertable, or modify the docstring to
accurately reflect that the data source is still the JSON file. Ensure the data
loading method and the docstring description match exactly.
@claude help me with this
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
In DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py (current code under review):

```python
pdf = connector.fetch_data_to_pandas(base_q if params is None
                                     else connector.engine.execute(base_q, params))
```
Fix inconsistent database query execution.
The current code has inconsistent method usage that will cause runtime errors. The `connector.engine.execute()` method is being called incorrectly, and the logic mixing two different execution patterns is problematic.
Apply this diff to fix the query execution:
```diff
- pdf = connector.fetch_data_to_pandas(base_q if params is None
-                                      else connector.engine.execute(base_q, params))
+ pdf = connector.fetch_data_to_pandas(base_q, params)
```
The `fetch_data_to_pandas` method should handle both cases (with and without parameters) internally, as shown in the relevant code snippet from test.py.
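For illustration, a helper of that shape could look roughly like this; the class below is a hypothetical stand-in, not the actual connector in test.py.

```python
from typing import Optional

import pandas as pd
from sqlalchemy import text


class SQLAlchemyPostgresConnectorSketch:
    """Hypothetical stand-in for the project's connector class."""

    def __init__(self, engine):
        self.engine = engine

    def fetch_data_to_pandas(self, query: str, params: Optional[dict] = None) -> pd.DataFrame:
        # pandas.read_sql handles both the parameterless and parameterised cases,
        # so callers never need to touch engine.execute() directly.
        with self.engine.connect() as conn:
            return pd.read_sql(text(query), conn, params=params)
```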
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
    pdf = connector.fetch_data_to_pandas(base_q, params)
```
🤖 Prompt for AI Agents
In DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py around
lines 51 to 52, the code incorrectly mixes direct query strings and executed
results when calling connector.fetch_data_to_pandas, causing runtime errors. To
fix this, always pass the raw query string and parameters separately to
fetch_data_to_pandas, letting it handle execution internally. Remove the call to
connector.engine.execute and instead call fetch_data_to_pandas with the query
and params arguments consistently.
Summary
Testing
`python -m py_compile DataIngestion/feature_extraction/pre_process/preprocess.py DataIngestion/feature_extraction/pre_process/database_operations.py DataIngestion/feature_extraction/pre_process/prepare_era_data.py DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py DataIngestion/feature_extraction/feature-gpu/feature_classes/test.py DataIngestion/feature_extraction/feature/extract_features.py`

Summary by CodeRabbit
New Features
Bug Fixes
Chores