Conversation

@Fnux8890 (Owner) commented May 23, 2025

Summary

  • Ensure the new preprocessed_wide DDL retains the DROP/COMMIT wrapper and the chunk interval
  • Mirror the same script for the local reports copy

Testing

  • python -m py_compile DataIngestion/feature_extraction/pre_process/preprocess.py DataIngestion/feature_extraction/pre_process/database_operations.py DataIngestion/feature_extraction/pre_process/prepare_era_data.py DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py DataIngestion/feature_extraction/feature-gpu/feature_classes/test.py DataIngestion/feature_extraction/feature/extract_features.py

Summary by CodeRabbit

  • New Features

    • Introduced a new database table for storing preprocessed data in a wide format, enabling more efficient access and processing.
    • Added new scripts to create and manage the wide-format table in the database.
  • Bug Fixes

    • Simplified and streamlined the data fetching and preprocessing pipeline for improved reliability and maintainability.
  • Chores

    • Updated configuration and documentation to reflect the new wide-format table.
    • Removed unused dependencies related to Parquet file support.

@coderabbitai (bot) commented May 23, 2025

Walkthrough

This change transitions the data ingestion and feature extraction pipeline from using the preprocessed_features table (with a JSONB column for features) to a new preprocessed_wide table where each feature is a separate column. All related scripts, Docker configurations, SQL initialization, and data processing logic are updated accordingly.
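
To make the shape change concrete, here is a small hedged sketch contrasting the two access patterns. The connection string, the JSONB column name, and the feature column are illustrative assumptions, not values taken from the repository.

    # Illustrative contrast between the old JSONB layout and the new wide layout.
    # The feature name "air_temp_c" and the JSONB column name "features" are placeholders.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@db:5432/postgres")  # placeholder DSN

    # Old layout: one JSONB column; every value has to be extracted with ->> and cast
    # before downstream code (e.g. tsfresh) can use it.
    old_query = """
    SELECT time, era_identifier,
           (features ->> 'air_temp_c')::double precision AS air_temp_c
    FROM preprocessed_features
    """

    # New layout: each feature is already a typed column, so a plain SELECT suffices.
    new_query = "SELECT time, era_identifier, air_temp_c FROM preprocessed_wide"

    df = pd.read_sql(new_query, engine)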

Changes

  • DataIngestion/docker-compose.yml, DataIngestion/feature_extraction/pre_process/preprocess.dockerfile: Updated SQL initialization script from create_preprocessed_hypertable.sql to create_preprocessed_wide.sql.
  • DataIngestion/feature_extraction/pre_process/create_preprocessed_wide.sql, DataIngestion/feature_extraction/pre_process/report_for_preprocess/create_preprocessed_wide.sql: Added SQL scripts to create the preprocessed_wide hypertable with time and era_identifier as primary keys.
  • DataIngestion/feature_extraction/era_detection_rust/src/main.rs: Changed default CLI argument for db_table from preprocessed_features to preprocessed_wide.
  • DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py, DataIngestion/feature_extraction/feature-gpu/feature_classes/test.py: Modified SQL queries and data handling to use the preprocessed_wide table; removed JSON normalization logic.
  • DataIngestion/feature_extraction/feature/extract_features.py: Updated docstring to reflect use of the preprocessed_wide table with wide-format features.
  • DataIngestion/feature_extraction/pre_process/database_operations.py: Replaced save_to_timescaledb with save_wide_to_timescaledb for direct wide-table insertion; removed JSON logic.
  • DataIngestion/feature_extraction/pre_process/prepare_era_data.py: Added DB connection and saving via save_wide_to_timescaledb in the preprocessing workflow.
  • DataIngestion/feature_extraction/pre_process/preprocess.py: Updated SQL script path, table checks, and data saving logic to use preprocessed_wide and the new save function.
  • DataIngestion/feature_extraction/pre_process/requirements.txt: Removed pyarrow and fastparquet dependencies.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant PreprocessScript
    participant Database
    participant FeatureExtractionScript

    User->>PreprocessScript: Run preprocessing for an era
    PreprocessScript->>Database: Create/initialize preprocessed_wide table (via SQL script)
    PreprocessScript->>Database: Insert wide-format features (save_wide_to_timescaledb)
    User->>FeatureExtractionScript: Run feature extraction
    FeatureExtractionScript->>Database: Query all columns from preprocessed_wide
    FeatureExtractionScript->>User: Output extracted features

Possibly related PRs

  • Fnux8890/Proactive-thesis#24: Refactors feature extraction to directly query and process the wide-format table, closely related to this PR's transition to preprocessed_wide.

Poem

A table once narrow, now stretches out wide,
With features uncurled, no JSON to hide.
Scripts and SQL in harmony stride,
Data flows swiftly, no need to divide.
The rabbit hops on, with columns in view—
Preprocessed and ready, for something new! 🐇


@Fnux8890 marked this pull request as ready for review May 24, 2025 10:03

@coderabbitai (bot) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
DataIngestion/feature_extraction/pre_process/report_for_preprocess/create_preprocessed_wide.sql (1)

16-21: Hypertable creation parameters look good
Using if_not_exists => TRUE avoids failures if the hypertable has already been created, and a 7-day chunk_time_interval is a reasonable default. You might consider adding a retention policy later (e.g., via add_retention_policy) if old data needs automated cleanup.
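
As a concrete illustration of that suggestion, a minimal sketch of wiring up such a retention policy from Python follows; the connection parameters and the 90-day interval are placeholders, not values from this PR.

    # Hedged sketch: attach a TimescaleDB retention policy to preprocessed_wide.
    # Connection parameters and the retention interval are illustrative only.
    import psycopg2

    conn = psycopg2.connect(host="db", port=5432, dbname="postgres",
                            user="postgres", password="postgres")
    conn.autocommit = True
    with conn.cursor() as cur:
        # Automatically drop chunks older than 90 days (interval chosen for illustration).
        cur.execute(
            "SELECT add_retention_policy('preprocessed_wide', INTERVAL '90 days', "
            "if_not_exists => TRUE);"
        )
    conn.close()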

DataIngestion/feature_extraction/pre_process/prepare_era_data.py (2)

229-233: Verify engine creation error handling.

The engine creation looks correct, but consider adding error handling around the engine creation since database connection failures could occur here.

+    try:
        era_to_process = config.get("common_settings", {}).get("default_era_to_process_for_script", "Era1")
        conn_details = get_db_connection_details(config)
        engine = create_engine(
            f"postgresql://{conn_details['user']}:{conn_details['password']}@{conn_details['host']}:{conn_details['port']}/{conn_details['dbname']}"
        )
+    except Exception as e:
+        print(f"Error creating database engine: {e}")
+        return

294-295: Improve error handling for engine disposal.

The finally block correctly ensures engine disposal, but should handle potential disposal errors and ensure the engine variable exists.

    finally:
-        engine.dispose()
+        if 'engine' in locals() and engine:
+            try:
+                engine.dispose()
+            except Exception as e:
+                print(f"Error disposing database engine: {e}")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8e647e8 and aa52415.

📒 Files selected for processing (12)
  • DataIngestion/docker-compose.yml (1 hunks)
  • DataIngestion/feature_extraction/era_detection_rust/src/main.rs (1 hunks)
  • DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py (1 hunks)
  • DataIngestion/feature_extraction/feature-gpu/feature_classes/test.py (2 hunks)
  • DataIngestion/feature_extraction/feature/extract_features.py (1 hunks)
  • DataIngestion/feature_extraction/pre_process/create_preprocessed_wide.sql (1 hunks)
  • DataIngestion/feature_extraction/pre_process/database_operations.py (1 hunks)
  • DataIngestion/feature_extraction/pre_process/prepare_era_data.py (4 hunks)
  • DataIngestion/feature_extraction/pre_process/preprocess.dockerfile (1 hunks)
  • DataIngestion/feature_extraction/pre_process/preprocess.py (4 hunks)
  • DataIngestion/feature_extraction/pre_process/report_for_preprocess/create_preprocessed_wide.sql (1 hunks)
  • DataIngestion/feature_extraction/pre_process/requirements.txt (0 hunks)
💤 Files with no reviewable changes (1)
  • DataIngestion/feature_extraction/pre_process/requirements.txt
🧰 Additional context used
🧬 Code Graph Analysis (2)
DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py (1)
DataIngestion/feature_extraction/feature-gpu/feature_classes/test.py (1)
  • fetch_data_to_pandas (47-56)
DataIngestion/feature_extraction/pre_process/preprocess.py (1)
DataIngestion/feature_extraction/pre_process/database_operations.py (2)
  • verify_table_exists (117-136)
  • save_wide_to_timescaledb (138-158)
🔇 Additional comments (21)
DataIngestion/feature_extraction/pre_process/report_for_preprocess/create_preprocessed_wide.sql (3)

5-7: Transaction and cleanup are correctly scoped
Wrapping the DROP and CREATE statements in a transaction ensures atomic schema updates and avoids partial states. Retaining DROP TABLE IF EXISTS simplifies iterative development and local experimentation.


9-14: Table definition has appropriate primary key
Defining time TIMESTAMPTZ NOT NULL and era_identifier TEXT NOT NULL with a composite primary key is exactly what's needed for uniqueness. The comment placeholder for sensor & engineered columns is clear for the Python script to fill in additional columns.


23-23: Commit completes the DDL block
COMMIT; finalizes the transaction. Everything here aligns with best practices for TimescaleDB DDL scripts.
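
For orientation, here is a minimal sketch of the DDL pattern the three comments above describe, driven from Python. It is not the file from this PR: the sensor columns and the DSN are invented placeholders, and the transaction that the standalone script expresses with BEGIN/COMMIT is supplied here by engine.begin() instead.

    # Sketch of the DROP / CREATE / create_hypertable pattern, under assumed names.
    from sqlalchemy import create_engine, text

    DDL = """
    DROP TABLE IF EXISTS preprocessed_wide;
    CREATE TABLE preprocessed_wide (
        time           TIMESTAMPTZ NOT NULL,
        era_identifier TEXT        NOT NULL,
        air_temp_c     DOUBLE PRECISION,   -- placeholder sensor column
        co2_ppm        DOUBLE PRECISION,   -- placeholder sensor column
        PRIMARY KEY (time, era_identifier)
    );
    SELECT create_hypertable('preprocessed_wide', 'time',
                             chunk_time_interval => INTERVAL '7 days',
                             if_not_exists => TRUE);
    """

    engine = create_engine("postgresql://user:pass@db:5432/postgres")  # placeholder DSN
    with engine.begin() as conn:      # provides the transaction wrapper
        conn.execute(text(DDL))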

DataIngestion/feature_extraction/pre_process/preprocess.dockerfile (2)

33-33: New SQL script is correctly copied into the image
COPY create_preprocessed_wide.sql /app/create_preprocessed_wide.sql ensures the DDL for the wide table is available at runtime.


37-38: Updated container startup to initialize new hypertable
The CMD now runs psql -f /app/create_preprocessed_wide.sql before calling python preprocess.py, aligning the docker startup with the new schema.

DataIngestion/feature_extraction/era_detection_rust/src/main.rs (1)

43-44: Default table name updated to new hypertable
Changing default_value = "preprocessed_wide" for --db-table aligns the Rust CLI’s default with the new TimescaleDB schema. Remember to update any documentation or examples that reference the old table name.

DataIngestion/docker-compose.yml (1)

57-57: Mount new wide-table init script in DB container
Mapping create_preprocessed_wide.sql into /docker-entrypoint-initdb.d/02_create_preprocessed_wide.sql ensures the database is seeded with the correct DDL on startup. The filename prefix 02_ guarantees the init order follows the existing scripts.

DataIngestion/feature_extraction/pre_process/prepare_era_data.py (2)

7-8: LGTM! Clean imports for the new database functionality.

The imports are correctly added to support SQLAlchemy engine creation and the new wide table saving functionality.


281-282: LGTM! Correct integration of the new wide table saving.

The call to save_wide_to_timescaledb correctly passes the processed DataFrame, era identifier, and engine, and appropriately logs the number of inserted rows.

DataIngestion/feature_extraction/feature-gpu/feature_classes/test.py (3)

64-64: LGTM! Updated docstring reflects the new table.

The docstring correctly references the new preprocessed_wide table.


69-75: LGTM! Query updated for wide table format.

The SQL query correctly targets the new preprocessed_wide table and uses SELECT * which is appropriate for the wide format where each feature is a separate column.

🧰 Tools
🪛 Ruff (0.11.9)

69-75: f-string without any placeholders

Remove extraneous f prefix

(F541)


86-86: LGTM! Simplified data processing for wide format.

The direct conversion from pandas to cuDF is correct since JSON normalization is no longer needed with the wide table format where features are already in separate columns.
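
A hedged sketch of this fetch-and-convert path is shown below; the real helper is fetch_data_to_pandas in feature_classes/test.py, whose exact signature may differ, and the DSN is a placeholder.

    # Sketch only: load the wide table into pandas, then hand it to cuDF unchanged.
    import pandas as pd
    import cudf
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@db:5432/postgres")  # placeholder DSN

    # No f-string needed here (cf. the Ruff F541 note): the query has no placeholders.
    query = "SELECT * FROM preprocessed_wide ORDER BY time"
    pdf = pd.read_sql(query, engine)

    # Each feature is already its own column, so no JSON normalization step remains;
    # the frame can go straight to the GPU for feature extraction.
    gdf = cudf.from_pandas(pdf)
    print(f"Loaded {len(gdf)} rows and {len(gdf.columns)} columns")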

DataIngestion/feature_extraction/pre_process/create_preprocessed_wide.sql (1)

5-23: LGTM! Well-structured SQL script for wide table creation.

The script correctly:

  • Uses transaction boundaries for atomicity
  • Drops existing table to ensure clean recreation
  • Creates appropriate schema with time and era_identifier columns
  • Sets up TimescaleDB hypertable with reasonable 7-day chunk interval
  • Uses composite primary key that makes sense for time-series data partitioned by era

The comment indicates this mirrors the previous JSONB hypertable behavior while supporting direct tsfresh querying, which aligns with the PR objectives.

DataIngestion/feature_extraction/pre_process/preprocess.py (4)

21-22: LGTM! Imports updated for new wide table functionality.

The imports correctly add the new functions needed for table verification and wide table data saving.


69-69: LGTM! SQL script path updated for wide table creation.

The path correctly references the new create_preprocessed_wide.sql script that creates the wide format table.


137-138: LGTM! Table verification updated consistently.

The verification correctly checks for the existence of the new preprocessed_wide table instead of the old preprocessed_features table.


503-505: LGTM! Proper integration of wide table data saving.

The code correctly:

  • Checks for table existence before attempting to save
  • Uses the new save_wide_to_timescaledb function with appropriate parameters
  • Logs the number of rows inserted for monitoring purposes

This maintains the same conditional saving pattern while using the new wide table format.

DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py (3)

45-45: LGTM: Query updated for wide table format.

The query change from selecting specific columns to SELECT * FROM preprocessed_wide is consistent with the migration to the new wide table format, eliminating the need for JSONB unnesting.


56-56: LGTM: Clear logging of data dimensions.

The logging statement provides useful information about the loaded dataset dimensions.


58-58: LGTM: Pandas to cuDF conversion.

The direct conversion from pandas DataFrame to cuDF is appropriate and simplified now that the data comes pre-structured from the wide table.

DataIngestion/feature_extraction/pre_process/database_operations.py (1)

138-159: LGTM: Well-implemented wide table saving function.

The new save_wide_to_timescaledb function is well-designed for the migration to wide table format:

  • Improved performance: Larger chunk size (10,000) should provide better insertion performance
  • Better return value: Returns actual row count instead of boolean, providing more useful information
  • Proper error handling: Gracefully handles empty DataFrames and exceptions
  • Clean implementation: Directly saves wide format data without complex transformations

The function correctly adds the era_identifier column and uses appropriate to_sql parameters for TimescaleDB integration.
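
To illustrate the behaviour described above, here is a minimal sketch of such a function; it approximates the review's description (empty-frame guard, era_identifier tagging, chunked to_sql append, row count returned) and is not the code from database_operations.py.

    # Sketch only: approximates the described behaviour, not the PR's implementation.
    import pandas as pd
    from sqlalchemy.engine import Engine


    def save_wide_to_timescaledb(df: pd.DataFrame, era_identifier: str, engine: Engine) -> int:
        """Append a wide-format frame to preprocessed_wide and return the row count."""
        if df is None or df.empty:
            return 0
        out = df.copy()
        out["era_identifier"] = era_identifier      # tag every row with its era
        try:
            out.to_sql(
                "preprocessed_wide",
                engine,
                if_exists="append",
                index=False,
                chunksize=10_000,                   # larger chunks for faster inserts
            )
            return len(out)
        except Exception as exc:
            print(f"Failed to write preprocessed_wide: {exc}")
            return 0

On the caller side (preprocess.py, prepare_era_data.py) the same sketch would be used behind a table-existence check, with the returned row count logged, matching the guard-then-save pattern noted earlier in this review.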

Comment on lines +3 to +5
This script connects to the TimescaleDB instance, reads the `preprocessed_wide` hypertable
(written by the preprocessing pipeline) where each feature already has its own column,
converts it to the long format expected by *tsfresh*, and finally extracts a rich

@coderabbitai (bot) commented May 24, 2025

⚠️ Potential issue

Docstring describes DB ingestion but code reads from JSON
The updated docstring states that this script “connects to the TimescaleDB instance” and “reads the preprocessed_wide hypertable,” yet the implementation still reads from config.CONSOLIDATED_DATA_FILE_PATH (a JSON file). This mismatch could confuse users and indicates the code wasn’t updated to actually query the new table.

To align behavior with the docstring, consider replacing the JSON read with a SQL query, for example:

-    consolidated_df = original_pandas.read_json(
-        consolidated_data_file_path, lines=True, orient="records"
-    )
+    # Establish a DB connection and fetch from preprocessed_wide
+    connector = SQLAlchemyPostgresConnector(
+        user=DB_USER, password=DB_PASSWORD,
+        host=DB_HOST, port=DB_PORT, db_name=DB_NAME,
+    )
+    consolidated_df = original_pandas.read_sql_table(
+        "preprocessed_wide", connector.engine
+    )

Please update the code or adjust the docstring to ensure consistency.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In DataIngestion/feature_extraction/feature/extract_features.py around lines 3
to 5, the docstring claims the script connects to TimescaleDB and reads from the
preprocessed_wide hypertable, but the code still reads from a JSON file at
config.CONSOLIDATED_DATA_FILE_PATH. To fix this inconsistency, either update the
code to replace the JSON file reading with a SQL query that fetches data from
the TimescaleDB preprocessed_wide hypertable, or modify the docstring to
accurately reflect that the data source is still the JSON file. Ensure the data
loading method and the docstring description match exactly.

@Fnux8890 (Owner, Author) commented

@claude help me with this

@coderabbitai (bot) commented

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

@claude (bot) commented May 24, 2025

Claude encountered an error — View job

I'll analyze this and get back to you.

Comment on lines 51 to +52

     pdf = connector.fetch_data_to_pandas(base_q if params is None
-                                         else connector.engine.execute(base_q,
-                                                                       params))
+                                         else connector.engine.execute(base_q, params))

@coderabbitai (bot) commented

⚠️ Potential issue

Fix inconsistent database query execution.

The current code has inconsistent method usage that will cause runtime errors. The connector.engine.execute() method is being called incorrectly, and the logic mixing two different execution patterns is problematic.

Apply this diff to fix the query execution:

-    pdf = connector.fetch_data_to_pandas(base_q if params is None
-                                         else connector.engine.execute(base_q, params))
+    pdf = connector.fetch_data_to_pandas(base_q, params)

The fetch_data_to_pandas method should handle both cases (with and without parameters) internally, as shown in the relevant code snippet from test.py.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    pdf = connector.fetch_data_to_pandas(base_q if params is None
-                                         else connector.engine.execute(base_q, params))
+    pdf = connector.fetch_data_to_pandas(base_q, params)
🤖 Prompt for AI Agents
In DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py around
lines 51 to 52, the code incorrectly mixes direct query strings and executed
results when calling connector.fetch_data_to_pandas, causing runtime errors. To
fix this, always pass the raw query string and parameters separately to
fetch_data_to_pandas, letting it handle execution internally. Remove the call to
connector.engine.execute and instead call fetch_data_to_pandas with the query
and params arguments consistently.
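
To show the calling convention this fix relies on, here is a hedged sketch of a connector whose fetch_data_to_pandas accepts an optional params mapping and handles execution internally. It is an assumption about the helper's shape, not the code from test.py, and the DSN and query are placeholders.

    # Sketch of a connector helper matching the suggested one-line call.
    import pandas as pd
    from sqlalchemy import create_engine, text


    class SQLAlchemyPostgresConnector:
        def __init__(self, dsn: str):
            self.engine = create_engine(dsn)

        def fetch_data_to_pandas(self, query: str, params: dict | None = None) -> pd.DataFrame:
            # pandas handles both the parameterless and the parameterised case,
            # so callers never pre-execute the query themselves.
            return pd.read_sql(text(query), self.engine, params=params)


    # Usage matching the suggestion above:
    connector = SQLAlchemyPostgresConnector("postgresql://user:pass@db:5432/postgres")
    pdf = connector.fetch_data_to_pandas(
        "SELECT * FROM preprocessed_wide WHERE era_identifier = :era",
        {"era": "Era1"},
    )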
