Conversation

@Fnux8890 (Owner) commented May 23, 2025

Summary

  • Ensure the new preprocessed_wide DDL retains the DROP/COMMIT wrapper and the chunk interval
  • Mirror the same script for the local reports copy

Testing

  • python -m py_compile DataIngestion/feature_extraction/pre_process/preprocess.py DataIngestion/feature_extraction/pre_process/database_operations.py DataIngestion/feature_extraction/pre_process/prepare_era_data.py DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py DataIngestion/feature_extraction/feature-gpu/feature_classes/test.py DataIngestion/feature_extraction/feature/extract_features.py

Summary by CodeRabbit

  • New Features

    • Introduced a new database table for storing preprocessed data in a wide format, enabling more efficient access and processing.
    • Added new scripts to create and manage the wide-format table in the database.
  • Bug Fixes

    • Simplified and streamlined the data fetching and preprocessing pipeline for improved reliability and maintainability.
  • Chores

    • Updated configuration and documentation to reflect the new wide-format table.
    • Removed unused dependencies related to Parquet file support.

@coderabbitai (bot) commented May 23, 2025

Walkthrough

This change transitions the data ingestion and feature extraction pipeline from using the preprocessed_features table (with a JSONB column for features) to a new preprocessed_wide table where each feature is a separate column. All related scripts, Docker configurations, SQL initialization, and data processing logic are updated accordingly.
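
To make the shape change concrete, here is a small hedged sketch contrasting the two access patterns. The connection string, the JSONB column name, and the feature column are illustrative assumptions, not values taken from the repository.

    # Illustrative contrast between the old JSONB layout and the new wide layout.
    # The feature name "air_temp_c" and the JSONB column name "features" are placeholders.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@db:5432/postgres")  # placeholder DSN

    # Old layout: one JSONB column; every value has to be extracted with ->> and cast
    # before downstream code (e.g. tsfresh) can use it.
    old_query = """
    SELECT time, era_identifier,
           (features ->> 'air_temp_c')::double precision AS air_temp_c
    FROM preprocessed_features
    """

    # New layout: each feature is already a typed column, so a plain SELECT suffices.
    new_query = "SELECT time, era_identifier, air_temp_c FROM preprocessed_wide"

    df = pd.read_sql(new_query, engine)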

Changes

  • DataIngestion/docker-compose.yml, DataIngestion/feature_extraction/pre_process/preprocess.dockerfile: Updated SQL initialization script from create_preprocessed_hypertable.sql to create_preprocessed_wide.sql.
  • DataIngestion/feature_extraction/pre_process/create_preprocessed_wide.sql, DataIngestion/feature_extraction/pre_process/report_for_preprocess/create_preprocessed_wide.sql: Added SQL scripts to create the preprocessed_wide hypertable with time and era_identifier as primary keys.
  • DataIngestion/feature_extraction/era_detection_rust/src/main.rs: Changed default CLI argument for db_table from preprocessed_features to preprocessed_wide.
  • DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py, DataIngestion/feature_extraction/feature-gpu/feature_classes/test.py: Modified SQL queries and data handling to use the preprocessed_wide table; removed JSON normalization logic.
  • DataIngestion/feature_extraction/feature/extract_features.py: Updated docstring to reflect use of the preprocessed_wide table with wide-format features.
  • DataIngestion/feature_extraction/pre_process/database_operations.py: Replaced save_to_timescaledb with save_wide_to_timescaledb for direct wide-table insertion; removed JSON logic.
  • DataIngestion/feature_extraction/pre_process/prepare_era_data.py: Added DB connection and saving via save_wide_to_timescaledb in the preprocessing workflow.
  • DataIngestion/feature_extraction/pre_process/preprocess.py: Updated SQL script path, table checks, and data saving logic to use preprocessed_wide and the new save function.
  • DataIngestion/feature_extraction/pre_process/requirements.txt: Removed pyarrow and fastparquet dependencies.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant PreprocessScript
    participant Database
    participant FeatureExtractionScript

    User->>PreprocessScript: Run preprocessing for an era
    PreprocessScript->>Database: Create/initialize preprocessed_wide table (via SQL script)
    PreprocessScript->>Database: Insert wide-format features (save_wide_to_timescaledb)
    User->>FeatureExtractionScript: Run feature extraction
    FeatureExtractionScript->>Database: Query all columns from preprocessed_wide
    FeatureExtractionScript->>User: Output extracted features

Possibly related PRs

  • Fnux8890/Proactive-thesis#24: Refactors feature extraction to directly query and process the wide-format table, closely related to this PR's transition to preprocessed_wide.

Poem

A table once narrow, now stretches out wide,
With features uncurled, no JSON to hide.
Scripts and SQL in harmony stride,
Data flows swiftly, no need to divide.
The rabbit hops on, with columns in view—
Preprocessed and ready, for something new! 🐇


@Fnux8890 marked this pull request as ready for review May 24, 2025 10:03

@coderabbitai (bot) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
DataIngestion/feature_extraction/pre_process/report_for_preprocess/create_preprocessed_wide.sql (1)

16-21: Hypertable creation parameters look good
Using if_not_exists => TRUE avoids failures if the hypertable has already been created, and a 7-day chunk_time_interval is a reasonable default. You might consider adding a retention policy later (e.g., via add_retention_policy) if old data needs automated cleanup.
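
As a concrete illustration of that suggestion, a minimal sketch of wiring up such a retention policy from Python follows; the connection parameters and the 90-day interval are placeholders, not values from this PR.

    # Hedged sketch: attach a TimescaleDB retention policy to preprocessed_wide.
    # Connection parameters and the retention interval are illustrative only.
    import psycopg2

    conn = psycopg2.connect(host="db", port=5432, dbname="postgres",
                            user="postgres", password="postgres")
    conn.autocommit = True
    with conn.cursor() as cur:
        # Automatically drop chunks older than 90 days (interval chosen for illustration).
        cur.execute(
            "SELECT add_retention_policy('preprocessed_wide', INTERVAL '90 days', "
            "if_not_exists => TRUE);"
        )
    conn.close()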

DataIngestion/feature_extraction/pre_process/prepare_era_data.py (2)

229-233: Verify engine creation error handling.

The engine creation looks correct, but consider adding error handling around the engine creation since database connection failures could occur here.

+    try:
        era_to_process = config.get("common_settings", {}).get("default_era_to_process_for_script", "Era1")
        conn_details = get_db_connection_details(config)
        engine = create_engine(
            f"postgresql://{conn_details['user']}:{conn_details['password']}@{conn_details['host']}:{conn_details['port']}/{conn_details['dbname']}"
        )
+    except Exception as e:
+        print(f"Error creating database engine: {e}")
+        return

294-295: Improve error handling for engine disposal.

The finally block correctly ensures engine disposal, but should handle potential disposal errors and ensure the engine variable exists.

    finally:
-        engine.dispose()
+        if 'engine' in locals() and engine:
+            try:
+                engine.dispose()
+            except Exception as e:
+                print(f"Error disposing database engine: {e}")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8e647e8 and aa52415.

📒 Files selected for processing (12)
  • DataIngestion/docker-compose.yml (1 hunks)
  • DataIngestion/feature_extraction/era_detection_rust/src/main.rs (1 hunks)
  • DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py (1 hunks)
  • DataIngestion/feature_extraction/feature-gpu/feature_classes/test.py (2 hunks)
  • DataIngestion/feature_extraction/feature/extract_features.py (1 hunks)
  • DataIngestion/feature_extraction/pre_process/create_preprocessed_wide.sql (1 hunks)
  • DataIngestion/feature_extraction/pre_process/database_operations.py (1 hunks)
  • DataIngestion/feature_extraction/pre_process/prepare_era_data.py (4 hunks)
  • DataIngestion/feature_extraction/pre_process/preprocess.dockerfile (1 hunks)
  • DataIngestion/feature_extraction/pre_process/preprocess.py (4 hunks)
  • DataIngestion/feature_extraction/pre_process/report_for_preprocess/create_preprocessed_wide.sql (1 hunks)
  • DataIngestion/feature_extraction/pre_process/requirements.txt (0 hunks)
💤 Files with no reviewable changes (1)
  • DataIngestion/feature_extraction/pre_process/requirements.txt
🧰 Additional context used
🧬 Code Graph Analysis (2)
DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py (1)
DataIngestion/feature_extraction/feature-gpu/feature_classes/test.py (1)
  • fetch_data_to_pandas (47-56)
DataIngestion/feature_extraction/pre_process/preprocess.py (1)
DataIngestion/feature_extraction/pre_process/database_operations.py (2)
  • verify_table_exists (117-136)
  • save_wide_to_timescaledb (138-158)
🔇 Additional comments (21)
DataIngestion/feature_extraction/pre_process/report_for_preprocess/create_preprocessed_wide.sql (3)

5-7: Transaction and cleanup are correctly scoped
Wrapping the DROP and CREATE statements in a transaction ensures atomic schema updates and avoids partial states. Retaining DROP TABLE IF EXISTS simplifies iterative development and local experimentation.


9-14: Table definition has appropriate primary key
Defining time TIMESTAMPTZ NOT NULL and era_identifier TEXT NOT NULL with a composite primary key is exactly what's needed for uniqueness. The comment placeholder for sensor & engineered columns is clear for the Python script to fill in additional columns.


23-23: Commit completes the DDL block
COMMIT; finalizes the transaction. Everything here aligns with best practices for TimescaleDB DDL scripts.
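
For orientation, here is a minimal sketch of the DDL pattern the three comments above describe, driven from Python. It is not the file from this PR: the sensor columns and the DSN are invented placeholders, and the transaction that the standalone script expresses with BEGIN/COMMIT is supplied here by engine.begin() instead.

    # Sketch of the DROP / CREATE / create_hypertable pattern, under assumed names.
    from sqlalchemy import create_engine, text

    DDL = """
    DROP TABLE IF EXISTS preprocessed_wide;
    CREATE TABLE preprocessed_wide (
        time           TIMESTAMPTZ NOT NULL,
        era_identifier TEXT        NOT NULL,
        air_temp_c     DOUBLE PRECISION,   -- placeholder sensor column
        co2_ppm        DOUBLE PRECISION,   -- placeholder sensor column
        PRIMARY KEY (time, era_identifier)
    );
    SELECT create_hypertable('preprocessed_wide', 'time',
                             chunk_time_interval => INTERVAL '7 days',
                             if_not_exists => TRUE);
    """

    engine = create_engine("postgresql://user:pass@db:5432/postgres")  # placeholder DSN
    with engine.begin() as conn:      # provides the transaction wrapper
        conn.execute(text(DDL))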

DataIngestion/feature_extraction/pre_process/preprocess.dockerfile (2)

33-33: New SQL script is correctly copied into the image
COPY create_preprocessed_wide.sql /app/create_preprocessed_wide.sql ensures the DDL for the wide table is available at runtime.


37-38: Updated container startup to initialize new hypertable
The CMD now runs psql -f /app/create_preprocessed_wide.sql before calling python preprocess.py, aligning the docker startup with the new schema.

DataIngestion/feature_extraction/era_detection_rust/src/main.rs (1)

43-44: Default table name updated to new hypertable
Changing default_value = "preprocessed_wide" for --db-table aligns the Rust CLI’s default with the new TimescaleDB schema. Remember to update any documentation or examples that reference the old table name.

DataIngestion/docker-compose.yml (1)

57-57: Mount new wide-table init script in DB container
Mapping create_preprocessed_wide.sql into /docker-entrypoint-initdb.d/02_create_preprocessed_wide.sql ensures the database is seeded with the correct DDL on startup. The filename prefix 02_ guarantees the init order follows the existing scripts.

DataIngestion/feature_extraction/pre_process/prepare_era_data.py (2)

7-8: LGTM! Clean imports for the new database functionality.

The imports are correctly added to support SQLAlchemy engine creation and the new wide table saving functionality.


281-282: LGTM! Correct integration of the new wide table saving.

The call to save_wide_to_timescaledb correctly passes the processed DataFrame, era identifier, and engine, and appropriately logs the number of inserted rows.

DataIngestion/feature_extraction/feature-gpu/feature_classes/test.py (3)

64-64: LGTM! Updated docstring reflects the new table.

The docstring correctly references the new preprocessed_wide table.


69-75: LGTM! Query updated for wide table format.

The SQL query correctly targets the new preprocessed_wide table and uses SELECT * which is appropriate for the wide format where each feature is a separate column.

🧰 Tools
🪛 Ruff (0.11.9)

69-75: f-string without any placeholders

Remove extraneous f prefix

(F541)


86-86: LGTM! Simplified data processing for wide format.

The direct conversion from pandas to cuDF is correct since JSON normalization is no longer needed with the wide table format where features are already in separate columns.
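
A hedged sketch of this fetch-and-convert path is shown below; the real helper is fetch_data_to_pandas in feature_classes/test.py, whose exact signature may differ, and the DSN is a placeholder.

    # Sketch only: load the wide table into pandas, then hand it to cuDF unchanged.
    import pandas as pd
    import cudf
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@db:5432/postgres")  # placeholder DSN

    # No f-string needed here (cf. the Ruff F541 note): the query has no placeholders.
    query = "SELECT * FROM preprocessed_wide ORDER BY time"
    pdf = pd.read_sql(query, engine)

    # Each feature is already its own column, so no JSON normalization step remains;
    # the frame can go straight to the GPU for feature extraction.
    gdf = cudf.from_pandas(pdf)
    print(f"Loaded {len(gdf)} rows and {len(gdf.columns)} columns")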

DataIngestion/feature_extraction/pre_process/create_preprocessed_wide.sql (1)

5-23: LGTM! Well-structured SQL script for wide table creation.

The script correctly:

  • Uses transaction boundaries for atomicity
  • Drops existing table to ensure clean recreation
  • Creates appropriate schema with time and era_identifier columns
  • Sets up TimescaleDB hypertable with reasonable 7-day chunk interval
  • Uses composite primary key that makes sense for time-series data partitioned by era

The comment indicates this mirrors the previous JSONB hypertable behavior while supporting direct tsfresh querying, which aligns with the PR objectives.

DataIngestion/feature_extraction/pre_process/preprocess.py (4)

21-22: LGTM! Imports updated for new wide table functionality.

The imports correctly add the new functions needed for table verification and wide table data saving.


69-69: LGTM! SQL script path updated for wide table creation.

The path correctly references the new create_preprocessed_wide.sql script that creates the wide format table.


137-138: LGTM! Table verification updated consistently.

The verification correctly checks for the existence of the new preprocessed_wide table instead of the old preprocessed_features table.


503-505: LGTM! Proper integration of wide table data saving.

The code correctly:

  • Checks for table existence before attempting to save
  • Uses the new save_wide_to_timescaledb function with appropriate parameters
  • Logs the number of rows inserted for monitoring purposes

This maintains the same conditional saving pattern while using the new wide table format.

DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py (3)

45-45: LGTM: Query updated for wide table format.

The query change from selecting specific columns to SELECT * FROM preprocessed_wide is consistent with the migration to the new wide table format, eliminating the need for JSONB unnesting.


56-56: LGTM: Clear logging of data dimensions.

The logging statement provides useful information about the loaded dataset dimensions.


58-58: LGTM: Pandas to cuDF conversion.

The direct conversion from pandas DataFrame to cuDF is appropriate and simplified now that the data comes pre-structured from the wide table.

DataIngestion/feature_extraction/pre_process/database_operations.py (1)

138-159: LGTM: Well-implemented wide table saving function.

The new save_wide_to_timescaledb function is well-designed for the migration to wide table format:

  • Improved performance: Larger chunk size (10,000) should provide better insertion performance
  • Better return value: Returns actual row count instead of boolean, providing more useful information
  • Proper error handling: Gracefully handles empty DataFrames and exceptions
  • Clean implementation: Directly saves wide format data without complex transformations

The function correctly adds the era_identifier column and uses appropriate to_sql parameters for TimescaleDB integration.
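
To illustrate the behaviour described above, here is a minimal sketch of such a function; it approximates the review's description (empty-frame guard, era_identifier tagging, chunked to_sql append, row count returned) and is not the code from database_operations.py.

    # Sketch only: approximates the described behaviour, not the PR's implementation.
    import pandas as pd
    from sqlalchemy.engine import Engine


    def save_wide_to_timescaledb(df: pd.DataFrame, era_identifier: str, engine: Engine) -> int:
        """Append a wide-format frame to preprocessed_wide and return the row count."""
        if df is None or df.empty:
            return 0
        out = df.copy()
        out["era_identifier"] = era_identifier      # tag every row with its era
        try:
            out.to_sql(
                "preprocessed_wide",
                engine,
                if_exists="append",
                index=False,
                chunksize=10_000,                   # larger chunks for faster inserts
            )
            return len(out)
        except Exception as exc:
            print(f"Failed to write preprocessed_wide: {exc}")
            return 0

On the caller side (preprocess.py, prepare_era_data.py) the same sketch would be used behind a table-existence check, with the returned row count logged, matching the guard-then-save pattern noted earlier in this review.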

Comment on lines +3 to +5
This script connects to the TimescaleDB instance, reads the `preprocessed_wide` hypertable
(written by the preprocessing pipeline) where each feature already has its own column,
converts it to the long format expected by *tsfresh*, and finally extracts a rich

@coderabbitai (bot) commented May 24, 2025

⚠️ Potential issue

Docstring describes DB ingestion but code reads from JSON
The updated docstring states that this script “connects to the TimescaleDB instance” and “reads the preprocessed_wide hypertable,” yet the implementation still reads from config.CONSOLIDATED_DATA_FILE_PATH (a JSON file). This mismatch could confuse users and indicates the code wasn’t updated to actually query the new table.

To align behavior with the docstring, consider replacing the JSON read with a SQL query, for example:

-    consolidated_df = original_pandas.read_json(
-        consolidated_data_file_path, lines=True, orient="records"
-    )
+    # Establish a DB connection and fetch from preprocessed_wide
+    connector = SQLAlchemyPostgresConnector(
+        user=DB_USER, password=DB_PASSWORD,
+        host=DB_HOST, port=DB_PORT, db_name=DB_NAME,
+    )
+    consolidated_df = original_pandas.read_sql_table(
+        "preprocessed_wide", connector.engine
+    )

Please update the code or adjust the docstring to ensure consistency.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In DataIngestion/feature_extraction/feature/extract_features.py around lines 3
to 5, the docstring claims the script connects to TimescaleDB and reads from the
preprocessed_wide hypertable, but the code still reads from a JSON file at
config.CONSOLIDATED_DATA_FILE_PATH. To fix this inconsistency, either update the
code to replace the JSON file reading with a SQL query that fetches data from
the TimescaleDB preprocessed_wide hypertable, or modify the docstring to
accurately reflect that the data source is still the JSON file. Ensure the data
loading method and the docstring description match exactly.

@Fnux8890 (Owner, Author) commented

@claude help me with this

@coderabbitai (bot) commented

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

@claude (bot) commented May 24, 2025

Claude encountered an error — View job

I'll analyze this and get back to you.

Comment on lines 51 to +52

     pdf = connector.fetch_data_to_pandas(base_q if params is None
-                                         else connector.engine.execute(base_q,
-                                                                       params))
+                                         else connector.engine.execute(base_q, params))

@coderabbitai (bot) commented

⚠️ Potential issue

Fix inconsistent database query execution.

The current code has inconsistent method usage that will cause runtime errors. The connector.engine.execute() method is being called incorrectly, and the logic mixing two different execution patterns is problematic.

Apply this diff to fix the query execution:

-    pdf = connector.fetch_data_to_pandas(base_q if params is None
-                                         else connector.engine.execute(base_q, params))
+    pdf = connector.fetch_data_to_pandas(base_q, params)

The fetch_data_to_pandas method should handle both cases (with and without parameters) internally, as shown in the relevant code snippet from test.py.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    pdf = connector.fetch_data_to_pandas(base_q if params is None
-                                         else connector.engine.execute(base_q, params))
+    pdf = connector.fetch_data_to_pandas(base_q, params)
🤖 Prompt for AI Agents
In DataIngestion/feature_extraction/feature-gpu/extract_features_gpu.py around
lines 51 to 52, the code incorrectly mixes direct query strings and executed
results when calling connector.fetch_data_to_pandas, causing runtime errors. To
fix this, always pass the raw query string and parameters separately to
fetch_data_to_pandas, letting it handle execution internally. Remove the call to
connector.engine.execute and instead call fetch_data_to_pandas with the query
and params arguments consistently.
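
To show the calling convention this fix relies on, here is a hedged sketch of a connector whose fetch_data_to_pandas accepts an optional params mapping and handles execution internally. It is an assumption about the helper's shape, not the code from test.py, and the DSN and query are placeholders.

    # Sketch of a connector helper matching the suggested one-line call.
    import pandas as pd
    from sqlalchemy import create_engine, text


    class SQLAlchemyPostgresConnector:
        def __init__(self, dsn: str):
            self.engine = create_engine(dsn)

        def fetch_data_to_pandas(self, query: str, params: dict | None = None) -> pd.DataFrame:
            # pandas handles both the parameterless and the parameterised case,
            # so callers never pre-execute the query themselves.
            return pd.read_sql(text(query), self.engine, params=params)


    # Usage matching the suggestion above:
    connector = SQLAlchemyPostgresConnector("postgresql://user:pass@db:5432/postgres")
    pdf = connector.fetch_data_to_pandas(
        "SELECT * FROM preprocessed_wide WHERE era_identifier = :era",
        {"era": "Era1"},
    )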
