
Feat/reviews final clean #148

Merged

fedecaselli merged 3 commits into dev from feat/reviews-final-clean on Mar 18, 2026

Conversation

@fedecaselli
Collaborator

No description provided.

@cloudflare-workers-and-pages

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status: ✅ Deployment successful! (View logs)
Name: bookdb-landing
Latest Commit: d3a6015
Preview URLs: Commit Preview URL, Branch Preview URL
Updated (UTC): Mar 12 2026, 11:01 AM

@fedecaselli fedecaselli requested a review from catebros March 17, 2026 12:42
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the data processing pipeline by introducing a new step for cleaning and flagging Goodreads reviews. It focuses on improving data quality by systematically identifying and removing uninformative entries, and by clearly marking reviews that contain spoilers. These changes ensure that downstream models and analyses will operate on a more refined and feature-rich dataset. Additionally, it includes minor fixes to data pathing in existing exploratory data analysis notebooks and updates the data version control metadata to reflect the new state of the processed data.

Highlights

  • Data Version Control Update: The DVC metadata for the 'data' directory was updated, reflecting an increase in dataset size and file count.
  • Pathing Corrections in EDA Notebooks: Adjustments were made to the 'project_root' path calculation in two existing EDA notebooks to ensure correct data loading.
  • New Review Processing Notebook: A new Marimo notebook was introduced to perform comprehensive cleaning and spoiler flagging of Goodreads reviews. This includes dropping non-informative reviews and adding a 'has_spoiler' column.
Changelog
  • data.dvc
    • Updated DVC metadata to reflect an increase in the size and number of files within the 'data' directory.
  • notebooks/data/eda/reviews/reviews_dedup_eda.py
    • Corrected the 'project_root' path calculation by removing a redundant assignment.
  • notebooks/data/eda/reviews/reviews_spoiler_eda.py
    • Adjusted the 'project_root' path calculation to correctly resolve the project root.
  • notebooks/data/processing/reviews/5_reduce_and_flag_reviews.py
    • Added a new Marimo notebook for review data cleaning and spoiler flagging.
    • Implemented logic to drop the 'date_updated' column if present.
    • Introduced functions to identify and filter out empty, whitespace-only, single-character, punctuation-only, numbers-only, and repeated-character reviews.
    • Included a check for short reviews with low character variety.
    • Checked both the dedup and spoiler datasets for duplicate 'review_id' entries.
    • Applied combined filtering to remove non-informative and short+low-variety reviews from the dedup dataset.
    • Saved the IDs of dropped reviews to a JSON file.
    • Added a 'has_spoiler' flag to reviews based on their presence in the spoiler dataset.
    • Implemented validation checks for row counts, 'has_spoiler' values, and null/empty 'review_id' and 'review_text' fields.
    • Saved the final cleaned dataset to a new Parquet file.
Activity
  • No specific activity (comments, reviews, progress) was provided in the context for this pull request.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new data processing script to clean and flag review data, which is a valuable addition to the data pipeline. The script effectively handles empty, non-informative, and duplicate reviews, and correctly adds a spoiler flag. The validation checks at the end ensure data integrity. However, there are opportunities to improve performance by leveraging native Polars expressions instead of Python UDFs for filtering and column creation, especially for large datasets.

def _(mo, pl):
    import os
-   project_root = __import__("pathlib").Path(__file__).resolve().parents[3]
+   project_root = __import__("pathlib").Path(__file__).resolve().parents[4]


Severity: high

The parents level for project_root was changed from 3 to 4. Please verify if this change correctly resolves the project root path for this specific notebook. An incorrect path could lead to data loading errors.
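One way to verify the resolved root is to assert that a known marker exists at the expected level. A hypothetical helper, assuming the DVC-tracked "data" directory sits at the project root (the function name and marker are illustrative, not part of the PR):

```python
from pathlib import Path

def resolve_project_root(start: Path, levels: int, marker: str = "data") -> Path:
    """Walk `levels` parents up from `start` and verify that `marker`
    (e.g. the DVC-tracked 'data' directory) exists at the result."""
    root = start.resolve().parents[levels]
    if not (root / marker).exists():
        raise FileNotFoundError(f"{marker!r} not found under {root}")
    return root
```

Calling this once at notebook start-up turns a silently wrong `parents[N]` index into an immediate, descriptive error instead of a data-loading failure later on.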


# Get reviews to drop
df_to_drop = df_dedup_clean.filter(
    pl.col("review_text").map_elements(_should_drop, return_dtype=pl.Boolean)
)


Severity: high

Using map_elements with Python functions (_should_drop, is_non_informative, is_short_low_variety) can be significantly slower than native Polars expressions, especially for large datasets. Consider rewriting these conditions as Polars expressions: for example, len(text) == 1 becomes pl.col("review_text").str.len_chars() == 1, and the regex checks can use anchored patterns with pl.col("review_text").str.contains().

This aligns with the general rule to perform a single scan with multiple aggregations/operations using Polars for efficiency.

    # Polars' regex engine (Rust `regex`) has no backreferences, so the
    # repeated-character case is expressed via a distinct-character count.
    n_unique_chars = (
        pl.col("review_text").str.replace_all(" ", "").str.extract_all(r".").list.n_unique()
    )
    drop_condition = (
        (pl.col("review_text").str.len_chars() == 1)
        | pl.col("review_text").str.contains(r"^[\W_]+$")
        | pl.col("review_text").str.contains(r"^[\d\s]+$")
        # single character repeated (length >= 2, only one distinct character)
        | ((pl.col("review_text").str.len_chars() >= 2) & (n_unique_chars == 1))
        # short review with low character variety
        | ((pl.col("review_text").str.len_chars() <= 5) & (n_unique_chars <= 2))
    )

    df_to_drop = df_dedup_clean.filter(drop_condition)
    df_dedup_final = df_dedup_clean.filter(~drop_condition)
References
  1. When calculating multiple statistics (e.g., count, null counts) from a dataset using Polars, perform a single scan with multiple aggregations in one select statement. This is more efficient than performing multiple separate scans or reads.

Comment on lines +437 to +441
    pl.col("review_id").map_elements(
        lambda x: 1 if x in spoiler_ids else 0,
        return_dtype=pl.Int8
    ).alias("has_spoiler")
)


Severity: high

Using map_elements with a lambda function for checking review_id in spoiler_ids can be inefficient. Polars offers a native is_in expression which is highly optimized for this type of operation. This will improve performance, especially with large datasets.

This aligns with the general rule to perform a single scan with multiple aggregations/operations using Polars for efficiency.

    df_with_spoiler = df_dedup_final.with_columns(
        pl.col("review_id").is_in(pl.Series(list(spoiler_ids))).cast(pl.Int8).alias("has_spoiler")
    )
References
  1. When calculating multiple statistics (e.g., count, null counts) from a dataset using Polars, perform a single scan with multiple aggregations in one select statement. This is more efficient than performing multiple separate scans or reads.

@fedecaselli fedecaselli merged commit 0423d25 into dev Mar 18, 2026
7 checks passed