
Feat/reviews final clean #148

Merged

fedecaselli merged 3 commits into dev from feat/reviews-final-clean on Mar 18, 2026

Conversation

@fedecaselli
Collaborator

No description provided.

@cloudflare-workers-and-pages

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status: ✅ Deployment successful! (View logs)
Name: bookdb-landing
Latest Commit: d3a6015
Preview URLs: Commit Preview URL, Branch Preview URL
Updated (UTC): Mar 12 2026, 11:01 AM

@fedecaselli fedecaselli requested a review from catebros March 17, 2026 12:42
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the data processing pipeline by introducing a new step for cleaning and flagging Goodreads reviews. It focuses on improving data quality by systematically identifying and removing uninformative entries, and by clearly marking reviews that contain spoilers. These changes ensure that downstream models and analyses will operate on a more refined and feature-rich dataset. Additionally, it includes minor fixes to data pathing in existing exploratory data analysis notebooks and updates the data version control metadata to reflect the new state of the processed data.

Highlights

  • Data Version Control Update: The DVC metadata for the 'data' directory was updated, reflecting an increase in dataset size and file count.
  • Pathing Corrections in EDA Notebooks: Adjustments were made to the 'project_root' path calculation in two existing EDA notebooks to ensure correct data loading.
  • New Review Processing Notebook: A new Marimo notebook was introduced to perform comprehensive cleaning and spoiler flagging of Goodreads reviews. This includes dropping non-informative reviews and adding a 'has_spoiler' column.
Changelog
  • data.dvc
    • Updated DVC metadata to reflect an increase in the size and number of files within the 'data' directory.
  • notebooks/data/eda/reviews/reviews_dedup_eda.py
    • Corrected the 'project_root' path calculation by removing a redundant assignment.
  • notebooks/data/eda/reviews/reviews_spoiler_eda.py
    • Adjusted the 'project_root' path calculation to correctly resolve the project root.
  • notebooks/data/processing/reviews/5_reduce_and_flag_reviews.py
    • Added a new Marimo notebook for review data cleaning and spoiler flagging.
    • Implemented logic to drop the 'date_updated' column if present.
    • Introduced functions to identify and filter out empty, whitespace-only, single-character, punctuation-only, numbers-only, and repeated-character reviews.
    • Included a check for short reviews with low character variety.
    • Checked both the dedup and spoiler datasets for duplicate 'review_id' entries.
    • Applied combined filtering to remove non-informative and short+low-variety reviews from the dedup dataset.
    • Saved the IDs of dropped reviews to a JSON file.
    • Added a 'has_spoiler' flag to reviews based on their presence in the spoiler dataset.
    • Implemented validation checks for row counts, 'has_spoiler' values, and null/empty 'review_id' and 'review_text' fields.
    • Saved the final cleaned dataset to a new Parquet file.
Activity
  • No specific activity (comments, reviews, progress) was provided in the context for this pull request.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new data processing script to clean and flag review data, which is a valuable addition to the data pipeline. The script effectively handles empty, non-informative, and duplicate reviews, and correctly adds a spoiler flag. The validation checks at the end ensure data integrity. However, there are opportunities to improve performance by leveraging native Polars expressions instead of Python UDFs for filtering and column creation, especially for large datasets.

def _(mo, pl):
    import os
-   project_root = __import__("pathlib").Path(__file__).resolve().parents[3]
+   project_root = __import__("pathlib").Path(__file__).resolve().parents[4]


Severity: high

The parents level for project_root was changed from 3 to 4. Please verify if this change correctly resolves the project root path for this specific notebook. An incorrect path could lead to data loading errors.
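One way to verify the resolved root is to assert that a known marker exists at the expected level. A hypothetical helper, assuming the DVC-tracked "data" directory sits at the project root (the function name and marker are illustrative, not part of the PR):

```python
from pathlib import Path

def resolve_project_root(start: Path, levels: int, marker: str = "data") -> Path:
    """Walk `levels` parents up from `start` and verify that `marker`
    (e.g. the DVC-tracked 'data' directory) exists at the result."""
    root = start.resolve().parents[levels]
    if not (root / marker).exists():
        raise FileNotFoundError(f"{marker!r} not found under {root}")
    return root
```

Calling this once at notebook start-up turns a silently wrong `parents[N]` index into an immediate, descriptive error instead of a data-loading failure later on.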


# Get reviews to drop
df_to_drop = df_dedup_clean.filter(
    pl.col("review_text").map_elements(_should_drop, return_dtype=pl.Boolean)
)


Severity: high

Using map_elements with Python functions (_should_drop, is_non_informative, is_short_low_variety) can be significantly slower than native Polars expressions, especially for large datasets. Consider rewriting these conditions as Polars expressions: for example, len(text) == 1 becomes pl.col("review_text").str.len_chars() == 1, and the regex checks can use anchored patterns with pl.col("review_text").str.contains().

This aligns with the general rule to perform a single scan with multiple aggregations/operations using Polars for efficiency.

    # Polars' regex engine (Rust `regex`) has no backreferences, so the
    # repeated-character case is expressed via a distinct-character count.
    n_unique_chars = (
        pl.col("review_text").str.replace_all(" ", "").str.extract_all(r".").list.n_unique()
    )
    drop_condition = (
        (pl.col("review_text").str.len_chars() == 1)
        | pl.col("review_text").str.contains(r"^[\W_]+$")
        | pl.col("review_text").str.contains(r"^[\d\s]+$")
        # single character repeated (length >= 2, only one distinct character)
        | ((pl.col("review_text").str.len_chars() >= 2) & (n_unique_chars == 1))
        # short review with low character variety
        | ((pl.col("review_text").str.len_chars() <= 5) & (n_unique_chars <= 2))
    )

    df_to_drop = df_dedup_clean.filter(drop_condition)
    df_dedup_final = df_dedup_clean.filter(~drop_condition)
References
  1. When calculating multiple statistics (e.g., count, null counts) from a dataset using Polars, perform a single scan with multiple aggregations in one select statement. This is more efficient than performing multiple separate scans or reads.

Comment on lines +437 to +441
    pl.col("review_id").map_elements(
        lambda x: 1 if x in spoiler_ids else 0,
        return_dtype=pl.Int8
    ).alias("has_spoiler")
)


Severity: high

Using map_elements with a lambda function for checking review_id in spoiler_ids can be inefficient. Polars offers a native is_in expression which is highly optimized for this type of operation. This will improve performance, especially with large datasets.

This aligns with the general rule to perform a single scan with multiple aggregations/operations using Polars for efficiency.

    df_with_spoiler = df_dedup_final.with_columns(
        pl.col("review_id").is_in(pl.Series(list(spoiler_ids))).cast(pl.Int8).alias("has_spoiler")
    )
References
  1. When calculating multiple statistics (e.g., count, null counts) from a dataset using Polars, perform a single scan with multiple aggregations in one select statement. This is more efficient than performing multiple separate scans or reads.

@fedecaselli fedecaselli merged commit 0423d25 into dev Mar 18, 2026
7 checks passed