Deploying with

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs | bookdb-landing | d3a6015 | Commit Preview URL / Branch Preview URL | Mar 12 2026, 11:01 AM |
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the data processing pipeline by introducing a new step for cleaning and flagging Goodreads reviews. It improves data quality by systematically identifying and removing uninformative entries and by clearly marking reviews that contain spoilers, ensuring that downstream models and analyses operate on a more refined, feature-rich dataset. It also includes minor fixes to data paths in existing exploratory data analysis notebooks and updates the data version control metadata to reflect the new state of the processed data.
Code Review
This pull request introduces a new data processing script to clean and flag review data, which is a valuable addition to the data pipeline. The script effectively handles empty, non-informative, and duplicate reviews, and correctly adds a spoiler flag. The validation checks at the end ensure data integrity. However, there are opportunities to improve performance by leveraging native Polars expressions instead of Python UDFs for filtering and column creation, especially for large datasets.
```diff
 def _(mo, pl):
     import os
-    project_root = __import__("pathlib").Path(__file__).resolve().parents[3]
+    project_root = __import__("pathlib").Path(__file__).resolve().parents[4]
```

```python
    # Get reviews to drop
    df_to_drop = df_dedup_clean.filter(
        pl.col("review_text").map_elements(_should_drop, return_dtype=pl.Boolean)
```
Using map_elements with Python functions (_should_drop, is_non_informative, is_short_low_variety) can be significantly slower than native Polars expressions, especially for large datasets. Consider rewriting these conditions as expressions: for example, len(text) == 1 becomes pl.col("review_text").str.len_chars() == 1, and the regex checks can use pl.col("review_text").str.contains() with anchored patterns. Note that Polars' Rust regex engine does not support backreferences, so a pattern like r'^(.)\1+$' must be expressed differently (e.g., as a distinct-character count).
This aligns with the general rule to perform a single scan with multiple aggregations/operations using Polars for efficiency.
```python
# Build the drop condition once as a native expression and reuse it
# for both the dropped and the kept frames.
drop_cond = (
    (pl.col("review_text").str.len_chars() == 1)
    | pl.col("review_text").str.contains(r"^[\W_]+$")
    | pl.col("review_text").str.contains(r"^[\d\s]+$")
    # Rust regex lacks backreferences, so r'^(.)\1+$' (one repeated char)
    # is rewritten as "longer than one char, but only one distinct char":
    | (
        (pl.col("review_text").str.len_chars() > 1)
        & (pl.col("review_text").str.extract_all(r".").list.n_unique() == 1)
    )
    | (
        (pl.col("review_text").str.len_chars() <= 5)
        & (
            pl.col("review_text")
            .str.replace_all(" ", "")
            .str.extract_all(r".")
            .list.n_unique()
            <= 2
        )
    )
)

df_to_drop = df_dedup_clean.filter(drop_cond)
df_dedup_final = df_dedup_clean.filter(~drop_cond)
```

References

- When calculating multiple statistics (e.g., count, null counts) from a dataset using Polars, perform a single scan with multiple aggregations in one `select` statement. This is more efficient than performing multiple separate scans or reads.
```python
    pl.col("review_id").map_elements(
        lambda x: 1 if x in spoiler_ids else 0,
        return_dtype=pl.Int8
    ).alias("has_spoiler")
)
```
Using map_elements with a lambda function for checking review_id in spoiler_ids can be inefficient. Polars offers a native is_in expression which is highly optimized for this type of operation. This will improve performance, especially with large datasets.
```python
df_with_spoiler = df_dedup_final.with_columns(
    pl.col("review_id").is_in(list(spoiler_ids)).cast(pl.Int8).alias("has_spoiler")
)
```