
Feat/clustering #145

Merged
catebros merged 4 commits into dev from feat/clustering
Mar 11, 2026


@catebros
Collaborator

  1. Collect each user's seed books: books rated >= 4, shelved books, and listed books
  2. Cluster the embeddings of those books
  3. Recommend per cluster: compute each cluster's centroid and query the vector index with it
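
The three steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the PR's implementation: `kmeans`, `centroid`, `sq_dist`, the farthest-point initialisation, and the toy 2-D vectors are all assumptions made for the example. The actual code clusters Qdrant embeddings through `cluster_seeds_by_embedding` and supports optional weighting, which this sketch ignores.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors: Dict[str, Vector], k: int, iters: int = 20) -> List[List[str]]:
    """Plain Lloyd's k-means; returns the non-empty clusters as lists of book IDs."""
    ids = sorted(vectors)
    # Deterministic farthest-point initialisation (an assumption for this sketch).
    centers = [list(vectors[ids[0]])]
    while len(centers) < k:
        far = max(ids, key=lambda b: min(sq_dist(vectors[b], c) for c in centers))
        centers.append(list(vectors[far]))

    clusters: List[List[str]] = []
    for _ in range(iters):
        # Assignment step: each book joins its nearest center.
        clusters = [[] for _ in range(k)]
        for book_id in ids:
            nearest = min(range(k), key=lambda c: sq_dist(vectors[book_id], centers[c]))
            clusters[nearest].append(book_id)
        # Update step: move each center to the mean of its members.
        new_centers = [
            centroid([vectors[b] for b in members]) if members else centers[c]
            for c, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return [members for members in clusters if members]


# Step 1 (stubbed): seed books with toy 2-D "embeddings".
seed_vectors = {
    "fantasy-1": [0.0, 0.1], "fantasy-2": [0.1, 0.0],
    "history-1": [5.0, 5.1], "history-2": [5.1, 5.0],
}
# Step 2: cluster the seed embeddings.
clusters = kmeans(seed_vectors, k=2)
# Step 3: one query vector (centroid) per cluster.
query_vectors = [centroid([seed_vectors[b] for b in members]) for members in clusters]
```

Each entry in `query_vectors` would then be sent to the vector index as a separate similarity query, so a reader with two distinct interests gets neighbours of both rather than neighbours of a single averaged profile.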

@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Mar 10, 2026

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status | Name | Latest Commit | Preview URL | Updated (UTC)
✅ Deployment successful! (View logs) | bookdb-landing | 230e6aa | Commit Preview URL, Branch Preview URL | Mar 10 2026, 07:46 PM

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the book recommendation system by introducing a clustering-based approach. Instead of generating recommendations from a single user profile vector, the system now clusters a user's interaction history into distinct interest groups. Recommendations are then generated for each cluster's centroid, aiming to provide more diverse and relevant suggestions by capturing multiple facets of a user's reading preferences. A fallback mechanism ensures that the previous recommendation method is used if clustering is not applicable or yields no results.

Highlights

  • Clustering-based Recommendations: Implemented a new book recommendation strategy that clusters user interaction history to provide more diverse and relevant suggestions.
  • Vector Search Enhancements: Modified the most_similar_by_vector function to allow excluding specific book IDs from similarity searches, improving recommendation precision.
  • Embedding Vector Retrieval: Added a new utility function, get_vectors_by_ids, to efficiently fetch embedding vectors for a given list of book IDs from the Qdrant vector database.
  • New Clustering Module: Introduced a dedicated module for K-means clustering of book embeddings, enabling the grouping of similar books based on their vector representations.

Changelog

  • apps/api/core/embeddings.py
    • Modified most_similar_by_vector to accept an exclude_ids parameter, allowing specific book IDs to be excluded from similarity searches.
    • Added a new function get_vectors_by_ids to retrieve embedding vectors for a list of Goodreads IDs from Qdrant.
  • apps/api/routers/discovery.py
    • Imported zip_longest for combining results from multiple clusters.
    • Imported cluster_seeds_by_embedding from the new clustering module.
    • Imported most_similar_by_vector and get_vectors_by_ids from the embeddings module.
    • Defined MIN_SEEDS_FOR_CLUSTERING constant.
    • Implemented _cluster_vector_recommendations to orchestrate the clustering process, including collecting user interaction seeds, fetching vectors, performing K-means clustering, and generating recommendations per cluster.
    • Updated get_recommendations to prioritize the new _cluster_vector_recommendations method, falling back to the existing _interaction_vector_recommendations if clustering is not feasible or yields no results.
  • bookdb/vector_db/clustering.py
    • Added a new file containing the cluster_seeds_by_embedding function, which performs K-means clustering on book embedding vectors with optional weighting.
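
The changelog mentions `zip_longest` for combining results from multiple clusters. One plausible reading is a round-robin merge, sketched below. `interleave_cluster_hits` and its parameters are hypothetical names chosen for this example, and the merge logic in the PR may differ.

```python
from itertools import zip_longest
from typing import Iterable, List, Sequence, Set


def interleave_cluster_hits(
    per_cluster_hits: Sequence[Sequence[int]],
    limit: int,
    exclude_ids: Iterable[int] = (),
) -> List[int]:
    """Take one hit from each cluster in turn, skipping duplicates and excluded IDs."""
    if limit <= 0:
        return []
    seen: Set[int] = set(exclude_ids)
    merged: List[int] = []
    # zip_longest pads exhausted clusters with None, so longer lists keep contributing.
    for round_of_hits in zip_longest(*per_cluster_hits):
        for book_id in round_of_hits:
            if book_id is None or book_id in seen:
                continue
            seen.add(book_id)
            merged.append(book_id)
            if len(merged) == limit:
                return merged
    return merged


# Three clusters of ranked hits; ID 3 is already recommended elsewhere.
merged = interleave_cluster_hits([[1, 2, 3], [4, 2], [5]], limit=5, exclude_ids={3})
print(merged)  # [1, 4, 5, 2]
```

Interleaving this way keeps the top hit of every cluster near the front of the list, so a small cluster is not crowded out by a large one.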

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new recommendation strategy based on clustering user interaction embeddings. The user's book interactions are clustered, and recommendations are generated from the center of each cluster. The changes include new functions for fetching vectors from Qdrant, a clustering utility, and the main recommendation logic that integrates this new strategy with a fallback to the previous method.

My review focuses on improving robustness and maintainability. The main points are:

  • Replacing broad except Exception blocks with more specific exception handling and logging to avoid silencing errors.
  • Refactoring duplicated code to improve readability.
  • Highlighting a potentially brittle approach to handling named vectors in Qdrant.

Comment on lines +161 to +164
    try:
        vector_map = get_vectors_by_ids(qdrant_client, list(seed_scores.keys()))
    except Exception:
        return []

high

Using a broad except Exception can mask important errors and make debugging difficult. It's better to catch more specific exceptions that you expect from get_vectors_by_ids. Consider also logging the exception to ensure failures are not silent.
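
A minimal sketch of what this suggestion could look like. The review does not name the exceptions to catch, so `VectorStoreError` is a hypothetical stand-in for whatever error types the vector database client raises, and `fetch_seed_vectors` with its `fetch` callable is an illustrative wrapper, not code from the PR.

```python
import logging
from typing import Callable, Dict, List, Sequence

logger = logging.getLogger("discovery")


class VectorStoreError(Exception):
    """Hypothetical stand-in for the vector database client's error types."""


def fetch_seed_vectors(
    fetch: Callable[[Sequence[int]], Dict[int, List[float]]],
    seed_ids: Sequence[int],
) -> Dict[int, List[float]]:
    """Return vectors for seed_ids, or {} if the vector store is unavailable."""
    try:
        return fetch(seed_ids)
    except (VectorStoreError, ConnectionError, TimeoutError):
        # Expected infrastructure failures: log the traceback, then degrade gracefully.
        logger.exception("Fetching vectors for %d seed IDs failed", len(seed_ids))
        return {}
    # Anything else (for example a TypeError from a programming bug) propagates.


def unavailable_store(ids: Sequence[int]) -> Dict[int, List[float]]:
    raise VectorStoreError("vector store unreachable")


result = fetch_seed_vectors(unavailable_store, [1, 2, 3])  # logs the error, returns {}
```

The caller can still fall back to the non-clustered path when it receives an empty result, but the failure now leaves a trace in the logs and genuine bugs are no longer swallowed.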

Comment on lines +171 to +174
    try:
        clusters = cluster_seeds_by_embedding(valid_seeds, seed_scores, n_clusters)
    except Exception:
        return []

high

This try...except block also uses a broad except Exception, which can hide errors occurring during the clustering process. Please catch more specific exceptions and consider logging any errors for better observability.

Comment on lines +197 to +204
    try:
        hits = most_similar_by_vector(
            qdrant_client,
            query_vector=centroid.tolist(),
            top_k=per_cluster_limits[cluster_idx],
            exclude_ids=cluster_excluded,
        )
    except Exception:

high

This is another case of catching a broad Exception, which can silence errors from most_similar_by_vector. This makes it hard to diagnose issues with Qdrant queries. It's recommended to handle specific exceptions and log them.

Comment on lines +105 to +106
    if isinstance(vector, dict):
        vector = next(iter(vector.values()), None)

medium

This logic to handle named vectors by picking the first one from the dictionary can be brittle. If multiple named vectors exist, the one chosen is arbitrary and depends on dictionary insertion order. It would be more robust to either expect a specific vector name or handle the case of multiple vectors more explicitly. If only one vector is ever expected, adding a comment to clarify this assumption would be helpful.
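
One way to make the assumption explicit is sketched below. `extract_vector` and its `name` parameter are hypothetical; the vector names used in the project's Qdrant collection are not shown in this PR.

```python
from typing import Dict, List, Optional, Union

VectorPayload = Union[List[float], Dict[str, List[float]], None]


def extract_vector(vector: VectorPayload, name: Optional[str] = None) -> Optional[List[float]]:
    """Return a single embedding from an unnamed or named-vector payload."""
    if not isinstance(vector, dict):
        # Unnamed vector (or None): nothing to disambiguate.
        return vector
    if name is not None:
        # Explicit selection: a missing name yields None, never a silent substitute.
        return vector.get(name)
    if not vector:
        return None
    if len(vector) == 1:
        # Exactly one named vector, so the choice is unambiguous.
        return next(iter(vector.values()))
    raise ValueError(f"Ambiguous named vectors {sorted(vector)}; pass name=...")


single = extract_vector({"text": [0.1, 0.2]})
chosen = extract_vector({"text": [0.1, 0.2], "cover": [0.9]}, name="cover")
```

With this shape, adding a second named vector to the collection later produces a clear error at the call site instead of silently switching which embedding is used.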

Comment on lines +296 to +311
        interaction_goodreads_ids = _cluster_vector_recommendations(
            db,
            current_user.id,
            qdrant_client=qdrant,
            limit=max(limit * 4, 80),
            exclude_ids=set(bpr_goodreads_ids),
        )
        if not interaction_goodreads_ids:
            # Fall back
            interaction_goodreads_ids = _interaction_vector_recommendations(
                db,
                current_user.id,
                qdrant_client=qdrant,
                limit=max(limit * 4, 80),
                exclude_ids=set(bpr_goodreads_ids),
            )

medium

The arguments passed to _cluster_vector_recommendations and the fallback _interaction_vector_recommendations are identical. To improve readability and avoid repetition, you can define the arguments once in a dictionary and unpack it for both function calls.

        reco_args = {
            "db": db,
            "user_id": current_user.id,
            "qdrant_client": qdrant,
            "limit": max(limit * 4, 80),
            "exclude_ids": set(bpr_goodreads_ids),
        }
        interaction_goodreads_ids = _cluster_vector_recommendations(**reco_args)
        if not interaction_goodreads_ids:
            # Fall back
            interaction_goodreads_ids = _interaction_vector_recommendations(**reco_args)

Collaborator

@fedecaselli fedecaselli left a comment

Very meaningful

Owner

@yamirghofran yamirghofran left a comment

Great work

@catebros catebros merged commit b8ecb4a into dev Mar 11, 2026
4 checks passed

3 participants