
Ner#147

Merged
yamirghofran merged 7 commits into dev from NER
Mar 19, 2026

Conversation

@leaabj
Collaborator

@leaabj leaabj commented Mar 13, 2026

Adds named entity recognition (NER) to the book recommendation chatbot. When users mention specific book titles or author names, the system now:

  • Extracts book titles and author names from natural language using Groq LLM
  • Finds matching books/authors in the database with fuzzy matching (pg_trgm with Python fallback)
  • Generates context-aware recommendations based on real book metadata
  • Gracefully falls back to the original behavior when no entities are found
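
The Python fallback for fuzzy matching mentioned in the list above could look roughly like the sketch below. It assumes the fallback compares case-normalized strings with `difflib.SequenceMatcher`; the names `string_similarity` and `best_match` and the 0.6 threshold are hypothetical and are not taken from this PR.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def string_similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity score, ignoring case and outer whitespace."""
    a_norm, b_norm = a.strip().lower(), b.strip().lower()
    if not a_norm or not b_norm:
        # An empty string never counts as a match.
        return 0.0
    return SequenceMatcher(None, a_norm, b_norm).ratio()


def best_match(
    query: str, candidates: List[str], threshold: float = 0.6
) -> Optional[str]:
    """Return the most similar candidate, or None if nothing reaches the threshold.

    The threshold value is an assumption chosen for illustration only.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: string_similarity(query, c))
    return best if string_similarity(query, best) >= threshold else None
```

For example, `best_match("Hary Potter", ["Harry Potter", "The Hobbit"])` tolerates the typo and selects "Harry Potter", while a query with no close candidate yields `None`, which is the case where the chatbot would fall back to its original behavior.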

Closes #138

@leaabj leaabj requested a review from yamirghofran March 13, 2026 11:14
@leaabj leaabj marked this pull request as ready for review March 13, 2026 11:14
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Mar 13, 2026

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status: ✅ Deployment successful! (View logs)
Name: bookdb-landing
Latest Commit: cf5486d
Preview URL: Commit Preview URL, Branch Preview URL
Updated (UTC): Mar 19 2026, 09:41 AM

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request upgrades the book recommendation chatbot by integrating Named Entity Recognition (NER). The system identifies specific book titles and author names mentioned in user queries, then uses that information to generate context-aware recommendations based on real book metadata. When no entities are found, it falls back to the original behavior.

Highlights

  • Named Entity Recognition (NER) Implementation: Added NER capabilities that automatically extract book titles and author names from user queries using a Groq Large Language Model (LLM).
  • Fuzzy Matching for Database Lookup: Introduced fuzzy matching logic, leveraging PostgreSQL's pg_trgm extension (with a Python difflib fallback), to accurately find corresponding books and authors in the database based on extracted entities.
  • Context-Aware Query Rewriting: Enhanced the chatbot's query rewriting mechanism to incorporate the extracted entity context, leading to more relevant and personalized book recommendations.
  • Configurable Entity Extraction Settings: Added new configuration settings to control various aspects of the entity extraction process, including enabling/disabling the feature, specifying the LLM model, setting similarity and confidence thresholds, and managing cache behavior.
  • Comprehensive Documentation and Testing: Provided detailed documentation for the new entity extraction feature and a dedicated test suite to ensure its correctness and robustness.

Changelog

  • apps/api/core/config.py
    • Added new configuration parameters for controlling entity extraction, including enabling/disabling the feature, specifying the LLM model, setting similarity and confidence thresholds, and configuring cache behavior.
  • apps/api/core/entity_extraction.py
    • Introduced a new module for LLM-based entity extraction, fuzzy database lookup for books and authors, context string generation, and entity resolution, complete with caching.
  • apps/api/routers/books.py
    • Integrated the new entity extraction logic into the chatbot's search pipeline, allowing the system to resolve entities from user queries and pass this context to the query rewriter.
  • bookdb/models/chatbot_llm.py
    • Updated LLM prompt templates and query rewriting functions to accept and utilize the extracted entity context, enabling more informed book descriptions.
  • context.md
    • Removed a markdown file that contained old SQL error logs.
  • docs/entity-extraction.md
    • Added detailed documentation for the new entity extraction feature, covering its architecture, components, usage examples, performance considerations, and testing.
  • tests/test_api/test_entity_extraction.py
    • Added a new test suite to validate the functionality of the entity extraction module, including string similarity, caching, and LLM integration.
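
The integration described above for apps/api/routers/books.py (resolve entities, pass context to the query rewriter, fall back when nothing is found) can be sketched as follows. This is only an illustration under stated assumptions: `build_entity_context`, its injected `extract` and `resolve` callables, and the 0.5 confidence cutoff are hypothetical; only the `book_titles`, `author_names`, and `confidence` keys appear in this PR's tests.

```python
from typing import Callable, List, Optional


def build_entity_context(
    query: str,
    extract: Callable[[str], dict],
    resolve: Callable[[dict], List[str]],
    min_confidence: float = 0.5,
) -> Optional[str]:
    """Return a context string for the query rewriter, or None to fall back."""
    try:
        entities = extract(query)
    except Exception:
        # Any extraction failure falls back to the context-free behavior.
        return None

    has_entities = entities.get("book_titles") or entities.get("author_names")
    if not has_entities or entities.get("confidence", 0.0) < min_confidence:
        return None

    # resolve() stands in for the fuzzy database lookup; it returns one
    # context string per matched book or author.
    matches = resolve(entities)
    return "\n\n".join(matches) if matches else None
```

Returning `None` in every failure path keeps the caller simple: the rewriter receives either a usable context string or nothing, and in the latter case it can use the original prompt unchanged.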

Activity

  • The pull request was created by leaabj.
  • The description clearly outlines the intent and changes of the feature.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature: Named Entity Recognition (NER) for the book recommendation chatbot. The implementation uses a Groq LLM for entity extraction and fuzzy matching against the database to provide context-aware recommendations. The changes are well-structured, with a new module for entity extraction, updates to LLM prompts, and integration into the search pipeline. The addition of documentation and tests is also a great step. My review focuses on improving configuration management, code clarity, performance, and test coverage to ensure the feature is robust and maintainable.

# Entity Extraction with LLM
# ============================================================================

_ENTITY_EXTRACTION_MODEL = "meta-llama/llama-4-scout-17b-16e-instruct"

high

Several values in this file are hardcoded instead of being read from the application settings, which makes configuration difficult.

  • On this line, _ENTITY_EXTRACTION_MODEL should come from settings.ENTITY_EXTRACTION_MODEL.
  • On line 160, the ttl for _entity_lookup_cache should use settings.ENTITY_CACHE_TTL.

To fix this, please import settings from ..core.config and use these settings values.
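
A minimal sketch of what reading these two values from settings might look like is shown below. The setting names `ENTITY_EXTRACTION_MODEL` and `ENTITY_CACHE_TTL` come from the comment above; the `EntitySettings` dataclass, the `load_entity_settings` helper, and the use of environment variables are assumptions for illustration, since the actual shape of `settings` in ..core.config is not shown here. The defaults mirror the hardcoded model name and the 3600-second TTL that the tests in this PR assert.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class EntitySettings:
    """Hypothetical stand-in for the entity-related fields of the app settings."""

    ENTITY_EXTRACTION_MODEL: str
    ENTITY_CACHE_TTL: int


def load_entity_settings(env: Mapping[str, str] = os.environ) -> EntitySettings:
    """Build settings from a mapping of environment variables, with defaults."""
    return EntitySettings(
        ENTITY_EXTRACTION_MODEL=env.get(
            "ENTITY_EXTRACTION_MODEL",
            "meta-llama/llama-4-scout-17b-16e-instruct",
        ),
        # TTL is in seconds; environment values arrive as strings.
        ENTITY_CACHE_TTL=int(env.get("ENTITY_CACHE_TTL", "3600")),
    )


settings = load_entity_settings()
```

The extraction module would then reference `settings.ENTITY_EXTRACTION_MODEL` and `settings.ENTITY_CACHE_TTL` instead of module-level constants, so a deployment can change the model or cache lifetime without a code change.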

Comment on lines +1 to +171
"""Tests for entity extraction functionality.

Tests are designed to work with the existing test environment
and pytest configuration.
"""

import pytest
import os

# Test imports that work with project structure
from apps.api.core.entity_extraction import (
    _string_similarity,
    get_cache_stats,
    clear_entity_cache,
)


# ============================================================================
# Unit Tests: String Similarity
# ============================================================================


def test_string_similarity_exact_match():
    """Test exact match returns 1.0."""
    score = _string_similarity("Harry Potter", "Harry Potter")
    assert score == pytest.approx(1.0, abs=0.01)


def test_string_similarity_case_insensitive():
    """Test case-insensitive matching."""
    score = _string_similarity("Harry Potter", "harry potter")
    assert score == pytest.approx(1.0, abs=0.01)


def test_string_similarity_partial_match():
    """Test partial matching."""
    score = _string_similarity("Harry Potter", "Harry")
    assert score > 0.5
    assert score < 1.0


def test_string_similarity_no_match():
    """Test no match returns low score."""
    score = _string_similarity("Harry Potter", "Lord of the Rings")
    assert score < 0.3


def test_string_similarity_typo_tolerance():
    """Test typo tolerance."""
    score = _string_similarity("Harry Potter", "Hary Potter")
    assert score > 0.8


# ============================================================================
# Unit Tests: Cache Management
# ============================================================================


def test_cache_stats():
    """Get cache statistics."""
    stats = get_cache_stats()

    assert "size" in stats
    assert "maxsize" in stats
    assert "ttl" in stats

    assert stats["size"] == 0  # Empty initially
    assert stats["maxsize"] == 1000
    assert stats["ttl"] == 3600


def test_clear_cache():
    """Clear entity cache."""
    # Cache should be empty initially
    stats_before = get_cache_stats()
    assert stats_before["size"] == 0

    # Clear cache (should work even if empty)
    clear_entity_cache()

    # Verify cache is still empty
    stats_after = get_cache_stats()
    assert stats_after["size"] == 0


# ============================================================================
# Tests: Edge Cases
# ============================================================================


def test_string_similarity_empty_strings():
    """Handle empty strings."""
    score1 = _string_similarity("", "Harry Potter")
    score2 = _string_similarity("Harry Potter", "")

    assert score1 < 0.5
    assert score2 < 0.5


def test_string_similarity_special_characters():
    """Handle special characters."""
    score = _string_similarity("Book & Test", "Book and Test")
    assert score > 0.8  # Should still match well


# ============================================================================
# LLM Integration Tests (Only run if GROQ_API_KEY is set)
# ============================================================================


@pytest.mark.skipif(
    "GROQ_API_KEY" not in os.environ,
    reason="LLM tests require GROQ_API_KEY environment variable",
)
def test_extract_book_entities_basic():
    """Test basic entity extraction (requires GROQ_API_KEY)."""
    from apps.api.core.entity_extraction import extract_book_entities
    from bookdb.models.chatbot_llm import create_groq_client_sync

    if "GROQ_API_KEY" not in os.environ:
        pytest.skip("GROQ_API_KEY not set")

    client = create_groq_client_sync()
    result = extract_book_entities("I love Harry Potter", client=client)

    assert "book_titles" in result
    assert "author_names" in result
    assert "confidence" in result


@pytest.mark.skipif(
    "GROQ_API_KEY" not in os.environ,
    reason="LLM tests require GROQ_API_KEY environment variable",
)
def test_extract_book_entities_empty_query():
    """Handle empty queries (requires GROQ_API_KEY)."""
    from apps.api.core.entity_extraction import extract_book_entities
    from bookdb.models.chatbot_llm import create_groq_client_sync

    if "GROQ_API_KEY" not in os.environ:
        pytest.skip("GROQ_API_KEY not set")

    client = create_groq_client_sync()
    result = extract_book_entities("", client=client)

    assert result.get("book_titles", []) == []
    assert result.get("author_names", []) == []
    # Low confidence for empty query
    assert result.get("confidence", 0) < 0.5


# ============================================================================
# Tests: Context Generation
# ============================================================================


def test_get_book_context_without_db_session():
    """Generate context without database session."""
    from apps.api.core.entity_extraction import get_book_context_string
    from bookdb.db.models import Book

    book = Book(
        id=1,
        goodreads_id=100,
        title="Test Book",
        description="Test description",
    )
    context = get_book_context_string(book, 0.8)

    assert "TITLE: Test Book" in context
    assert "DESCRIPTION: Test description" in context

high

The test coverage for this new feature is incomplete. While the existing tests are a good start, they don't cover critical database-dependent functionality like find_books_by_title, find_authors_by_name, or the main resolve_entities function. The documentation in docs/entity-extraction.md mentions a much more extensive test suite (41 tests), which suggests that more testing is intended. Please add tests for the fuzzy lookup and entity resolution logic to ensure the feature is robust and reliable.

# ============================================================================

_ENTITY_EXTRACTION_MODEL = "meta-llama/llama-4-scout-17b-16e-instruct"
_ENTITY_EXTRACTION_RETRIES = 2

medium

The constant _ENTITY_EXTRACTION_RETRIES is defined but never used. It seems the retry logic is handled elsewhere with a different setting. To avoid confusion and dead code, this line should be removed.

Comment on lines +379 to +381
full_book = db.scalar(select(Book).where(Book.id == book.id))
if full_book:
    book = full_book

medium

When fetching the full book object, you can eagerly load the authors and tags relationships to avoid N+1 queries when they are accessed later. The current approach may issue separate database queries for authors and for tags. This change requires importing selectinload from sqlalchemy.orm.

        full_book = db.scalar(
            select(Book)
            .where(Book.id == book.id)
            .options(
                selectinload(Book.authors).selectinload(BookAuthor.author),
                selectinload(Book.tags).selectinload(BookTag.tag),
            )
        )
        if full_book:
            book = full_book

Comment on lines +312 to +322
# Choose prompt based on whether we have entity context
prompt = (
    BOOK_DESCRIPTION_WITH_CONTEXT_PROMPT
    if entity_context
    else BOOK_DESCRIPTION_PROMPT
)

# Build system message
system_content = prompt
if entity_context:
    system_content = system_content.format(entity_context=entity_context)

medium

This block of code for preparing the system prompt is a duplicate of the logic in the async version _rewrite_description (lines 203-213). To improve maintainability and reduce redundancy, consider extracting this logic into a shared helper function.

For example, you could create a function like this:

def _prepare_description_prompt(entity_context: Optional[str] = None) -> str:
    """Prepare the system prompt for description rewriting."""
    prompt = (
        BOOK_DESCRIPTION_WITH_CONTEXT_PROMPT
        if entity_context
        else BOOK_DESCRIPTION_PROMPT
    )
    if entity_context:
        return prompt.format(entity_context=entity_context)
    return prompt

Then, both _rewrite_description and _rewrite_description_sync could be simplified by replacing this block with system_content = _prepare_description_prompt(entity_context).

Owner

@yamirghofran yamirghofran left a comment


Looks good. Great work.

@yamirghofran
Owner

Tests pass
[Screenshot attached: CleanShot 2026-03-16 at 10 23 15@2x]

@yamirghofran
Owner

You can merge.

@yamirghofran yamirghofran merged commit e56fd09 into dev Mar 19, 2026
4 checks passed
@leaabj leaabj deleted the NER branch March 23, 2026 09:45