
Feat/extract metadata #851

Merged
keeees merged 6 commits into develop from feat/extract-metadata
Apr 10, 2026

Conversation


@lanceyq lanceyq commented Apr 9, 2026


Summary by Sourcery

Add an asynchronous pipeline to extract and persist user metadata and aliases from user-related statements, and wire it into the post-dedup knowledge extraction and search stack.

New Features:

  • Introduce a Celery task to extract, merge, and persist user metadata and aliases, including Neo4j alias synchronization.
  • Add a metadata extraction engine that collects user-related statements from the graph and calls a dedicated LLM prompt to produce structured user metadata.
  • Define dedicated user metadata Pydantic models and a new LLM prompt template for metadata extraction.

Bug Fixes:

  • Fix alias update logic for end_user_info creation by avoiding writing empty metadata records and centralizing alias handling in the metadata task.

Enhancements:

  • Wire metadata extraction into the extraction orchestrator so user-related statements trigger the new async task instead of inline alias syncing.
  • Propagate speaker information through chunks and statements into Neo4j to distinguish user vs assistant content.
  • Improve Neo4j search APIs and Cypher queries to use a named query parameter and escape Lucene special characters, including '/', to reduce search errors.
  • Ensure special user/assistant entities in Neo4j reuse stable node IDs per end user to avoid duplicates.
  • Extend performance logging and apply minor log-message cleanups across search and graph operations.

lanceyq added 4 commits April 9, 2026 11:01
- Add MetadataExtractor to collect user-related statements post-dedup
  and extract profile/behavioral metadata via independent LLM call
- Add Celery task (extract_user_metadata) routed to memory_tasks queue
- Add metadata models (UserMetadata, UserMetadataProfile, etc.)
- Add metadata utility functions (clean, validate, merge with _op support)
- Add Jinja2 prompt template for metadata extraction (zh/en)
- Fix Lucene query parameter naming: rename `q` to `query` across all
  Cypher queries, graph_search functions, and callers
- Escape `/` in Lucene queries to prevent TokenMgrError
- Add `speaker` field to ChunkNode and persist it in Neo4j
- Remove unused imports (argparse, os, UUID) in search.py
- Fix unnecessary db context nesting in interest distribution task
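
As a rough illustration of the post-dedup collection step described in the first commit, a minimal sketch (the node and edge field names here are assumptions taken from the reviewer's guide class diagram, not the actual model definitions):

```python
from dataclasses import dataclass
from typing import List

# Minimal stand-ins for the graph model classes; field names are assumptions
# based on the class diagram in this PR, not the real models.
@dataclass
class ExtractedEntityNode:
    id: str
    name: str
    entity_type: str

@dataclass
class StatementNode:
    id: str
    statement: str
    speaker: str

@dataclass
class StatementEntityEdge:
    source: str  # statement node id
    target: str  # entity node id

# Assumed canonical names for the special user entity.
USER_ENTITY_NAMES = {"用户", "user"}

def collect_user_related_statements(
    entity_nodes: List[ExtractedEntityNode],
    statement_nodes: List[StatementNode],
    edges: List[StatementEntityEdge],
) -> List[str]:
    """Return the text of statements linked to the special user entity."""
    user_ids = {e.id for e in entity_nodes if e.name.lower() in USER_ENTITY_NAMES}
    linked = {edge.source for edge in edges if edge.target in user_ids}
    return [s.statement for s in statement_nodes if s.id in linked]
```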
…sed merge

- Remove merge_metadata and its helper functions from metadata_utils.py
- Pass existing_metadata to MetadataExtractor.extract_metadata() as LLM context
- Add merge instructions to extract_user_metadata.jinja2 prompt (zh/en)
- Update Celery task to read existing metadata before extraction and overwrite
- Simplify field descriptions in UserMetadataProfile model
- Add _update_timestamps helper to track changed fields
…licate user entity nodes

- Merge alias add/remove into MetadataExtractionResponse and Celery metadata task,
  removing the separate sync step from extraction_orchestrator
- Replace first-person pronouns ("我") with "用户" in statement extraction to
  preserve identity semantics for downstream metadata/alias extraction
- Update extract_statement.jinja2 prompt to enforce "用户" as subject for user
  statements instead of resolving to real names
- Add alias change instructions (aliases_to_add/aliases_to_remove) to
  extract_user_metadata.jinja2 with incremental merge logic
- Deduplicate special entities ("用户", "AI助手") in graph_saver by reusing
  existing Neo4j node IDs per end_user_id
- Sync final aliases from PgSQL to Neo4j user entity nodes after metadata write
…metadata utils

- Remove _replace_first_person_with_user from StatementExtractor to preserve
  original user text for downstream metadata/alias extraction
- Delete metadata_utils.py module, inline clean_metadata into Celery task
- Remove unused imports and commented-out collect_user_raw_messages method
- Apply formatting cleanup across metadata models and extraction orchestrator
@lanceyq lanceyq requested a review from keeees April 9, 2026 16:30

sourcery-ai bot commented Apr 9, 2026


Reviewer's Guide

Adds an asynchronous user metadata extraction pipeline powered by a new LLM-based MetadataExtractor, wires it into the extraction orchestrator and Celery, enriches Neo4j graph modeling (speaker, user/assistant canonicalization, Lucene escaping), and refactors keyword search and alias handling so user metadata and aliases are maintained consistently across Postgres and Neo4j.

Sequence diagram for async user metadata extraction pipeline

sequenceDiagram
    actor User
    participant Orchestrator as ExtractionOrchestrator
    participant MetadataExtractor as MetadataExtractor
    participant CeleryBroker as CeleryBroker
    participant CeleryWorker as CeleryWorker
    participant PG as PostgresDB
    participant EndUserRepo as EndUserRepository
    participant EndUserInfoRepo as EndUserInfoRepository
    participant MemoryConfigService as MemoryConfigService
    participant MemoryClientFactory as MemoryClientFactory
    participant LLMClient as LLMClient
    participant Neo4j as Neo4jConnector

    User->>Orchestrator: run()
    Orchestrator->>MetadataExtractor: collect_user_related_statements(entity_nodes, statement_nodes, statement_entity_edges)
    MetadataExtractor-->>Orchestrator: user_statements

    alt has_user_statements
        Orchestrator->>CeleryBroker: enqueue extract_user_metadata_task(end_user_id, statements, config_id, language)
        CeleryBroker-->>CeleryWorker: dispatch extract_user_metadata_task

        CeleryWorker->>PG: get_db_context()
        CeleryWorker->>EndUserRepo: get_by_id(end_user_id)
        EndUserRepo-->>CeleryWorker: end_user
        CeleryWorker->>MemoryConfigService: get_config_with_fallback(memory_config_id, workspace_id)
        MemoryConfigService-->>CeleryWorker: memory_config
        CeleryWorker->>MemoryClientFactory: get_llm_client(llm_id)
        MemoryClientFactory-->>CeleryWorker: LLMClient

        CeleryWorker->>EndUserInfoRepo: get_by_end_user_id(end_user_id)
        EndUserInfoRepo-->>CeleryWorker: existing_info(meta_data, aliases)

        CeleryWorker->>MetadataExtractor: extract_metadata(statements, existing_metadata, existing_aliases)
        MetadataExtractor->>LLMClient: response_structured(prompt, MetadataExtractionResponse)
        LLMClient-->>MetadataExtractor: MetadataExtractionResponse(user_metadata, aliases_to_add, aliases_to_remove)
        MetadataExtractor-->>CeleryWorker: user_metadata, aliases_to_add, aliases_to_remove

        CeleryWorker->>CeleryWorker: clean_metadata(user_metadata)
        CeleryWorker->>CeleryWorker: _update_timestamps(existing_meta, cleaned_meta, updated_at, now)
        CeleryWorker->>PG: update EndUserInfo.meta_data, EndUserInfo.aliases, EndUser.other_name
        PG-->>CeleryWorker: commit

        CeleryWorker->>Neo4j: execute_query(MATCH ExtractedEntity user placeholders SET aliases)
        Neo4j-->>CeleryWorker: ok

        CeleryWorker-->>CeleryBroker: task result(status=SUCCESS)
        CeleryBroker-->>Orchestrator: async completion (logged only)
    else no_user_statements
        Orchestrator->>Orchestrator: skip metadata extraction
    end

Class diagram for metadata extraction and related models

classDiagram
    class MetadataExtractor {
        - llm_client
        - language : str
        + MetadataExtractor(llm_client, language : str)
        + detect_language(statements : List~str~) str
        + collect_user_related_statements(entity_nodes : List~ExtractedEntityNode~, statement_nodes : List~StatementNode~, statement_entity_edges : List~StatementEntityEdge~) List~str~
        + extract_metadata(statements : List~str~, existing_metadata : dict, existing_aliases : List~str~) tuple
    }

    class ExtractedEntityNode {
        + id : str
        + name : str
        + entity_type : str
        + end_user_id : str
    }

    class StatementNode {
        + id : str
        + statement : str
        + speaker : str
        + end_user_id : str
    }

    class StatementEntityEdge {
        + source : str
        + target : str
    }

    class UserMetadataProfile {
        + role : str
        + domain : str
        + expertise : List~str~
        + interests : List~str~
    }

    class UserMetadataBehavioralHints {
        + learning_stage : str
        + preferred_depth : str
        + tone_preference : str
    }

    class UserMetadata {
        + profile : UserMetadataProfile
        + behavioral_hints : UserMetadataBehavioralHints
        + knowledge_tags : List~str~
    }

    class MetadataExtractionResponse {
        + user_metadata : UserMetadata
        + aliases_to_add : List~str~
        + aliases_to_remove : List~str~
    }

    class ChunkNode {
        + id : str
        + dialog_id : str
        + content : str
        + speaker : str
        + chunk_embedding : List~float~
        + sequence_number : int
        + metadata : dict
    }

    MetadataExtractor --> UserMetadata : returns
    MetadataExtractionResponse --> UserMetadata : contains
    UserMetadata --> UserMetadataProfile : has
    UserMetadata --> UserMetadataBehavioralHints : has

    MetadataExtractor --> ExtractedEntityNode : reads
    MetadataExtractor --> StatementNode : reads
    MetadataExtractor --> StatementEntityEdge : reads

    StatementNode --> ChunkNode : derived_from

    class EndUserInfo {
        + end_user_id : str
        + other_name : str
        + aliases : List~str~
        + meta_data : dict
    }

    class EndUser {
        + id : str
        + workspace_id : str
        + other_name : str
    }

    EndUserInfo --> EndUser : references
    MetadataExtractor --> MetadataExtractionResponse : uses
    EndUserInfo --> UserMetadata : stores_in_meta_data

File-Level Changes

Change Details Files
Introduce LLM-based user metadata extraction models, prompt, and Celery task, and wire it into the extraction pipeline after graph deduplication.
  • Add new Pydantic models for user metadata and metadata extraction responses, including profile, behavioral hints, and knowledge tags.
  • Create a dedicated Jinja2 prompt template for extracting user metadata and alias deltas (add/remove) based on user statements and existing metadata/aliases.
  • Implement MetadataExtractor to collect user-related statements from graph nodes/edges and call the LLM to produce structured metadata and alias changes.
  • Add a new Celery task extract_user_metadata_task that loads LLM config with workspace fallback, calls MetadataExtractor, cleans and merges metadata, manages per-field _updated_at timestamps, updates end_user_info/end_user aliases and other_name, and synchronizes aliases back to Neo4j.
  • Register the new Celery task on the memory_tasks queue and trigger it from the extraction orchestrator in non-pilot runs using user-related statements collected post-dedup.
api/app/core/memory/models/metadata_models.py
api/app/core/memory/utils/prompt/prompts/extract_user_metadata.jinja2
api/app/core/memory/storage_services/extraction_engine/knowledge_extraction/metadata_extractor.py
api/app/tasks.py
api/app/celery_app.py
api/app/core/memory/storage_services/extraction_engine/extraction_orchestrator.py
api/app/core/memory/models/__init__.py
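
The per-field `_updated_at` bookkeeping mentioned for the Celery task could look roughly like this (a minimal sketch; the function name and timestamp representation are assumptions, not the actual implementation in api/app/tasks.py):

```python
from typing import Any, Dict

def update_timestamps(
    existing: Dict[str, Any],
    cleaned: Dict[str, Any],
    timestamps: Dict[str, str],
    now: str,
) -> Dict[str, str]:
    """Stamp `now` onto every top-level field whose value changed.

    Unchanged fields keep their previous timestamp, so consumers can tell
    which parts of the merged metadata are fresh.
    """
    stamped = dict(timestamps)
    for field, value in cleaned.items():
        if existing.get(field) != value:
            stamped[field] = now
    return stamped
```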
Normalize user/assistant handling in statement extraction and graph storage, adding speaker propagation and special handling of user entities and pronouns.
  • Extend ChunkNode, Cypher insert queries, and _create_nodes_and_edges to carry a speaker field from chunks into Neo4j Chunk and Statement nodes.
  • Adjust statement extraction to fetch speaker at the start of processing a chunk and use it consistently for created Statement objects.
  • Change statement extraction prompt to always use the generic subject 用户 (user) for user utterances and 助手/AI助手 for assistant utterances, with examples updated accordingly.
  • Preprocess entity_nodes before Neo4j save to reuse IDs for special entities like 用户/我/user/i and AI 助手/assistant per end_user, updating all affected edges to avoid duplicate special nodes.
api/app/core/memory/models/graph_models.py
api/app/repositories/neo4j/cypher_queries.py
api/app/core/memory/storage_services/extraction_engine/extraction_orchestrator.py
api/app/core/memory/storage_services/extraction_engine/knowledge_extraction/statement_extraction.py
api/app/core/memory/utils/prompt/prompts/extract_statement.jinja2
api/app/repositories/neo4j/graph_saver.py
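
The special-entity deduplication step might be sketched as follows (a hedged illustration: the name variants and the dict-shaped nodes/edges are assumptions for this example, while the real graph_saver code operates on the graph model classes):

```python
from typing import Dict, List, Optional

# Assumed name variants for the two special entities, per the PR description.
USER_NAMES = {"用户", "我", "user", "i"}
ASSISTANT_NAMES = {"ai助手", "ai 助手", "assistant"}

def canonical_group(name: str) -> Optional[str]:
    key = name.strip().lower()
    if key in USER_NAMES:
        return "user"
    if key in ASSISTANT_NAMES:
        return "assistant"
    return None

def dedupe_special_entities(
    entity_nodes: List[dict],
    edges: List[dict],
    existing_ids: Dict[str, str],  # e.g. {"user": "<stable Neo4j node id>"}
) -> None:
    """Rewrite freshly generated IDs of special user/assistant entities to the
    stable per-end-user IDs already stored in Neo4j, updating edge endpoints
    in place so no duplicate special node is created on save."""
    remap: Dict[str, str] = {}
    for node in entity_nodes:
        group = canonical_group(node["name"])
        if group in existing_ids and node["id"] != existing_ids[group]:
            remap[node["id"]] = existing_ids[group]
            node["id"] = existing_ids[group]
    for edge in edges:
        edge["source"] = remap.get(edge["source"], edge["source"])
        edge["target"] = remap.get(edge["target"], edge["target"])
```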
Refine Neo4j keyword/perceptual search APIs to use escaped query parameters and a consistent query name, improving robustness with special characters.
  • Rename search_graph and search_perceptual parameter q to query across the search layer and update all call sites in hybrid search, keyword search, and perceptual retrieval nodes.
  • Update all Cypher full-text search queries to use parameter $query instead of $q and adjust alias ranking logic consistently.
  • Use escape_lucene_query to pre-escape user query strings in search_graph, search_graph_by_keyword_temporal, perceptual search, and entity name queries to prevent Lucene TokenMgrError (including '/' handling).
  • Enhance escape_lucene_query to escape the '/' character in addition to existing Lucene special tokens.
api/app/repositories/neo4j/graph_search.py
api/app/repositories/neo4j/cypher_queries.py
api/app/core/memory/src/search.py
api/app/core/memory/storage_services/search/keyword_search.py
api/app/core/memory/agent/langgraph_graph/nodes/perceptual_retrieve_node.py
api/app/core/memory/utils/data/text_utils.py
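
A plausible sketch of the enhanced escape_lucene_query (the exact character set and implementation in api/app/core/memory/utils/data/text_utils.py may differ):

```python
import re

# Lucene query-syntax tokens, now including '/', whose presence in raw
# user input previously raised TokenMgrError inside the full-text index.
_LUCENE_SPECIALS = re.compile(r'(&&|\|\||[+\-!(){}\[\]^"~*?:\\/])')

def escape_lucene_query(query: str) -> str:
    """Backslash-escape Lucene syntax so user input is matched literally."""
    return _LUCENE_SPECIALS.sub(r"\\\1", query)
```

The search layer would then pass the escaped string as the `$query` parameter of the full-text Cypher calls.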
Adjust existing interest initialization and alias-sync logic to delegate alias updates to the new metadata task and add minor logging/result field tweaks.
  • Remove the database session wrapper around interest distribution cache initialization loop, keeping asynchronous cache operations and counters intact.
  • Add end_user_details to the incremental clustering task’s success payload for richer reporting.
  • Remove direct alias/meta_data creation in _update_end_user_other_name, since alias synchronization is now handled by the metadata extraction flow.
  • Perform minor logging/message cleanups (consistent string literals, simplified PERF logs, warning messages).
api/app/tasks.py
api/app/core/memory/storage_services/extraction_engine/extraction_orchestrator.py
api/app/core/memory/src/search.py
api/app/repositories/neo4j/graph_search.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help


@sourcery-ai sourcery-ai bot left a comment



Hey - I've found 1 issue, and left some high level feedback:

  • In extract_user_metadata_task.clean_metadata, the dict comprehension calls clean_metadata(v) multiple times for the same value (both in the value expression and the filter), which is inefficient and can be error‑prone; consider computing cleaned = clean_metadata(v) once per key and reusing it in both places.
  • In MetadataExtractor.extract_metadata, the prompt language is determined solely by detect_language(statements) and ignores the language passed into the extractor; if the caller explicitly configures language, you may want to respect that (or at least make the precedence clear) to avoid surprising behavior.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `extract_user_metadata_task.clean_metadata`, the dict comprehension calls `clean_metadata(v)` multiple times for the same value (both in the value expression and the filter), which is inefficient and can be error‑prone; consider computing `cleaned = clean_metadata(v)` once per key and reusing it in both places.
- In `MetadataExtractor.extract_metadata`, the prompt language is determined solely by `detect_language(statements)` and ignores the `language` passed into the extractor; if the caller explicitly configures `language`, you may want to respect that (or at least make the precedence clear) to avoid surprising behavior.

## Individual Comments

### Comment 1
<location path="api/app/core/memory/storage_services/extraction_engine/knowledge_extraction/metadata_extractor.py" line_range="34-43" />
<code_context>
+    def __init__(self, llm_client, language: str = "zh"):
</code_context>
<issue_to_address>
**question:** Clarify or align the `language` constructor argument with the dynamic language detection logic.

`__init__` stores `language`, but `extract_metadata` always uses `detect_language(statements)` instead of `self.language`, so the constructor argument is effectively ignored. Consider either using `self.language` when set (and only auto-detecting when it is `None`) or removing the `language` parameter if behavior should always be auto-detected, so the API is less confusing for callers.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

lanceyq added 2 commits April 10, 2026 00:42
…ogic

- Make MetadataExtractor language param optional (default None) to
  support auto-detection fallback when no language is explicitly set
- Refactor clean_metadata from walrus-operator dict comprehension to
  explicit loop for correctness and readability
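
These two fixes might look roughly like the following (a hedged sketch, not the merged code; `detect_language` here is a naive stand-in for whatever detection the extractor actually uses):

```python
from typing import Any, Dict, List, Optional

_EMPTY = (None, "", [], {})

def clean_metadata(value: Any) -> Any:
    """Recursively drop None, empty strings, and empty containers.

    Written as an explicit loop so clean_metadata(v) runs exactly once per
    key, instead of twice as in the original dict comprehension.
    """
    if isinstance(value, dict):
        result: Dict[str, Any] = {}
        for key, item in value.items():
            cleaned = clean_metadata(item)
            if cleaned not in _EMPTY:
                result[key] = cleaned
        return result
    if isinstance(value, list):
        return [c for c in (clean_metadata(item) for item in value) if c not in _EMPTY]
    return value

def detect_language(statements: List[str]) -> str:
    """Naive stand-in detector: 'zh' if any CJK character appears, else 'en'."""
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for s in statements for ch in s)
    return "zh" if has_cjk else "en"

def resolve_language(configured: Optional[str], statements: List[str]) -> str:
    """Respect an explicitly configured language; auto-detect only when None."""
    return configured if configured is not None else detect_language(statements)
```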
@keeees keeees merged commit 58d18b4 into develop Apr 10, 2026
1 check passed
@lanceyq lanceyq deleted the feat/extract-metadata branch April 17, 2026 11:09