feat(knowledge): implement /kb import pipeline with multi-format parsing #2484
lukangyu wants to merge 47 commits into agentscope-ai:main
Hi @lukangyu, thank you for your first Pull Request! 🎉 🙌
Join Developer Community
Thanks so much for your contribution! We'd love to invite you to join the official CoPaw developer group! You can find the Discord and DingTalk group links under the "Developer Community" section on our docs page. We truly appreciate your enthusiasm, and look forward to your future contributions! 😊 We'll review your PR soon.
Code Review
This pull request introduces a comprehensive Knowledge Base (KB) system, allowing users to import and search documents across various formats including PDF, Office, and Markdown. Key additions include a multi-engine parsing pipeline with optional Docling support, a filesystem-based repository for document management, and a lexical search tool integrated into the agent's reasoning loop. Feedback focuses on improving system observability by logging suppressed exceptions in the repository and service layers. Recommendations were also made to consolidate duplicate data models and refine the file-type metadata mapping to ensure more accurate document processing.
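The observability feedback above (log exceptions that are currently suppressed) can be illustrated with a minimal sketch. The function name `load_document_meta`, the logger name, and the JSON metadata layout are assumptions made for illustration and are not taken from the PR's repository code.

```python
import json
import logging
from pathlib import Path
from typing import Optional

# Hypothetical logger name; the PR's actual module layout may differ.
logger = logging.getLogger("knowledge.repository")


def load_document_meta(path: Path) -> Optional[dict]:
    """Return parsed metadata, or None if the file is missing or unreadable."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # The error is still suppressed for the caller, but it is no longer
        # silent: the traceback is recorded so bad files can be diagnosed.
        # json.JSONDecodeError and UnicodeDecodeError are ValueError subclasses.
        logger.warning("Skipping unreadable metadata file %s", path, exc_info=True)
        return None
```

The caller's behavior is unchanged (a bad file is skipped), while the log now records which file failed and why.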
Pull request overview
Implements an end-to-end Knowledge Base (KB) import + search workflow, including /kb command handling, backend import/search services, multi-format parsing, API endpoints, and supporting docs/tests.
Changes:
- Add /kb command dispatch path to import current-message attachments into a workspace-local KB.
- Introduce knowledge import/search backend modules (repository, parsers, services) plus /knowledge/* API routes and an agent knowledge_search tool.
- Update Console command suggestions/i18n and docs (EN/ZH), and add unit tests covering dispatch/parsers/import/search/router.
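As a rough illustration of the /kb command detection listed above, here is a minimal sketch. `KbCommand` and `parse_kb_command` are hypothetical names, and treating a bare /kb as an import request is an assumption; the PR's command_dispatch.py may behave differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KbCommand:
    action: str  # only "import" is modeled in this sketch
    args: str    # free text following the subcommand, if any


def parse_kb_command(text: str) -> Optional[KbCommand]:
    """Return a KbCommand for `/kb` or `/kb import ...`, else None."""
    parts = text.strip().split(maxsplit=2)
    # Require an exact "/kb" token so "/kbx" or mid-message mentions are ignored.
    if not parts or parts[0].lower() != "/kb":
        return None
    if len(parts) == 1:
        # Assumption: a bare "/kb" is shorthand for "/kb import".
        return KbCommand("import", "")
    if parts[1].lower() == "import":
        return KbCommand("import", parts[2] if len(parts) > 2 else "")
    # Unknown subcommands fall through to normal message handling.
    return None
```

A dispatcher built on this would call the import service with the current message's attachments only when the function returns a command, and otherwise pass the message to the agent unchanged.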
Reviewed changes
Copilot reviewed 53 out of 53 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| website/public/docs/console.zh.md | Document KB import usage and supported formats (ZH). |
| website/public/docs/console.en.md | Document KB import usage and supported formats (EN). |
| tests/unit/runner/test_command_dispatch_kb.py | Add unit tests for /kb command dispatch behavior. |
| tests/unit/knowledge/test_xlsx_parser.py | Add XLSX parser unit tests. |
| tests/unit/knowledge/test_search_service.py | Add knowledge search service unit tests. |
| tests/unit/knowledge/test_pptx_parser.py | Add PPTX parser unit tests. |
| tests/unit/knowledge/test_parser_dispatch.py | Add tests for parser resolution + Docling fallback strategy. |
| tests/unit/knowledge/test_knowledge_router.py | Add FastAPI router tests for /knowledge/search. |
| tests/unit/knowledge/test_import_service_xlsx.py | Add XLSX import pipeline test coverage. |
| tests/unit/knowledge/test_import_service_pptx.py | Add PPTX import pipeline test coverage. |
| tests/unit/knowledge/test_import_service_parser_fallback.py | Add import fallback test coverage when a parser fails. |
| tests/unit/knowledge/test_import_service_local_files.py | Add local-file import behavior tests (duplicates/unsupported/missing). |
| tests/unit/knowledge/test_import_service_docx.py | Add DOCX + DOC import tests (including soffice conversion path via monkeypatch). |
| tests/unit/knowledge/test_docx_parser.py | Add DOCX parser unit tests. |
| tests/unit/knowledge/test_docling_parser.py | Add Docling parser tests + supported suffix assertion. |
| tests/unit/knowledge/test_doc_parser.py | Add DOC(soffice) conversion bridge tests. |
| tests/unit/agents/tools/test_knowledge_search_tool.py | Add tests for the knowledge_search tool wrapper. |
| tests/unit/agents/test_knowledge_pipeline.py | Add tests for normalization + chunking behavior. |
| src/copaw/app/runner/command_dispatch.py | Add /kb command detection and attachment-to-import wiring. |
| src/copaw/app/routers/knowledge.py | Add /knowledge/import, /knowledge/search, /knowledge/documents endpoints. |
| src/copaw/app/routers/console.py | Extend upload response shape (adds upload_id, stored_name, size). |
| src/copaw/app/routers/agent_scoped.py | Mount knowledge router under agent-scoped API. |
| src/copaw/app/routers/__init__.py | Mount knowledge router in the main API router. |
| src/copaw/agents/tools/knowledge_search.py | Implement workspace-bound knowledge_search tool. |
| src/copaw/agents/tools/__init__.py | Export create_knowledge_search_tool. |
| src/copaw/agents/react_agent.py | Register knowledge_search tool in agent toolkit setup. |
| src/copaw/agents/md_files/zh/AGENTS.md | Prompt guidance: use knowledge_search for KB-grounded answers (ZH). |
| src/copaw/agents/md_files/en/AGENTS.md | Prompt guidance: use knowledge_search for KB-grounded answers (EN). |
| src/copaw/agents/knowledge/service.py | Implement import orchestration, dedupe, parsing fallback, persistence. |
| src/copaw/agents/knowledge/search_service.py | Implement lightweight lexical KB search + “listing query” fallback. |
| src/copaw/agents/knowledge/repository.py | Implement KB workspace layout, persistence, and listing helpers. |
| src/copaw/agents/knowledge/parsers/xlsx_parser.py | Add XLSX parsing via openpyxl. |
| src/copaw/agents/knowledge/parsers/text_parser.py | Add TXT parsing. |
| src/copaw/agents/knowledge/parsers/pptx_parser.py | Add PPTX parsing via python-pptx. |
| src/copaw/agents/knowledge/parsers/pdf_parser.py | Add PDF parsing via pypdf. |
| src/copaw/agents/knowledge/parsers/markdown_parser.py | Add Markdown parsing and title extraction. |
| src/copaw/agents/knowledge/parsers/docx_parser.py | Add DOCX parsing via python-docx with table extraction. |
| src/copaw/agents/knowledge/parsers/docling_parser.py | Add optional Docling-backed parser + suffix mapping. |
| src/copaw/agents/knowledge/parsers/doc_parser.py | Add DOC → DOCX conversion via LibreOffice soffice. |
| src/copaw/agents/knowledge/parsers/base.py | Add parser protocol + dispatch logic + engine selection env var. |
| src/copaw/agents/knowledge/parsers/__init__.py | Export parser registry and helpers. |
| src/copaw/agents/knowledge/normalizer.py | Add normalization (surrogate stripping, line cleanup). |
| src/copaw/agents/knowledge/models.py | Add Pydantic request/response models for import/search + ParsedDocument. |
| src/copaw/agents/knowledge/exceptions.py | Add knowledge-domain exception hierarchy. |
| src/copaw/agents/knowledge/chunker.py | Add chunking utility for KB documents. |
| src/copaw/agents/knowledge/__init__.py | Export knowledge service entry point. |
| pyproject.toml | Add required parsing dependencies + docling optional extra. |
| console/src/pages/Chat/index.tsx | Add /kb to command suggestions (frontend). |
| console/src/locales/zh.json | Add i18n strings for /kb and KB import UI text (ZH). |
| console/src/locales/en.json | Add i18n strings for /kb and KB import UI text (EN). |
| console/src/api/modules/chat.ts | Update upload response typing (adds upload_id, optional size). |
| README_zh.md | Add KB import usage and Docling engine note (ZH). |
| README.md | Add KB import usage and Docling engine setup instructions (EN). |
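The parser dispatch and fallback behavior described for parsers/base.py and the import service can be sketched roughly as follows. This is not the PR's implementation: the registries, the function names, and the environment variable `KB_PARSER_ENGINE` are all hypothetical, and only a plain-text parser is included to keep the example self-contained.

```python
import os
from pathlib import Path
from typing import Callable, Dict, List

Parser = Callable[[Path], str]


def parse_plain_text(path: Path) -> str:
    """Read a text-like file, replacing undecodable bytes."""
    return path.read_text(encoding="utf-8", errors="replace")


# Hypothetical registries keyed by lowercase file suffix.
BUILTIN_PARSERS: Dict[str, Parser] = {".txt": parse_plain_text, ".md": parse_plain_text}
OPTIONAL_PARSERS: Dict[str, Parser] = {}  # e.g. an optional engine, if installed


def candidate_parsers(path: Path, engine: str) -> List[Parser]:
    """Order parsers so the selected engine is tried first, the other second."""
    suffix = path.suffix.lower()
    builtin = BUILTIN_PARSERS.get(suffix)
    optional = OPTIONAL_PARSERS.get(suffix)
    ordered = [optional, builtin] if engine == "optional" else [builtin, optional]
    return [p for p in ordered if p is not None]


def parse_with_fallback(path: Path, engine: str = "builtin") -> str:
    """Try each candidate parser in order; raise only if all of them fail."""
    parsers = candidate_parsers(path, engine)
    if not parsers:
        raise ValueError(f"unsupported file type: {path.suffix!r}")
    errors: List[Exception] = []
    for parser in parsers:
        try:
            return parser(path)
        except Exception as exc:  # keep the failure, then try the next parser
            errors.append(exc)
    raise RuntimeError(f"all parsers failed for {path.name}: {errors}")


def engine_from_env() -> str:
    # Hypothetical variable name; the PR's actual env var is not shown here.
    return os.environ.get("KB_PARSER_ENGINE", "builtin")
```

Keeping the ordering logic in `candidate_parsers` separates engine selection from error handling, so a failure in the preferred engine degrades to the other parser instead of failing the whole import.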
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 53 out of 53 changed files in this pull request and generated 2 comments.
Pull request overview
Copilot reviewed 54 out of 54 changed files in this pull request and generated 4 comments.
Description
This PR implements the knowledge-base import workflow with a command-driven UX and multi-format parsing pipeline for Task 18 in the roadmap issue: #2291.
Key updates:
- /kb and /kb import command handling in runner command dispatch.
- Multi-format parsing: txt, md, pdf, docx, doc, pptx, xlsx.

Related Issue: Fixes #2396; Relates to #2291
Security Considerations:
No new auth model or secret handling introduced. Changes are scoped to import flow, parser dispatch, and existing API/channel boundaries.
Type of Change
Component(s) Affected
Checklist
- Ran pre-commit run --all-files locally and it passes
- Ran tests (pytest or as relevant) and they pass

Testing
- uv run pre-commit run --all-files
- uv run pytest tests/unit/knowledge tests/unit/agents/test_knowledge_pipeline.py tests/unit/agents/tools/test_knowledge_search_tool.py tests/unit/runner/test_command_dispatch_kb.py
- /kb import.

Local Verification Evidence
Additional Notes