feat(knowledge): implement /kb import pipeline with multi-format parsing #2484

Open
lukangyu wants to merge 47 commits into agentscope-ai:main from lukangyu:feat/console-kb-import

Conversation

@lukangyu

Description

This PR implements the knowledge-base import workflow with a command-driven UX and a multi-format parsing pipeline, covering Task 18 of the roadmap issue #2291.
Key updates:

  • Add /kb and /kb import command handling in runner command dispatch.
  • Add backend knowledge import/search modules and API router integration.
  • Add console-side integration for KB import from chat flow.
  • Add parser/pipeline support for txt, md, pdf, docx, doc, pptx, xlsx.
  • Align Docling suffix handling with officially supported format coverage.
  • Update EN/ZH docs for KB import usage and supported formats.
  • Add/restore unit tests for parser dispatch, import pipeline, router, command dispatch, and knowledge search tool.
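
For readers unfamiliar with suffix-based parser dispatch, the idea behind the parser/pipeline bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PARSERS`, `resolve_parser`, `parse_plain_text`, and `UnsupportedFormatError` are hypothetical names, and only the text-like formats are wired up.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical parser type: takes a file path, returns the extracted text.
Parser = Callable[[Path], str]


def parse_plain_text(path: Path) -> str:
    """Read a text-like file (txt, md) as UTF-8, replacing undecodable bytes."""
    return path.read_text(encoding="utf-8", errors="replace")


# Registry keyed by lowercase suffix. Only txt/md are registered in this
# sketch; pdf, docx, doc, pptx and xlsx would add their own entries.
PARSERS: Dict[str, Parser] = {
    ".txt": parse_plain_text,
    ".md": parse_plain_text,
}


class UnsupportedFormatError(ValueError):
    """Raised when no parser is registered for a file suffix."""


def resolve_parser(path: Path) -> Parser:
    """Pick a parser by file suffix, ignoring case."""
    suffix = path.suffix.lower()
    try:
        return PARSERS[suffix]
    except KeyError:
        raise UnsupportedFormatError(
            f"unsupported file type: {suffix or '<none>'}"
        ) from None
```

Keeping the registry as plain data means an optional engine can swap entries in or out without touching the dispatch function.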

Related Issue: Fixes #2396; Relates to #2291

Security Considerations:
No new auth model or secret handling introduced. Changes are scoped to import flow, parser dispatch, and existing API/channel boundaries.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • Refactoring

Component(s) Affected

  • Core / Backend (app, agents, config, providers, utils, local_models)
  • Console (frontend web UI)
  • Channels (DingTalk, Feishu, QQ, Discord, iMessage, etc.)
  • Skills
  • CLI
  • Documentation (website)
  • Tests
  • CI/CD
  • Scripts / Deploy

Checklist

  • I ran pre-commit run --all-files locally and it passes
  • If pre-commit auto-fixed files, I committed those changes and reran checks
  • I ran tests locally (pytest or as relevant) and they pass
  • Documentation updated (if needed)
  • Ready for review

Testing

  1. Run static and style checks:
    • uv run pre-commit run --all-files
  2. Run KB-related unit tests:
    • uv run pytest tests/unit/knowledge tests/unit/agents/test_knowledge_pipeline.py tests/unit/agents/tools/test_knowledge_search_tool.py tests/unit/runner/test_command_dispatch_kb.py
  3. Manual verification:
    • Start project and import files via /kb import.
    • Validate successful import and searchable KB responses across supported formats.

Local Verification Evidence

uv run pre-commit run --all-files
# Result: all hooks passed (mypy/black/flake8/pylint/prettier, etc.)

uv run pytest tests/unit/knowledge tests/unit/agents/test_knowledge_pipeline.py tests/unit/agents/tools/test_knowledge_search_tool.py tests/unit/runner/test_command_dispatch_kb.py
# Result: 45 passed

Additional Notes

lukangyu added 30 commits March 29, 2026 17:34
Copilot AI review requested due to automatic review settings March 29, 2026 09:53
@github-actions github-actions bot added the first-time-contributor PR created by a first time contributor label Mar 29, 2026
@github-actions

Welcome to CoPaw! 🐾

Hi @lukangyu, thank you for your first Pull Request! 🎉

🙌 Join Developer Community

Thanks so much for your contribution! We'd love to invite you to join the official CoPaw developer group! You can find the Discord and DingTalk group links under the "Developer Community" section on our docs page:
https://copaw.agentscope.io/docs/community

We truly appreciate your enthusiasm—and look forward to your future contributions! 😊

We'll review your PR soon.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive Knowledge Base (KB) system, allowing users to import and search documents across various formats including PDF, Office, and Markdown. Key additions include a multi-engine parsing pipeline with optional Docling support, a filesystem-based repository for document management, and a lexical search tool integrated into the agent's reasoning loop. Feedback focuses on improving system observability by logging suppressed exceptions in the repository and service layers. Recommendations were also made to consolidate duplicate data models and refine the file-type metadata mapping to ensure more accurate document processing.
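
The observability point above (logging exceptions that are currently suppressed) could look something like this. It is a sketch under assumptions: `load_document_metadata` is a hypothetical helper, and storing metadata as JSON is assumed here only for illustration, not taken from the PR.

```python
import json
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def load_document_metadata(path: Path) -> Optional[dict]:
    """Return parsed metadata, or None if the file is missing or corrupt.

    The failure is still tolerated, but it is logged with a traceback
    instead of being silently swallowed.
    """
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # ValueError covers both json.JSONDecodeError and UnicodeDecodeError.
        logger.warning("Skipping unreadable KB metadata: %s", path, exc_info=True)
        return None
```

The caller's behavior is unchanged (it still gets `None`), but a broken entry now leaves a trace in the logs.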

Contributor

Copilot AI left a comment

Pull request overview

Implements an end-to-end Knowledge Base (KB) import + search workflow, including /kb command handling, backend import/search services, multi-format parsing, API endpoints, and supporting docs/tests.

Changes:

  • Add /kb command dispatch path to import current-message attachments into a workspace-local KB.
  • Introduce knowledge import/search backend modules (repository, parsers, services) plus /knowledge/* API routes and an agent knowledge_search tool.
  • Update Console command suggestions/i18n and docs (EN/ZH), and add unit tests covering dispatch/parsers/import/search/router.
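
As a rough illustration of the first change, detecting a `/kb` command in an incoming chat message might be sketched as below. `parse_kb_command` and the `"help"` default are hypothetical; the actual dispatch in `command_dispatch.py` may be structured differently.

```python
from typing import Optional, Tuple


def parse_kb_command(message: str) -> Optional[Tuple[str, str]]:
    """Return (subcommand, remaining_args) if the message is a /kb command.

    Returns None for ordinary chat messages. A bare "/kb" maps to the
    hypothetical "help" subcommand.
    """
    parts = message.strip().split(maxsplit=2)
    # Require an exact "/kb" token so that "/kbx" or "hello /kb" do not match.
    if not parts or parts[0].lower() != "/kb":
        return None
    subcommand = parts[1].lower() if len(parts) > 1 else "help"
    args = parts[2] if len(parts) > 2 else ""
    return subcommand, args
```

A dispatcher would call this first and fall through to the normal agent path when it returns `None`.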

Reviewed changes

Copilot reviewed 53 out of 53 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
website/public/docs/console.zh.md Document KB import usage and supported formats (ZH).
website/public/docs/console.en.md Document KB import usage and supported formats (EN).
tests/unit/runner/test_command_dispatch_kb.py Add unit tests for /kb command dispatch behavior.
tests/unit/knowledge/test_xlsx_parser.py Add XLSX parser unit tests.
tests/unit/knowledge/test_search_service.py Add knowledge search service unit tests.
tests/unit/knowledge/test_pptx_parser.py Add PPTX parser unit tests.
tests/unit/knowledge/test_parser_dispatch.py Add tests for parser resolution + Docling fallback strategy.
tests/unit/knowledge/test_knowledge_router.py Add FastAPI router tests for /knowledge/search.
tests/unit/knowledge/test_import_service_xlsx.py Add XLSX import pipeline test coverage.
tests/unit/knowledge/test_import_service_pptx.py Add PPTX import pipeline test coverage.
tests/unit/knowledge/test_import_service_parser_fallback.py Add import fallback test coverage when a parser fails.
tests/unit/knowledge/test_import_service_local_files.py Add local-file import behavior tests (duplicates/unsupported/missing).
tests/unit/knowledge/test_import_service_docx.py Add DOCX + DOC import tests (including soffice conversion path via monkeypatch).
tests/unit/knowledge/test_docx_parser.py Add DOCX parser unit tests.
tests/unit/knowledge/test_docling_parser.py Add Docling parser tests + supported suffix assertion.
tests/unit/knowledge/test_doc_parser.py Add DOC(soffice) conversion bridge tests.
tests/unit/agents/tools/test_knowledge_search_tool.py Add tests for the knowledge_search tool wrapper.
tests/unit/agents/test_knowledge_pipeline.py Add tests for normalization + chunking behavior.
src/copaw/app/runner/command_dispatch.py Add /kb command detection and attachment-to-import wiring.
src/copaw/app/routers/knowledge.py Add /knowledge/import, /knowledge/search, /knowledge/documents endpoints.
src/copaw/app/routers/console.py Extend upload response shape (adds upload_id, stored_name, size).
src/copaw/app/routers/agent_scoped.py Mount knowledge router under agent-scoped API.
src/copaw/app/routers/__init__.py Mount knowledge router in the main API router.
src/copaw/agents/tools/knowledge_search.py Implement workspace-bound knowledge_search tool.
src/copaw/agents/tools/__init__.py Export create_knowledge_search_tool.
src/copaw/agents/react_agent.py Register knowledge_search tool in agent toolkit setup.
src/copaw/agents/md_files/zh/AGENTS.md Prompt guidance: use knowledge_search for KB-grounded answers (ZH).
src/copaw/agents/md_files/en/AGENTS.md Prompt guidance: use knowledge_search for KB-grounded answers (EN).
src/copaw/agents/knowledge/service.py Implement import orchestration, dedupe, parsing fallback, persistence.
src/copaw/agents/knowledge/search_service.py Implement lightweight lexical KB search + “listing query” fallback.
src/copaw/agents/knowledge/repository.py Implement KB workspace layout, persistence, and listing helpers.
src/copaw/agents/knowledge/parsers/xlsx_parser.py Add XLSX parsing via openpyxl.
src/copaw/agents/knowledge/parsers/text_parser.py Add TXT parsing.
src/copaw/agents/knowledge/parsers/pptx_parser.py Add PPTX parsing via python-pptx.
src/copaw/agents/knowledge/parsers/pdf_parser.py Add PDF parsing via pypdf.
src/copaw/agents/knowledge/parsers/markdown_parser.py Add Markdown parsing and title extraction.
src/copaw/agents/knowledge/parsers/docx_parser.py Add DOCX parsing via python-docx with table extraction.
src/copaw/agents/knowledge/parsers/docling_parser.py Add optional Docling-backed parser + suffix mapping.
src/copaw/agents/knowledge/parsers/doc_parser.py Add DOC → DOCX conversion via LibreOffice soffice.
src/copaw/agents/knowledge/parsers/base.py Add parser protocol + dispatch logic + engine selection env var.
src/copaw/agents/knowledge/parsers/__init__.py Export parser registry and helpers.
src/copaw/agents/knowledge/normalizer.py Add normalization (surrogate stripping, line cleanup).
src/copaw/agents/knowledge/models.py Add Pydantic request/response models for import/search + ParsedDocument.
src/copaw/agents/knowledge/exceptions.py Add knowledge-domain exception hierarchy.
src/copaw/agents/knowledge/chunker.py Add chunking utility for KB documents.
src/copaw/agents/knowledge/__init__.py Export knowledge service entry point.
pyproject.toml Add required parsing dependencies + docling optional extra.
console/src/pages/Chat/index.tsx Add /kb to command suggestions (frontend).
console/src/locales/zh.json Add i18n strings for /kb and KB import UI text (ZH).
console/src/locales/en.json Add i18n strings for /kb and KB import UI text (EN).
console/src/api/modules/chat.ts Update upload response typing (adds upload_id, optional size).
README_zh.md Add KB import usage and Docling engine note (ZH).
README.md Add KB import usage and Docling engine setup instructions (EN).
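
The normalizer and chunker rows in the summary above describe surrogate stripping, line cleanup, and chunking. A minimal sketch of those two steps is shown here, assuming fixed-size character windows with overlap; the function names, default sizes, and the chunking strategy itself are illustrative assumptions, not the PR's actual `normalizer.py` or `chunker.py`.

```python
import re
from typing import List

# Lone UTF-16 surrogates cannot be encoded as UTF-8 and break persistence.
_SURROGATES = re.compile(r"[\ud800-\udfff]")


def normalize_text(text: str) -> str:
    """Strip surrogates, unify newlines, trim line ends, collapse blank runs."""
    text = _SURROGATES.sub("", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    cleaned = "\n".join(line.rstrip() for line in text.split("\n"))
    # Collapse three or more consecutive newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()


def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> List[str]:
    """Split text into character windows that overlap by `overlap` chars."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks
```

Normalizing before chunking keeps chunk boundaries stable across re-imports of the same file, which also helps any duplicate detection that hashes the text.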

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 29, 2026 10:04
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
lukangyu and others added 2 commits March 29, 2026 18:05
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 53 out of 53 changed files in this pull request and generated 2 comments.


Copilot AI review requested due to automatic review settings March 29, 2026 16:25
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 54 out of 54 changed files in this pull request and generated 4 comments.


Labels

first-time-contributor PR created by a first time contributor

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Support importing current chat attachments into knowledge base in Console

2 participants