
[Feature]: Improve knowledge base chunking strategy (overlap + boundary-aware splitting) #277

@TanmayZade


Problem / motivation

The KB ingest endpoint (POST /kb/ingest) uses a fixed 800-character chunker with no overlap and no awareness of sentence or paragraph boundaries. Chunks can therefore end mid-sentence or even mid-word, which degrades to_tsvector full-text search relevance because PostgreSQL tokenizes and stems the incomplete fragments at the chunk edges.
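To make the failure mode concrete, here is a minimal sketch of what a fixed-size chunker does. It is not the project's actual ingest code: the function name `fixed_chunks` and the sample text are made up for illustration, and only the 800-character size comes from the description above.

```python
def fixed_chunks(text: str, size: int = 800) -> list[str]:
    """Cut text every `size` characters, ignoring word and sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample document: one 44-character sentence repeated 40 times.
sample = "Attackers frequently rotate infrastructure. " * 40
chunks = fixed_chunks(sample)

# Show the text on either side of the first cut.
print(repr(chunks[0][-20:]))
print(repr(chunks[1][:20]))
```

Because 800 is not a multiple of the sentence length, the first cut lands inside the word "Attackers", so neither chunk contains the complete word for the full-text index to stem.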

Proposed solution

  • Add a configurable chunk overlap (e.g., 100–200 chars) so context spans chunk boundaries.

  • Split on paragraph/sentence boundaries instead of hard character offsets (recursive chunking).

  • Future: Consider adding pgvector or leveraging the existing Qdrant instance for semantic vector search on KB documents (the services/threatintel service already uses Qdrant + BAAI/bge-small-en-v1.5).
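The overlap and boundary-aware splitting proposed above could look roughly like the sketch below. Everything in it is an assumption for discussion rather than existing project code: the names `split_units` and `chunk_text`, the default of 150 characters of overlap, and the regex-based sentence detection. Overlap is carried in whole sentences, so the actual overlap can be smaller than the configured value, or zero when the last sentence of a chunk is longer than the budget.

```python
import re
import textwrap

PARAGRAPH_BREAK = re.compile(r"\n\s*\n")
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_units(text: str, max_chars: int) -> list[str]:
    """Split into paragraphs, then sentences; wrap oversized sentences at word boundaries."""
    units: list[str] = []
    for paragraph in PARAGRAPH_BREAK.split(text):
        for sentence in SENTENCE_END.split(paragraph):
            sentence = " ".join(sentence.split())  # collapse internal whitespace
            if sentence:
                # A sentence that already fits comes back as a single piece;
                # longer ones are broken on whitespace (or hard-cut as a last resort).
                units.extend(textwrap.wrap(sentence, width=max_chars))
    return units


def chunk_text(text: str, max_chars: int = 800, overlap: int = 150) -> list[str]:
    """Greedily pack sentence units into chunks, repeating trailing sentences as overlap."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    chunks: list[str] = []
    current: list[str] = []  # units in the chunk being built
    for unit in split_units(text, max_chars):
        if current and len(" ".join(current + [unit])) > max_chars:
            chunks.append(" ".join(current))
            # Carry over as many trailing units as fit in the overlap budget.
            tail: list[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + tail)) > overlap:
                    break
                tail.insert(0, prev)
            current = tail
            # Drop overlap from the front if the new unit would not fit next to it.
            while current and len(" ".join(current + [unit])) > max_chars:
                current.pop(0)
        current.append(unit)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Small limits so the behaviour is easy to inspect by hand.
demo = "Alpha one. Beta two. Gamma three. Delta four.\n\nEpsilon five. Zeta six."
for chunk in chunk_text(demo, max_chars=30, overlap=12):
    print(repr(chunk))
```

The regex treats every ".", "!" or "?" followed by whitespace as a sentence end, so abbreviations such as "e.g." will be split too early. Paragraph breaks act only as split points here and are joined with a space inside a chunk. A real implementation might prefer a proper sentence tokenizer and keep paragraph structure.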

Alternatives considered

No response

Component area

Other

Checklist

  • I have searched existing issues and this is not a duplicate
  • This feature aligns with the AiSOC roadmap or is a reasonable addition

Metadata


Assignees

No one assigned

    Labels

    enhancement: New feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
