Skip to content

Sanitize user-controlled inputs in XML-structured LLM prompts to prevent injection attacks #11

@chigwell

Description

@chigwell

User Story
As a security-conscious developer,
I want to sanitize user-controlled inputs in prompt formatting
so that malicious XML tags can't disrupt LLM response parsing.

Background
The current user_prompt.format() in eknowledge/main.py directly inserts raw text into XML-structured LLM prompts. This allows injection of fake <node> entries through inputs containing XML syntax (e.g., "<node><from_node>HACK</from_node>"). The vulnerability exists in:

# main.py line 92:
HumanMessage(content=user_prompt.format(text=chunk, relationships=relations))

Attackers could manipulate knowledge graph outputs by poisoning text inputs with XML tags, potentially creating虚假 relationships or disrupting parsing logic.

Acceptance Criteria

  • Modify execute_graph_generation in eknowledge/main.py to sanitize text inputs
  • Replace special XML characters (<, >, &) with entities (&lt;, &gt;, &amp;) before string formatting
  • Add test case in tests/test_eknowledge.py that verifies:
    • Inputs containing <node>TEST</node> get converted to &lt;node&gt;TEST&lt;/node&gt; in prompts
    • LLM receives sanitized text that doesn't create unintended XML nodes
  • Ensure verbose mode logs show original vs sanitized text when enabled
  • Maintain existing chunk processing performance (add benchmark assertion if missing)

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions