RAG Builder Skill

Author: Leo Yuan Tsao

English | 中文

An AI skill for building production-ready RAG (Retrieval-Augmented Generation) knowledge bases, with entity deduplication, domain-adaptive entity types, a self-learning type system, and interactive knowledge graph visualization.

Built on top of LightRAG.


✨ Features

🔄 Entity Deduplication

Automatically detect and merge duplicate entities using multiple strategies:

  • Substring matching: "Dante" + "Dante Alighieri" → merged
  • Levenshtein similarity: "Virgil" + "Virgilius" → merged
  • Context-based disambiguation: compare entity descriptions
  • Three merge strategies: conservative, smart (default), aggressive
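
The first two strategies can be sketched with the standard library alone. This is an illustrative sketch, not the code in entity_deduplicator.py: the function names, the lowercase normalization, and the scaling of edit distance to a 0.0-1.0 score are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance scaled to 0.0-1.0 (assumed scaling: 1 - distance / longest)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_duplicate_candidate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Flag a pair if one name contains the other or the names are similar enough."""
    x, y = a.strip().lower(), b.strip().lower()
    if x and y and (x in y or y in x):
        return True  # substring match, e.g. "Dante" inside "Dante Alighieri"
    return similarity(a, b) >= threshold
```

Under this particular scaling, "Virgil" and "Virgilius" score 1 - 3/9 (about 0.67), so with a 0.85 threshold the pair is caught by the substring rule rather than by edit distance. A bare substring test also over-matches very short names, which is one reason a context-based check on entity descriptions is useful as a second pass.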

🎯 Domain-Adaptive Entity Types

Automatically detect document domain and generate appropriate entity types:

| Domain | Example Types |
| --- | --- |
| Literature | Character, Location, Event, Literary_Work, Symbol, Creature |
| Medicine | Disease, Symptom, Treatment, Drug, Anatomy, Biomarker |
| Law | Case, Court, Judge, Party, Statute, Precedent |
| Business | Company, Product, Technology, Market, Strategy, Metric |
| Science | Theory, Experiment, Researcher, Method, Finding, Hypothesis |
| History | Person, Event, Location, Organization, Time_Period, Artifact |
| Technology | Technology, Product, Algorithm, Protocol, Platform, Tool |

Custom domains are auto-detected via LLM analysis.
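
One way to picture the lookup is a dictionary of predefined templates with a fallback for unknown domains. The sketch below is hypothetical: `DOMAIN_TYPES` and `entity_types_for` are names invented for illustration, only three of the listed domains are copied in, and returning `None` merely marks the point where LLM-based detection would take over.

```python
from typing import List, Optional

# Example templates copied from the domain list above (subset, for illustration).
DOMAIN_TYPES = {
    "literature": ["Character", "Location", "Event", "Literary_Work", "Symbol", "Creature"],
    "medicine": ["Disease", "Symptom", "Treatment", "Drug", "Anatomy", "Biomarker"],
    "law": ["Case", "Court", "Judge", "Party", "Statute", "Precedent"],
}


def entity_types_for(domain: str,
                     custom_types: Optional[List[str]] = None) -> Optional[List[str]]:
    """Prefer explicit custom types, then a predefined template, else None."""
    if custom_types:
        return list(custom_types)
    template = DOMAIN_TYPES.get(domain.strip().lower())
    # None means "no predefined template": a custom domain that would need
    # LLM analysis to produce its entity types.
    return list(template) if template is not None else None
```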

🧠 Self-Learning Type System

The system learns from every build:

  1. Captures entity types returned by the LLM
  2. Normalizes type names (book/work → Book_Work)
  3. Tracks usage frequency
  4. Auto-merges high-frequency types into templates
  5. Optimizes type lists for future builds
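
Steps 2 to 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic in entity_type_learner.py: `normalize_type`, `learn_types`, and the `min_count` promotion threshold are hypothetical names and values.

```python
import re
from collections import Counter
from typing import Iterable, List


def normalize_type(raw: str) -> str:
    """Split on any non-alphanumeric run and join capitalized parts with '_'."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", raw) if p]
    return "_".join(p.capitalize() for p in parts)


def learn_types(template: List[str], observed: Iterable[str],
                min_count: int = 3) -> List[str]:
    """Return the template plus any new type seen at least min_count times."""
    counts = Counter(normalize_type(t) for t in observed)  # usage frequency
    merged = list(template)  # copy, so the caller's template is not mutated
    known = set(template)
    for name, count in counts.most_common():
        if name and count >= min_count and name not in known:
            merged.append(name)
            known.add(name)
    return merged
```

For example, three sightings of "book/work" would promote `Book_Work` into a literature template, while a type seen only twice would stay out until it recurs. Note that `str.capitalize` lowercases the rest of each part, so an acronym such as "DNA" becomes "Dna" in this sketch.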

📊 Interactive Visualization

Auto-generated interactive knowledge graph viewer with:

  • Entity type color coding: each type has a distinct color
  • Click-to-inspect: click any node to see description, type, and connections
  • Type filter bar: filter nodes by entity type
  • Collapsible panels: left sidebar and right detail panel can be hidden
  • Search: find entities by name
  • Drag & zoom: rearrange and explore the graph

📁 Universal File Support

Extract text from 40+ file formats:

| Format | Extensions | Dependency |
| --- | --- | --- |
| PDF | .pdf | PyPDF2 |
| Word | .docx | python-docx |
| PowerPoint | .pptx | python-pptx |
| Excel | .xlsx .xls | openpyxl |
| EPUB | .epub | ebooklib |
| MOBI / AZW3 | .mobi .azw3 | Calibre or mobi |
| Plain text | .txt .md .json .csv .yaml .xml ... | None |
| Code | .py .js .ts .java .go .rs .cpp ... | None |

All dependencies are auto-installed on first use.
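
The format list above amounts to a lookup from file extension to the package that handles it. The sketch below shows that idea only; `required_dependency` and both tables are hypothetical names, it is not how extract_text.py is written, and it covers just the extensions spelled out above (the source list is longer).

```python
from pathlib import Path
from typing import Optional

# Extension -> dependency, as listed in the format table above.
DEPENDENCY_BY_EXTENSION = {
    ".pdf": "PyPDF2",
    ".docx": "python-docx",
    ".pptx": "python-pptx",
    ".xlsx": "openpyxl",
    ".xls": "openpyxl",
    ".epub": "ebooklib",
    ".mobi": "Calibre or mobi",
    ".azw3": "Calibre or mobi",
}

# Plain-text and code extensions that need no extra package.
NO_DEPENDENCY = {".txt", ".md", ".json", ".csv", ".yaml", ".xml",
                 ".py", ".js", ".ts", ".java", ".go", ".rs", ".cpp"}


def required_dependency(path: str) -> Optional[str]:
    """Return the package a file needs, or None if it can be read as plain text."""
    ext = Path(path).suffix.lower()
    if ext in DEPENDENCY_BY_EXTENSION:
        return DEPENDENCY_BY_EXTENSION[ext]
    if ext in NO_DEPENDENCY:
        return None
    raise ValueError(f"extension not covered by this sketch: {ext!r}")
```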


🚀 Installation

For Claude Code / Anthropic CLI

# Clone the repo
git clone https://github.com/yuancafe/rag-builder.git

# Copy skill to Claude's skill directory
cp -r rag-builder/skills/rag-builder ~/.claude/skills/

For StepFun / 小跃

cp -r rag-builder/skills/rag-builder ~/.stepfun/skills/

For Other AI Agents

cp -r rag-builder/skills/rag-builder ~/.agents/skills/

Prerequisites

pip install lightrag-hku networkx python-Levenshtein

Optional (auto-installed when needed):

pip install PyPDF2 python-docx python-pptx openpyxl ebooklib beautifulsoup4

📖 Usage

Quick Start (Python)

import os
import asyncio
import sys

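# Adjust this path to wherever the skill was installed (e.g. ~/.claude/skills/ or ~/.agents/skills/)
sys.path.insert(0, os.path.expanduser("~/.stepfun/skills/rag-builder/scripts"))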
from rag_builder_advanced import RAGBuilder

# Set your API key
os.environ["OPENAI_API_KEY"] = "your-api-key"
os.environ["OPENAI_BASE_URL"] = "https://api.openai.com/v1"  # or any OpenAI-compatible endpoint

async def main():
    builder = RAGBuilder(
        files=["document.pdf", "notes.md"],
        output_dir="./my_rag",
        auto_detect_domain=True,
        deduplicate_entities=True,
        generate_visualization=True,
    )
    await builder.build()

asyncio.run(main())

CLI Usage

# Build RAG
python3 rag_builder_advanced.py document.pdf -o ./my_rag --auto-detect

# Detect domain & generate entity types
python3 domain_detector.py document.pdf -n 12 -o domain.json

# Analyze duplicate entities
python3 entity_deduplicator.py ./my_rag/rag_database -t 0.85

# Generate visualization for existing RAG
python3 generate_visualization.py ./my_rag/rag_database -o ./my_rag/visualization

# Extract text from any file
python3 extract_text.py document.epub

Configuration Options

RAGBuilder(
    files=["doc1.pdf", "doc2.docx"],   # Input files (any supported format)
    output_dir="./my_rag",              # Output directory

    # Domain & Entity Types
    auto_detect_domain=True,            # Auto-detect domain via LLM
    custom_entity_types=None,           # Or specify: ["Person", "Location", ...]

    # Deduplication
    deduplicate_entities=True,          # Enable entity merging
    similarity_threshold=0.85,          # Similarity threshold (0.0–1.0)
    merge_strategy="smart",             # "conservative" | "smart" | "aggressive"

    # Visualization
    generate_visualization=True,        # Generate interactive viewer

    # API (any OpenAI-compatible endpoint)
    llm_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://api.openai.com/v1",
)

📂 Output Structure

my_rag/
├── rag_database/                    # LightRAG knowledge base
│   ├── vdb_entities.json            # Entity vectors
│   ├── vdb_relationships.json       # Relationship vectors
│   ├── vdb_chunks.json              # Text chunk vectors
│   ├── graph_chunk_entity_relation.graphml  # Knowledge graph
│   └── kv_store_*.json              # Key-value stores
├── visualization/                   # Interactive viewer
│   ├── index.html                   # Open this in browser
│   ├── graph_data.json              # Graph data
│   └── entity_details.json          # Entity descriptions
├── reports/
│   ├── deduplication_log.json       # Entity merge history
│   ├── learning_log.json            # Type learning results
│   └── entity_analysis.json         # Entity statistics
└── config.json                      # Build configuration
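
To sanity-check a finished build, the knowledge graph file can be inspected with the standard library. The sketch below assumes only that graph_chunk_entity_relation.graphml is standard GraphML (nodes and edges in the usual GraphML namespace); `summarize_graph` is a hypothetical helper, not part of the skill.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Standard GraphML XML namespace.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def summarize_graph(graphml_path) -> dict:
    """Count the nodes (entities) and edges (relationships) in a GraphML file."""
    root = ET.parse(str(graphml_path)).getroot()
    nodes = root.findall(".//g:node", GRAPHML_NS)
    edges = root.findall(".//g:edge", GRAPHML_NS)
    return {"nodes": len(nodes), "edges": len(edges)}


if __name__ == "__main__":
    path = Path("./my_rag/rag_database/graph_chunk_entity_relation.graphml")
    if path.exists():
        print(summarize_graph(path))
    else:
        print(f"No graph found at {path}; run a build first.")
```

Since networkx is already a prerequisite, `networkx.read_graphml` is an alternative when the graph needs to be traversed rather than just counted.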

Viewing the Visualization

cd my_rag/visualization
python3 -m http.server 8888
# Open http://localhost:8888/index.html

🧠 Self-Learning System

The skill maintains a template file (entity_type_templates.json) that improves over time:

Build 1: 8 types (predefined) → LLM returns new types → learns +3
Build 2: 11 types → LLM returns more → learns +2
Build 5: 15 types → highly optimized for your domain

Managing Learned Types

# View current templates
python3 entity_type_learner.py show --domain literature

# Get optimized type list
python3 entity_type_learner.py optimize --domain medicine

# Learn from build logs
python3 entity_type_learner.py learn --log build.log --domain science --auto-merge

🔧 Supported LLM Providers

Any OpenAI-compatible API works:

| Provider | Base URL | Models |
| --- | --- | --- |
| OpenAI | https://api.openai.com/v1 | gpt-4o-mini, gpt-4o |
| NVIDIA | https://integrate.api.nvidia.com/v1 | meta/llama-3.1-8b-instruct |
| Ollama | http://localhost:11434/v1 | llama2, mistral |
| Azure | https://<endpoint>.openai.azure.com/ | gpt-4o |
| Together AI | https://api.together.xyz/v1 | meta-llama/Llama-3-8b |

export OPENAI_API_KEY="your-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"
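
With those two variables exported, a script can read them instead of hard-coding credentials. The helper below is a hypothetical sketch (`provider_settings` is not part of the skill); it assumes only the variable names shown above and the `api_key` / `base_url` keyword names from Configuration Options.

```python
import os
from typing import Mapping, Optional

DEFAULT_BASE_URL = "https://api.openai.com/v1"


def provider_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build api_key/base_url settings from environment variables."""
    env = os.environ if env is None else env
    api_key = (env.get("OPENAI_API_KEY") or "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Fall back to the OpenAI endpoint when no base URL is configured.
    base_url = (env.get("OPENAI_BASE_URL") or "").strip() or DEFAULT_BASE_URL
    return {"api_key": api_key, "base_url": base_url}
```

The result could then be unpacked into the builder, for example `RAGBuilder(files=[...], output_dir="./my_rag", **provider_settings())`, so switching providers only means changing the exported values.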

📁 File Structure

skills/rag-builder/
├── SKILL.md                         # Skill definition (for AI agents)
├── ENTITY_TYPE_LEARNING.md          # Self-learning documentation
├── entity_type_templates.json       # Domain templates (auto-updated)
└── scripts/
    ├── rag_builder_advanced.py      # Main builder (RAGBuilder class)
    ├── generate_visualization.py    # Interactive visualization generator
    ├── domain_detector.py           # Domain detection & type generation
    ├── entity_deduplicator.py       # Entity deduplication tool
    ├── entity_type_learner.py       # Self-learning type system
    ├── extract_text.py              # Universal text extractor
    └── rag_builder.py               # Simple builder (legacy)

📄 License

MIT License


👀 Author

Leo Yuan Tsao (GitHub)


🙏 Acknowledgments
