RAG Builder Skill

Author: Leo Yuan Tsao

English | Chinese

An advanced AI skill for building production-ready RAG (Retrieval-Augmented Generation) knowledge bases, with entity deduplication, domain-adaptive entity types, a self-learning type system, and an interactive knowledge graph visualization.

Built on top of LightRAG.

✨ Features

🔄 Entity Deduplication

Automatically detect and merge duplicate entities using multiple strategies:

Substring matching — "Dante" + "Dante Alighieri" → merged
Levenshtein similarity — "Virgil" + "Virgilius" → merged
Context-based disambiguation — compare entity descriptions
Three merge strategies: conservative, smart (default), aggressive
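The first two strategies above can be sketched roughly as follows. This is a minimal illustration and not the code in entity_deduplicator.py: the function names (`levenshtein`, `similarity`, `is_duplicate`) are hypothetical, the similarity is assumed to be a normalized edit distance, and how the real tool combines the checks under each merge strategy is not shown here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_duplicate(name_a: str, name_b: str, threshold: float = 0.85) -> bool:
    """Hypothetical check: substring match first, then edit-distance similarity."""
    a, b = name_a.strip().lower(), name_b.strip().lower()
    if not a or not b:
        return False
    if a in b or b in a:
        return True
    return similarity(a, b) >= threshold


print(is_duplicate("Dante", "Dante Alighieri"))
print(is_duplicate("Dante", "Virgil"))
```

A plain substring test will also match very short names, so a real implementation would likely add a minimum length or fall back to comparing entity descriptions, as the context-based strategy suggests.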

🎯 Domain-Adaptive Entity Types

Automatically detect document domain and generate appropriate entity types:

| Domain | Example Types |
|---|---|
| Literature | Character, Location, Event, Literary_Work, Symbol, Creature |
| Medicine | Disease, Symptom, Treatment, Drug, Anatomy, Biomarker |
| Law | Case, Court, Judge, Party, Statute, Precedent |
| Business | Company, Product, Technology, Market, Strategy, Metric |
| Science | Theory, Experiment, Researcher, Method, Finding, Hypothesis |
| History | Person, Event, Location, Organization, Time_Period, Artifact |
| Technology | Technology, Product, Algorithm, Protocol, Platform, Tool |

Custom domains are auto-detected via LLM analysis.
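One way to picture the lookup described above is a plain mapping from domain to entity types with a fallback. The sketch below is an assumption about the shape of the data, not the contents of entity_type_templates.json; `DOMAIN_TYPES` and `entity_types_for` are hypothetical names, and only three of the domains from the table are included.

```python
from typing import List, Optional

# Subset of the table above; the real template file may be structured differently.
DOMAIN_TYPES = {
    "literature": ["Character", "Location", "Event", "Literary_Work", "Symbol", "Creature"],
    "medicine": ["Disease", "Symptom", "Treatment", "Drug", "Anatomy", "Biomarker"],
    "law": ["Case", "Court", "Judge", "Party", "Statute", "Precedent"],
}


def entity_types_for(domain: str,
                     custom_entity_types: Optional[List[str]] = None) -> Optional[List[str]]:
    """Prefer explicit types, then a predefined list, else None.

    None stands for "no predefined template": in the skill, this is the case
    where the LLM would be asked to propose types for a custom domain.
    """
    if custom_entity_types:
        return list(custom_entity_types)
    return DOMAIN_TYPES.get(domain.strip().lower())


print(entity_types_for("Law"))
```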

🧠 Self-Learning Type System

The system learns from every build:

Captures entity types returned by the LLM
Normalizes type names (book/work → Book_Work)
Tracks usage frequency
Auto-merges high-frequency types into templates
Optimizes type lists for future builds
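The normalization and frequency-tracking steps above might look something like this. It is a sketch under assumptions: `normalize_type` and `count_types` are hypothetical names, and the exact normalization rules used by entity_type_learner.py may differ.

```python
import re
from collections import Counter


def normalize_type(raw: str) -> str:
    """Split on non-alphanumeric characters and join capitalized parts with '_'."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", raw) if p]
    return "_".join(p.capitalize() for p in parts)


def count_types(raw_types) -> Counter:
    """Count how often each normalized type name appears."""
    counts = Counter()
    for raw in raw_types:
        name = normalize_type(raw)
        if name:  # skip empty or punctuation-only names
            counts[name] += 1
    return counts


print(normalize_type("book/work"))
print(count_types(["book/work", "Book Work", "PERSON", "person"]))
```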

📊 Interactive Visualization

Auto-generated interactive knowledge graph viewer with:

Entity type color coding — each type has a distinct color
Click-to-inspect — click any node to see description, type, and connections
Type filter bar — filter nodes by entity type
Collapsible panels — left sidebar and right detail panel can be hidden
Search — find entities by name
Drag & zoom — rearrange and explore the graph
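As an illustration of the color coding described above, the sketch below assigns each entity type an evenly spaced hue and emits node dictionaries using the `id`, `label`, `group`, and `color` fields that vis-network accepts. The function names and the palette choice are assumptions; generate_visualization.py may assign colors differently.

```python
import colorsys


def assign_type_colors(entity_types) -> dict:
    """Map each entity type to a hex color with an evenly spaced hue."""
    unique = sorted(set(entity_types))
    colors = {}
    for i, etype in enumerate(unique):
        hue = i / len(unique)
        r, g, b = colorsys.hls_to_rgb(hue, 0.5, 0.7)
        colors[etype] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return colors


def to_vis_nodes(entities) -> list:
    """Build node dicts from (name, entity_type) pairs."""
    entities = list(entities)
    colors = assign_type_colors(etype for _, etype in entities)
    return [
        {"id": name, "label": name, "group": etype, "color": colors[etype]}
        for name, etype in entities
    ]


nodes = to_vis_nodes([("Dante", "Character"), ("Florence", "Location")])
print(nodes)
```

With a very large number of types, neighboring hues become hard to tell apart, so a hand-picked palette may be preferable for small, fixed type lists.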

📁 Universal File Support

Extract text from 40+ file formats:

| Format | Extensions | Dependency |
|---|---|---|
| PDF | `.pdf` | PyPDF2 |
| Word | `.docx` | python-docx |
| PowerPoint | `.pptx` | python-pptx |
| Excel | `.xlsx` `.xls` | openpyxl |
| EPUB | `.epub` | ebooklib |
| MOBI / AZW3 | `.mobi` `.azw3` | Calibre or mobi |
| Plain text | `.txt` `.md` `.json` `.csv` `.yaml` `.xml` ... | none |
| Code | `.py` `.js` `.ts` `.java` `.go` `.rs` `.cpp` ... | none |

The format-specific dependencies listed above are installed automatically on first use.
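The extension-to-dependency mapping above lends itself to a simple dispatch table. The following sketch is illustrative only: `EXTRACTOR_DEPENDENCY` and `required_package` are hypothetical names, only part of the table is reproduced, and extract_text.py may be organized quite differently.

```python
from pathlib import Path
from typing import Optional

# Extension -> pip package, taken from the table above (None = no extra dependency).
EXTRACTOR_DEPENDENCY = {
    ".pdf": "PyPDF2",
    ".docx": "python-docx",
    ".pptx": "python-pptx",
    ".xlsx": "openpyxl",
    ".epub": "ebooklib",
    ".txt": None,
    ".md": None,
    ".json": None,
    ".py": None,
}


def required_package(path: str) -> Optional[str]:
    """Return the package needed to read this file, or None for plain text."""
    ext = Path(path).suffix.lower()
    if ext not in EXTRACTOR_DEPENDENCY:
        raise ValueError(f"Unsupported file extension: {ext!r}")
    return EXTRACTOR_DEPENDENCY[ext]


print(required_package("report.pdf"))
```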

🚀 Installation

For Claude Code / Anthropic CLI

# Clone the repo
git clone https://github.com/yuancafe/rag-builder.git

# Copy skill to Claude's skill directory
cp -r rag-builder/skills/rag-builder ~/.claude/skills/

For StepFun / 小跃

cp -r rag-builder/skills/rag-builder ~/.stepfun/skills/

For Other AI Agents

cp -r rag-builder/skills/rag-builder ~/.agents/skills/

Prerequisites

pip install lightrag-hku networkx python-Levenshtein

Optional (auto-installed when needed):

pip install PyPDF2 python-docx python-pptx openpyxl ebooklib beautifulsoup4

📖 Usage

Quick Start (Python)

import os
import asyncio
import sys

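# Adjust this path to wherever you installed the skill
# (for example ~/.claude/skills or ~/.agents/skills instead of ~/.stepfun/skills).
sys.path.insert(0, os.path.expanduser("~/.stepfun/skills/rag-builder/scripts"))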
from rag_builder_advanced import RAGBuilder

# Set your API key
os.environ["OPENAI_API_KEY"] = "your-api-key"
os.environ["OPENAI_BASE_URL"] = "https://api.openai.com/v1"  # or any OpenAI-compatible endpoint

async def main():
    builder = RAGBuilder(
        files=["document.pdf", "notes.md"],
        output_dir="./my_rag",
        auto_detect_domain=True,
        deduplicate_entities=True,
        generate_visualization=True,
    )
    await builder.build()

asyncio.run(main())

CLI Usage

# Build a RAG knowledge base (run these commands from the skill's scripts/ directory)
python3 rag_builder_advanced.py document.pdf -o ./my_rag --auto-detect

# Detect domain & generate entity types
python3 domain_detector.py document.pdf -n 12 -o domain.json

# Analyze duplicate entities
python3 entity_deduplicator.py ./my_rag/rag_database -t 0.85

# Generate visualization for existing RAG
python3 generate_visualization.py ./my_rag/rag_database -o ./my_rag/visualization

# Extract text from any file
python3 extract_text.py document.epub

Configuration Options

import os  # needed for os.environ below; import RAGBuilder as in Quick Start

RAGBuilder(
    files=["doc1.pdf", "doc2.docx"],   # Input files (any supported format)
    output_dir="./my_rag",              # Output directory

    # Domain & Entity Types
    auto_detect_domain=True,            # Auto-detect domain via LLM
    custom_entity_types=None,           # Or specify: ["Person", "Location", ...]

    # Deduplication
    deduplicate_entities=True,          # Enable entity merging
    similarity_threshold=0.85,          # Similarity threshold (0.0–1.0)
    merge_strategy="smart",             # "conservative" | "smart" | "aggressive"

    # Visualization
    generate_visualization=True,        # Generate interactive viewer

    # API (any OpenAI-compatible endpoint)
    llm_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://api.openai.com/v1",
)

📂 Output Structure

my_rag/
├── rag_database/                    # LightRAG knowledge base
│   ├── vdb_entities.json            # Entity vectors
│   ├── vdb_relationships.json       # Relationship vectors
│   ├── vdb_chunks.json              # Text chunk vectors
│   ├── graph_chunk_entity_relation.graphml  # Knowledge graph
│   └── kv_store_*.json              # Key-value stores
├── visualization/                   # Interactive viewer
│   ├── index.html                   # Open this in browser
│   ├── graph_data.json              # Graph data
│   └── entity_details.json          # Entity descriptions
├── reports/
│   ├── deduplication_log.json       # Entity merge history
│   ├── learning_log.json            # Type learning results
│   └── entity_analysis.json         # Entity statistics
└── config.json                      # Build configuration
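To get a quick feel for a finished build, the GraphML file in the tree above can be inspected with the Python standard library alone. The sketch below counts nodes per entity type. It assumes the graph stores the type in a node attribute named `entity_type`, which is not confirmed by this README; `count_entity_types` is a hypothetical helper, so pass a different `attr_name` if your file uses another key.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = "http://graphml.graphdrawing.org/xmlns"


def count_entity_types(graphml_path: str, attr_name: str = "entity_type") -> Counter:
    """Count nodes per value of one node attribute in a GraphML file."""
    root = ET.parse(graphml_path).getroot()
    # GraphML declares attributes as <key id="d0" for="node" attr.name="..."/>.
    key_ids = {
        key.get("id")
        for key in root.iter(f"{{{NS}}}key")
        if key.get("for") == "node" and key.get("attr.name") == attr_name
    }
    counts = Counter()
    for node in root.iter(f"{{{NS}}}node"):
        for data in node.findall(f"{{{NS}}}data"):
            if data.get("key") in key_ids:
                counts[(data.text or "").strip()] += 1
    return counts
```

For example, `count_entity_types("my_rag/rag_database/graph_chunk_entity_relation.graphml").most_common(10)` would list the ten most frequent types, provided the attribute name matches.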

Viewing the Visualization

cd my_rag/visualization
python3 -m http.server 8888
# Open http://localhost:8888/index.html

🧠 Self-Learning System

The skill maintains a template file (entity_type_templates.json) that improves over time:

Build 1: 8 types (predefined) → LLM returns new types → learns +3
Build 2: 11 types → LLM returns more → learns +2
Build 5: 15 types → highly optimized for your domain
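The growth pattern above can be sketched as a merge of frequently seen types into the existing template. This is an assumption about the mechanism, not the logic of entity_type_learner.py: `merge_learned_types`, `min_count`, and `max_types` are hypothetical names and thresholds.

```python
from collections import Counter


def merge_learned_types(template, learned_counts: Counter,
                        min_count: int = 3, max_types: int = 15) -> list:
    """Append learned types seen at least min_count times, most frequent first."""
    merged = list(template)
    for name, count in learned_counts.most_common():
        if len(merged) >= max_types:
            break
        if count >= min_count and name not in merged:
            merged.append(name)
    return merged


template = ["Character", "Location"]
learned = Counter({"Deity": 5, "Sin": 4, "Rare": 1, "Location": 9})
print(merge_learned_types(template, learned))
```

Capping the list length keeps the extraction prompt short, and the frequency floor keeps one-off types returned by the LLM from polluting the template.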

Managing Learned Types

# View current templates
python3 entity_type_learner.py show --domain literature

# Get optimized type list
python3 entity_type_learner.py optimize --domain medicine

# Learn from build logs
python3 entity_type_learner.py learn --log build.log --domain science --auto-merge

🔧 Supported LLM Providers

Any OpenAI-compatible API works:

| Provider | Base URL | Models |
|---|---|---|
| OpenAI | `https://api.openai.com/v1` | gpt-4o-mini, gpt-4o |
| NVIDIA | `https://integrate.api.nvidia.com/v1` | meta/llama-3.1-8b-instruct |
| Ollama | `http://localhost:11434/v1` | llama2, mistral |
| Azure | `https://<endpoint>.openai.azure.com/` | gpt-4o |
| Together AI | `https://api.together.xyz/v1` | meta-llama/Llama-3-8b |

export OPENAI_API_KEY="your-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"
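The same two variables can be set from Python before a build. The helper below is a small sketch, not part of the skill: `PROVIDER_BASE_URLS` and `configure_provider` are hypothetical names, the URLs are copied from the provider list above, and Azure is left out because its endpoint contains a placeholder.

```python
import os

PROVIDER_BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "nvidia": "https://integrate.api.nvidia.com/v1",
    "ollama": "http://localhost:11434/v1",
    "together": "https://api.together.xyz/v1",
}


def configure_provider(provider: str, api_key: str, env=None):
    """Set OPENAI_API_KEY and OPENAI_BASE_URL for the chosen provider.

    env defaults to os.environ; pass a plain dict to avoid touching the process.
    """
    env = os.environ if env is None else env
    key = provider.strip().lower()
    if key not in PROVIDER_BASE_URLS:
        raise ValueError(f"Unknown provider: {provider!r}")
    env["OPENAI_API_KEY"] = api_key
    env["OPENAI_BASE_URL"] = PROVIDER_BASE_URLS[key]
    return env


settings = configure_provider("ollama", "not-needed-locally", env={})
print(settings["OPENAI_BASE_URL"])
```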

📁 File Structure

skills/rag-builder/
├── SKILL.md                         # Skill definition (for AI agents)
├── ENTITY_TYPE_LEARNING.md          # Self-learning documentation
├── entity_type_templates.json       # Domain templates (auto-updated)
└── scripts/
    ├── rag_builder_advanced.py      # Main builder (RAGBuilder class)
    ├── generate_visualization.py    # Interactive visualization generator
    ├── domain_detector.py           # Domain detection & type generation
    ├── entity_deduplicator.py       # Entity deduplication tool
    ├── entity_type_learner.py       # Self-learning type system
    ├── extract_text.py              # Universal text extractor
    └── rag_builder.py               # Simple builder (legacy)

📄 License

MIT License

👤 Author

Leo Yuan Tsao — GitHub

🙏 Acknowledgments

LightRAG — The underlying RAG framework
vis-network — Interactive graph visualization
NetworkX — Graph analysis
