Author: Leo Yuan Tsao
An advanced AI skill for building production-ready RAG (Retrieval-Augmented Generation) knowledge bases with entity deduplication, domain-adaptive entity types, self-learning type system, and interactive knowledge graph visualization.
Built on top of LightRAG.
Automatically detect and merge duplicate entities using multiple strategies:
- Substring matching: "Dante" + "Dante Alighieri" → merged
- Levenshtein similarity: "Virgil" + "Virgilius" → merged
- Context-based disambiguation: compare entity descriptions
- Three merge strategies: conservative, smart (default), aggressive
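The first two strategies above can be sketched in a few lines of pure Python. This is only an illustration, not the code in entity_deduplicator.py: the function names (`levenshtein`, `similarity`, `is_duplicate`), the normalization by the longer string's length, and the `allow_substring` switch are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1 - distance / length of the longer name."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_duplicate(a: str, b: str, threshold: float = 0.85,
                 allow_substring: bool = True) -> bool:
    """Treat two entity names as duplicates if one contains the other,
    or if their normalized similarity reaches the threshold."""
    a_norm, b_norm = a.lower().strip(), b.lower().strip()
    if not a_norm or not b_norm:
        return False
    if allow_substring and (a_norm in b_norm or b_norm in a_norm):
        return True
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    for pair in [("Dante", "Dante Alighieri"), ("Beatrice", "Beatrise"),
                 ("Dante", "Virgil")]:
        print(pair, is_duplicate(*pair))
```

Under this particular normalization, how a pair scores depends heavily on name length, so the threshold that works for one corpus may be too strict or too loose for another. A conservative strategy might disable the substring rule, while an aggressive one might lower the threshold.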
Automatically detect document domain and generate appropriate entity types:
| Domain | Example Types |
|---|---|
| Literature | Character, Location, Event, Literary_Work, Symbol, Creature |
| Medicine | Disease, Symptom, Treatment, Drug, Anatomy, Biomarker |
| Law | Case, Court, Judge, Party, Statute, Precedent |
| Business | Company, Product, Technology, Market, Strategy, Metric |
| Science | Theory, Experiment, Researcher, Method, Finding, Hypothesis |
| History | Person, Event, Location, Organization, Time_Period, Artifact |
| Technology | Technology, Product, Algorithm, Protocol, Platform, Tool |
Custom domains are auto-detected via LLM analysis.
The system learns from every build:
- Captures entity types returned by the LLM
- Normalizes type names (book/work → Book_Work)
- Tracks usage frequency
- Auto-merges high-frequency types into templates
- Optimizes type lists for future builds
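The learning loop described above (normalize, count, promote frequent types) might look roughly like the following. It is a sketch under stated assumptions: `normalize_type`, `learn_types`, and the `min_count` cutoff are hypothetical names and values, and the real entity_type_learner.py may normalize and merge differently.

```python
import re
from collections import Counter


def normalize_type(raw: str) -> str:
    """Turn free-form type names into Title_Case_With_Underscores,
    e.g. "book/work" becomes "Book_Work"."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", raw) if p]
    return "_".join(p.capitalize() for p in parts)


def learn_types(template: list[str], observed: list[str],
                min_count: int = 3) -> list[str]:
    """Append observed types seen at least `min_count` times to the template.

    `min_count` is an assumed cutoff for "high-frequency"."""
    counts = Counter(normalize_type(t) for t in observed)
    counts.pop("", None)  # ignore names that normalize to nothing
    learned = list(template)
    for name, count in counts.most_common():
        if count >= min_count and name not in learned:
            learned.append(name)
    return learned


if __name__ == "__main__":
    seen = ["symbol"] * 3 + ["book/work"] * 2 + ["character"] * 5
    print(learn_types(["Character"], seen))
```

Keeping the template as an ordered list, rather than a set, preserves the predefined types at the front so that learned types never displace them.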
Auto-generated interactive knowledge graph viewer with:
- Entity type color coding: each type has a distinct color
- Click-to-inspect: click any node to see description, type, and connections
- Type filter bar: filter nodes by entity type
- Collapsible panels: left sidebar and right detail panel can be hidden
- Search: find entities by name
- Drag & zoom: rearrange and explore the graph
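As an illustration of the color coding and type filter listed above, here is one way a generator could assign a distinct color per entity type and filter nodes by type. The functions `assign_type_colors` and `filter_nodes`, the HSL palette, and the `{"id", "type"}` node shape are assumptions for this sketch; the actual viewer may do this in the browser instead.

```python
def assign_type_colors(types: list[str]) -> dict[str, str]:
    """Spread hues evenly around the color wheel, one per distinct type."""
    unique = sorted(set(types))
    step = 360 / max(len(unique), 1)
    return {t: f"hsl({round(i * step)}, 65%, 55%)" for i, t in enumerate(unique)}


def filter_nodes(nodes: list[dict], visible_types: set[str]) -> list[dict]:
    """Keep only nodes whose (assumed) "type" field is currently visible."""
    return [n for n in nodes if n.get("type") in visible_types]


if __name__ == "__main__":
    nodes = [
        {"id": "Dante", "type": "Character"},
        {"id": "Florence", "type": "Location"},
        {"id": "Virgil", "type": "Character"},
    ]
    print(assign_type_colors([n["type"] for n in nodes]))
    print(filter_nodes(nodes, {"Location"}))
```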
Extract text from 40+ file formats:
| Format | Extensions | Dependency |
|---|---|---|
| PDF | .pdf | PyPDF2 |
| Word | .docx | python-docx |
| PowerPoint | .pptx | python-pptx |
| Excel | .xlsx .xls | openpyxl |
| EPUB | .epub | ebooklib |
| MOBI / AZW3 | .mobi .azw3 | Calibre or mobi |
| Plain text | .txt .md .json .csv .yaml .xml ... | none |
| Code | .py .js .ts .java .go .rs .cpp ... | none |
All dependencies are auto-installed on first use.
# Clone the repo
git clone https://github.com/yuancafe/rag-builder.git
# Copy skill to Claude's skill directory
cp -r rag-builder/skills/rag-builder ~/.claude/skills/
cp -r rag-builder/skills/rag-builder ~/.stepfun/skills/
cp -r rag-builder/skills/rag-builder ~/.agents/skills/
pip install lightrag-hku networkx python-Levenshtein
Optional (auto-installed when needed):
pip install PyPDF2 python-docx python-pptx openpyxl ebooklib beautifulsoup4
import os
import asyncio
import sys
sys.path.insert(0, os.path.expanduser("~/.stepfun/skills/rag-builder/scripts"))
from rag_builder_advanced import RAGBuilder
# Set your API key
os.environ["OPENAI_API_KEY"] = "your-api-key"
os.environ["OPENAI_BASE_URL"] = "https://api.openai.com/v1" # or any OpenAI-compatible endpoint
async def main():
    builder = RAGBuilder(
        files=["document.pdf", "notes.md"],
        output_dir="./my_rag",
        auto_detect_domain=True,
        deduplicate_entities=True,
        generate_visualization=True,
    )
    await builder.build()

asyncio.run(main())
# Build RAG
python3 rag_builder_advanced.py document.pdf -o ./my_rag --auto-detect
# Detect domain & generate entity types
python3 domain_detector.py document.pdf -n 12 -o domain.json
# Analyze duplicate entities
python3 entity_deduplicator.py ./my_rag/rag_database -t 0.85
# Generate visualization for existing RAG
python3 generate_visualization.py ./my_rag/rag_database -o ./my_rag/visualization
# Extract text from any file
python3 extract_text.py document.epub
RAGBuilder(
    files=["doc1.pdf", "doc2.docx"],  # Input files (any supported format)
    output_dir="./my_rag",  # Output directory
    # Domain & Entity Types
    auto_detect_domain=True,  # Auto-detect domain via LLM
    custom_entity_types=None,  # Or specify: ["Person", "Location", ...]
    # Deduplication
    deduplicate_entities=True,  # Enable entity merging
    similarity_threshold=0.85,  # Similarity threshold (0.0 to 1.0)
    merge_strategy="smart",  # "conservative" | "smart" | "aggressive"
    # Visualization
    generate_visualization=True,  # Generate interactive viewer
    # API (any OpenAI-compatible endpoint)
    llm_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://api.openai.com/v1",
)
my_rag/
├── rag_database/                              # LightRAG knowledge base
│   ├── vdb_entities.json                      # Entity vectors
│   ├── vdb_relationships.json                 # Relationship vectors
│   ├── vdb_chunks.json                        # Text chunk vectors
│   ├── graph_chunk_entity_relation.graphml    # Knowledge graph
│   └── kv_store_*.json                        # Key-value stores
├── visualization/                             # Interactive viewer
│   ├── index.html                             # Open this in browser
│   ├── graph_data.json                        # Graph data
│   └── entity_details.json                    # Entity descriptions
├── reports/
│   ├── deduplication_log.json                 # Entity merge history
│   ├── learning_log.json                      # Type learning results
│   └── entity_analysis.json                   # Entity statistics
└── config.json                                # Build configuration
cd my_rag/visualization
python3 -m http.server 8888
# Open http://localhost:8888/index.html
The skill maintains a template file (entity_type_templates.json) that improves over time:
Build 1: 8 types (predefined) → LLM returns new types → learns +3
Build 2: 11 types → LLM returns more → learns +2
Build 5: 15 types → highly optimized for your domain
# View current templates
python3 entity_type_learner.py show --domain literature
# Get optimized type list
python3 entity_type_learner.py optimize --domain medicine
# Learn from build logs
python3 entity_type_learner.py learn --log build.log --domain science --auto-merge
Any OpenAI-compatible API works:
| Provider | Base URL | Models |
|---|---|---|
| OpenAI | https://api.openai.com/v1 | gpt-4o-mini, gpt-4o |
| NVIDIA | https://integrate.api.nvidia.com/v1 | meta/llama-3.1-8b-instruct |
| Ollama | http://localhost:11434/v1 | llama2, mistral |
| Azure | https://<endpoint>.openai.azure.com/ | gpt-4o |
| Together AI | https://api.together.xyz/v1 | meta-llama/Llama-3-8b |
export OPENAI_API_KEY="your-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"
skills/rag-builder/
├── SKILL.md                          # Skill definition (for AI agents)
├── ENTITY_TYPE_LEARNING.md           # Self-learning documentation
├── entity_type_templates.json        # Domain templates (auto-updated)
└── scripts/
    ├── rag_builder_advanced.py       # Main builder (RAGBuilder class)
    ├── generate_visualization.py     # Interactive visualization generator
    ├── domain_detector.py            # Domain detection & type generation
    ├── entity_deduplicator.py        # Entity deduplication tool
    ├── entity_type_learner.py        # Self-learning type system
    ├── extract_text.py               # Universal text extractor
    └── rag_builder.py                # Simple builder (legacy)
MIT License
Leo Yuan Tsao (GitHub)
- LightRAG: the underlying RAG framework
- vis-network: interactive graph visualization
- NetworkX: graph analysis