DOCX Knowledge Base Builder

A tool for converting DOCX technical documents into LLM training datasets. It parses the document structure, generates Q&A pairs from the content, and writes them out in Alpaca or conversation format so you can build a knowledge base from technical documentation.

🌟 Features

📄 Smart DOCX document structure parsing
🤖 Automatic Q&A pair generation
🔄 Multiple output formats (Alpaca, Conversation)
📦 Batch processing support
✅ Data quality validation
📊 Document structure analysis

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/G2-star/docx-knowledge-builder.git
cd docx-knowledge-builder

# Install dependencies
pip install -r requirements.txt

Basic Usage

Place your DOCX files in the project root directory
Run the extraction script:

python run_extraction.py

Check the generated data in the training_data/ directory.
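
To get a quick overview of what a run produced, a short script can count the records in each generated file. This is a minimal sketch, assuming every file in training_data/ is a JSON array or object as shown under Output Formats; the helper name summarize_output is hypothetical and not part of the project.

```python
import json
from pathlib import Path


def summarize_output(output_dir="training_data"):
    """Return {filename: item_count} for each JSON file in output_dir.

    Assumes the top level of each file is a JSON array or object, so
    len() gives the number of records or top-level keys.
    """
    counts = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        counts[path.name] = len(data)
    return counts


if __name__ == "__main__":
    for name, count in summarize_output().items():
        print(f"{name}: {count}")
```

An unexpectedly low count for the combined files is usually the first sign that a document was not parsed as intended.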

📁 Project Structure

.
├── docx_knowledge_extractor.py    # Core extractor
├── run_extraction.py             # Main script
├── check_data.py                 # Data quality checker
├── requirements.txt              # Dependencies
├── README.md                     # This file
└── training_data/                # Output directory
    ├── combined_training_data_alpaca.json
    ├── combined_training_data_conversation.json
    └── *_structure.json

🔧 Advanced Usage

Single File Processing

python docx_knowledge_extractor.py -i "document.docx" -o output_dir

Batch Processing

python docx_knowledge_extractor.py -i documents_folder -o training_data --batch
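
If you would rather process a folder one document at a time, for example to see which file fails, the single-file command can be looped from Python. This is a sketch that relies only on the -i and -o flags shown above; build_commands is a hypothetical helper, and the exit-code check assumes the extractor returns a non-zero code on failure.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(input_dir, output_dir):
    """Build one single-file extractor command per .docx file in input_dir."""
    commands = []
    for docx in sorted(Path(input_dir).glob("*.docx")):
        # Skip the temporary lock files Word creates, e.g. "~$report.docx".
        if docx.name.startswith("~$"):
            continue
        commands.append([
            sys.executable, "docx_knowledge_extractor.py",
            "-i", str(docx), "-o", str(output_dir),
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_commands("documents_folder", "training_data"):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # cmd[3] is the path passed to -i.
            print(f"Failed: {cmd[3]}", file=sys.stderr)
```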

📊 Output Formats

Alpaca Format

[
  {
    "instruction": "What is the main content?",
    "input": "",
    "output": "The main content is..."
  }
]
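
Before training on an Alpaca-format file, it can help to confirm that every record has the three keys shown above. The following is a minimal sketch of such a check; it is not the logic of check_data.py, and the name find_invalid_records and the rule that instruction and output must be non-empty are assumptions.

```python
REQUIRED_KEYS = ("instruction", "input", "output")


def find_invalid_records(records):
    """Return (index, reason) pairs for records that do not fit the Alpaca shape."""
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append((i, "not an object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in rec]
        if missing:
            problems.append((i, "missing keys: " + ", ".join(missing)))
        elif not str(rec["instruction"]).strip() or not str(rec["output"]).strip():
            # "input" may legitimately be empty, as in the example above.
            problems.append((i, "empty instruction or output"))
    return problems
```

Pass it the list loaded with json.load from combined_training_data_alpaca.json; an empty result means every record passed.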

Conversation Format

[
  {
    "conversations": [
      {"from": "human", "value": "What is the main content?"},
      {"from": "gpt", "value": "The main content is..."}
    ]
  }
]
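
The two formats carry the same question and answer, so one can be derived from the other. The sketch below maps a single Alpaca record to the conversation shape shown above; alpaca_to_conversation is a hypothetical helper, and appending a non-empty input to the instruction is an assumption, not necessarily what the extractor does.

```python
def alpaca_to_conversation(record):
    """Convert one Alpaca record into the conversation format."""
    prompt = record["instruction"]
    # Assumption: a non-empty "input" is appended to the instruction text.
    if record.get("input"):
        prompt = f"{prompt}\n\n{record['input']}"
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ]
    }
```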

📚 Supported Document Types

Technical Specifications
Construction Plans
Quality Management Plans
Safety Management Plans
Other Technical Documentation

🤝 Contributing

We welcome contributions! Please see the Contributing Guidelines in CONTRIBUTING.md for details.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.
