G2-star/docx-knowledge-builder

DOCX Knowledge Base Builder

English | 简体中文

A tool for converting DOCX technical documents into LLM training datasets. It parses each document's structure, generates Q&A pairs from the content, and writes them in Alpaca or conversation format for use as a knowledge base.


🌟 Features

  • 📄 Smart DOCX document structure parsing
  • 🤖 Automatic Q&A pair generation
  • 🔄 Multiple output formats (Alpaca, Conversation)
  • 📦 Batch processing support
  • ✅ Data quality validation
  • 📊 Document structure analysis
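
To make "document structure parsing" concrete, here is a minimal standard-library sketch of one way to read headings and body text from a DOCX file. It is not the project's parser: the function names `read_paragraphs` and `group_by_heading` are hypothetical, and it assumes headings use style IDs that start with "Heading" (the usual case for English-language Word templates; localized templates may use other IDs).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def read_paragraphs(docx_path):
    """Return (style_id, text) for every non-empty paragraph in a DOCX file."""
    # A .docx file is a ZIP archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for p in root.iter(f"{W}p"):
        style = p.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(f"{W}val") if style is not None else None
        # A paragraph's text is split across runs; join all w:t elements.
        text = "".join(t.text or "" for t in p.iter(f"{W}t")).strip()
        if text:
            paragraphs.append((style_id, text))
    return paragraphs


def group_by_heading(paragraphs):
    """Group body paragraphs under the most recent heading (assumed style IDs)."""
    sections = []
    current = {"heading": None, "body": []}
    for style_id, text in paragraphs:
        if style_id and style_id.lower().startswith("heading"):
            # Close the previous section before starting a new one.
            if current["heading"] is not None or current["body"]:
                sections.append(current)
            current = {"heading": text, "body": []}
        else:
            current["body"].append(text)
    if current["heading"] is not None or current["body"]:
        sections.append(current)
    return sections
```

Each resulting section pairs a heading with its body text, which is a natural unit for building a question and answer.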

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/G2-star/docx-knowledge-builder.git
cd docx-knowledge-builder

# Install dependencies
pip install -r requirements.txt

Basic Usage

  1. Place your DOCX files in the project root directory
  2. Run the extraction script:
python run_extraction.py
  3. Check the generated data in the training_data/ directory

📁 Project Structure

.
├── docx_knowledge_extractor.py    # Core extractor
├── run_extraction.py             # Main script
├── check_data.py                 # Data quality checker
├── requirements.txt              # Dependencies
├── README.md                     # This file
└── training_data/                # Output directory
    ├── combined_training_data_alpaca.json
    ├── combined_training_data_conversation.json
    └── *_structure.json

🔧 Advanced Usage

Single File Processing

python docx_knowledge_extractor.py -i "document.docx" -o output_dir

Batch Processing

python docx_knowledge_extractor.py -i documents_folder -o training_data --batch
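
Before running a batch, it can help to preview which files a folder contains. The sketch below is only an illustration, not the tool's own discovery logic: `find_docx_files` is a hypothetical helper, it looks at the top level of the folder only, and it skips the `~$` lock files that Word creates next to open documents.

```python
from pathlib import Path


def find_docx_files(folder):
    """List .docx files directly inside `folder`, sorted by path."""
    return sorted(
        p
        for p in Path(folder).glob("*.docx")
        # Word writes temporary lock files named "~$<name>.docx"; skip them.
        if p.is_file() and not p.name.startswith("~$")
    )


if __name__ == "__main__":
    # "documents_folder" matches the folder name used in the command above.
    for path in find_docx_files("documents_folder"):
        print(path)
```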

📊 Output Formats

Alpaca Format

[
  {
    "instruction": "What is the main content?",
    "input": "",
    "output": "The main content is..."
  }
]
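
A quick way to sanity-check a file in this format is to confirm that every record has the three keys shown above and that the instruction and output are non-empty. The sketch below is an independent illustration and does not reproduce what `check_data.py` does; `load_records` and `find_invalid_records` are hypothetical names.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")


def load_records(path):
    """Load a JSON file that is expected to contain a list of records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_invalid_records(records):
    """Return (index, reason) pairs for records that fail basic checks."""
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append((i, "not a JSON object"))
            continue
        missing = [k for k in REQUIRED_KEYS if k not in rec]
        if missing:
            problems.append((i, "missing keys: " + ", ".join(missing)))
            continue
        if not all(isinstance(rec[k], str) for k in REQUIRED_KEYS):
            problems.append((i, "non-string value"))
            continue
        # "input" may be empty, but instruction and output should carry text.
        if not rec["instruction"].strip() or not rec["output"].strip():
            problems.append((i, "empty instruction or output"))
    return problems
```

For example, `find_invalid_records(load_records("training_data/combined_training_data_alpaca.json"))` returns an empty list when every record passes these checks.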

Conversation Format

[
  {
    "conversations": [
      {"from": "human", "value": "What is the main content?"},
      {"from": "gpt", "value": "The main content is..."}
    ]
  }
]
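
The two formats carry the same information, so one can be derived from the other. The following sketch maps an Alpaca record to a conversation record with the field names shown above. `alpaca_to_conversation` is a hypothetical helper, and appending a non-empty `input` to the prompt after a blank line is an assumption, not necessarily how the tool combines them.

```python
def alpaca_to_conversation(record):
    """Convert one Alpaca-style record into a two-turn conversation record."""
    prompt = record["instruction"]
    # Assumption: when "input" is non-empty, append it after a blank line.
    if record.get("input"):
        prompt = f"{prompt}\n\n{record['input']}"
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ]
    }
```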

📚 Supported Document Types

  • Technical Specifications
  • Construction Plans
  • Quality Management Plans
  • Safety Management Plans
  • Other Technical Documentation

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Thanks to all contributors
  • Inspired by the need for high-quality LLM training data
  • Built with ❤️ for the open-source community

