A tool for converting DOCX technical documents into LLM training datasets. It parses document structure, generates Q&A pairs, and writes them in Alpaca and Conversation formats, so you can build knowledge bases from technical documentation.
- 📄 Smart DOCX document structure parsing
- 🤖 Automatic Q&A pair generation
- 🔄 Multiple output formats (Alpaca, Conversation)
- 📦 Batch processing support
- ✅ Data quality validation
- 📊 Document structure analysis
```bash
# Clone the repository
git clone https://github.com/yourusername/docx-knowledge-builder.git
cd docx-knowledge-builder
# Install dependencies
pip install -r requirements.txt
```
- Place your DOCX files in the project root directory
- Run the extraction script:
```bash
python run_extraction.py
```
- Check the generated data in the `training_data/` directory
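
To make the "check the generated data" step concrete, here is a minimal sketch of a sanity check for Alpaca-style records. It is not the implementation of `check_data.py`; the names `REQUIRED_KEYS`, `find_problems`, and `check_file` are hypothetical, and the rule that `instruction` and `output` must be non-empty is an assumption about what counts as a valid record.

```python
import json
from pathlib import Path

# Keys that appear in the Alpaca-format records produced by this project.
REQUIRED_KEYS = ("instruction", "input", "output")


def find_problems(records):
    """Return a list of (index, message) pairs for records that look invalid."""
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append((index, "record is not an object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((index, f"missing keys: {', '.join(missing)}"))
            continue
        # Assumption: an empty question or answer is treated as a defect.
        if not str(record["instruction"]).strip():
            problems.append((index, "empty instruction"))
        if not str(record["output"]).strip():
            problems.append((index, "empty output"))
    return problems


def check_file(path):
    """Load a JSON file containing a list of records and validate it."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return find_problems(records)
```

After an extraction run, calling `check_file("training_data/combined_training_data_alpaca.json")` would return an empty list if every record passes these checks.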
```
.
├── docx_knowledge_extractor.py    # Core extractor
├── run_extraction.py              # Main script
├── check_data.py                  # Data quality checker
├── requirements.txt               # Dependencies
├── README.md                      # This file
└── training_data/                 # Output directory
    ├── combined_training_data_alpaca.json
    ├── combined_training_data_conversation.json
    └── *_structure.json
```
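
As a rough illustration of what heading-based structure analysis of a DOCX file can look like, the sketch below reads `word/document.xml` directly with the Python standard library and groups paragraphs under their headings. This is not the code in `docx_knowledge_extractor.py`, and it does not describe the actual contents of the `*_structure.json` files. The function names are hypothetical, and it assumes heading paragraphs use style IDs that start with `Heading` (true for documents created with English-language Word defaults, not guaranteed otherwise).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_paragraphs(docx_file):
    """Return (style_id, text) pairs for each non-empty paragraph."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        style = para.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(W + "val", "") if style is not None else ""
        text = "".join(node.text or "" for node in para.iter(W + "t")).strip()
        if text:
            paragraphs.append((style_id, text))
    return paragraphs


def group_by_heading(paragraphs):
    """Group body paragraphs under the most recent heading paragraph."""
    sections = []
    current = {"heading": "", "body": []}
    for style_id, text in paragraphs:
        # Assumption: heading styles have IDs such as "Heading1", "Heading2".
        if style_id.lower().startswith("heading"):
            if current["heading"] or current["body"]:
                sections.append(current)
            current = {"heading": text, "body": []}
        else:
            current["body"].append(text)
    if current["heading"] or current["body"]:
        sections.append(current)
    return sections


def build_sample_docx():
    """Build a tiny in-memory archive with one heading, only for trying the sketch."""
    namespace = W[1:-1]
    body = (
        f'<w:document xmlns:w="{namespace}"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
        '<w:r><w:t>Scope</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>This plan covers site safety.</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer
```

`read_paragraphs` accepts a file path as well as a file-like object, so `group_by_heading(read_paragraphs("document.docx"))` works on a real file in the same way as on the in-memory sample.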
```bash
# Single document
python docx_knowledge_extractor.py -i "document.docx" -o output_dir
```
```bash
# Batch mode (folder input)
python docx_knowledge_extractor.py -i documents_folder -o training_data --batch
```
Alpaca format:
```json
[
  {
    "instruction": "What is the main content?",
    "input": "",
    "output": "The main content is..."
  }
]
```
Conversation format:
```json
[
  {
    "conversations": [
      {"from": "human", "value": "What is the main content?"},
      {"from": "gpt", "value": "The main content is..."}
    ]
  }
]
```
- Technical Specifications
- Construction Plans
- Quality Management Plans
- Safety Management Plans
- Other Technical Documentation
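
The two output formats shown earlier (Alpaca and Conversation) carry the same question/answer pair in different shapes. The sketch below converts one Alpaca record into a Conversation record to make that mapping explicit. The function name `alpaca_to_conversation` is hypothetical, and appending a non-empty `input` to the instruction after a blank line is an assumption, not something this project is documented to do.

```python
def alpaca_to_conversation(record):
    """Convert one Alpaca-style record into a Conversation-style record."""
    prompt = record["instruction"]
    # Assumption: a non-empty "input" is appended to the question text.
    if record.get("input"):
        prompt = f"{prompt}\n\n{record['input']}"
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ]
    }


def convert_all(records):
    """Convert a list of Alpaca records, preserving their order."""
    return [alpaca_to_conversation(record) for record in records]
```

Applied to the Alpaca sample above, this yields the Conversation sample above, since the `input` field there is empty.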
We welcome contributions! Please see our Contributing Guidelines for details.
This project is licensed under the MIT License - see the LICENSE file for details.
- Thanks to all contributors
- Inspired by the need for high-quality LLM training data
- Built with ❤️ for the open-source community
- GitHub Issues: Create an issue
- Email: agaid1mnjh45@gmail.com