Data ingestion module for Machine Assisted Development (MAD). This module fetches content from Confluence, processes it, and indexes it in Azure Search for advanced search capabilities.
- Fetch specific pages or entire spaces from Confluence
- Parse and extract document metadata from structured tables
- Handle different document types (ABRD, FBRD)
  - Application Business Requirements Document (ABRD)
  - Feature Business Requirements Document (FBRD)
- Split content into searchable sections based on headings
- Convert HTML content to well-structured Markdown
- Extract and index requirement IDs (e.g., FR-001, PR-001)
- Index full documents and sections in Azure Search
- Support for semantic search and hybrid queries
- Vector search capabilities using Azure OpenAI embeddings
- AI-generated document summaries using Azure OpenAI chat models
- SQLite-based persistent caching to avoid re-processing unchanged content
- Version tracking with unique document IDs in the search index
- Page configuration management system
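The requirement-ID extraction listed above can be sketched with a regular expression. This is a minimal illustration, not the module's actual code: the function name `extract_requirement_ids` is hypothetical, and the pattern (two or more uppercase letters, a hyphen, three or more digits) is an assumption inferred from the FR-001 and PR-001 examples.

```python
import re

# Assumed ID shape, inferred from the examples FR-001 and PR-001:
# two or more uppercase letters, a hyphen, then three or more digits.
REQUIREMENT_ID_PATTERN = re.compile(r"\b[A-Z]{2,}-\d{3,}\b")


def extract_requirement_ids(text: str) -> list[str]:
    """Return the unique requirement IDs in text, in order of first appearance."""
    seen: dict[str, None] = {}  # dicts preserve insertion order
    for match in REQUIREMENT_ID_PATTERN.finditer(text):
        seen.setdefault(match.group(0), None)
    return list(seen)
```

Deduplicating while keeping first-appearance order gives a stable value for a collection field such as requirement_ids, so re-processing the same section yields the same list.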
- Clone this repository
- Create a virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows, use .venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Copy env.template to .env:
cp env.template .env
- Edit .env and fill in your Confluence and Azure Search credentials
- For AI features, configure Azure OpenAI settings:
  - AZURE_OPENAI_API_KEY: Your Azure OpenAI API key
  - AZURE_OPENAI_ENDPOINT: Your Azure OpenAI endpoint
  - AZURE_OPENAI_DEPLOYMENT_NAME: The deployment name of your chat model
  - AZURE_OPENAI_API_VERSION: Use "2024-12-01-preview" or later
  - ENABLE_SUMMARIZATION: Set to "true" to enable AI summaries
- Configure pages to process in config/pages.json (see Page Configuration section)
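As a rough sketch of how the Azure OpenAI variables above might be read and validated at startup: the class `OpenAISettings` and the function `load_openai_settings` are hypothetical names, and falling back to "2024-12-01-preview" when AZURE_OPENAI_API_VERSION is unset is an assumption. The sketch expects the variables to already be present in the environment (for example, after the .env file has been loaded by whatever mechanism the project uses).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OpenAISettings:
    api_key: str
    endpoint: str
    deployment_name: str
    api_version: str
    enable_summarization: bool


def load_openai_settings(env: Optional[Mapping[str, str]] = None) -> OpenAISettings:
    """Read the Azure OpenAI variables from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    required = [
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_ENDPOINT",
        "AZURE_OPENAI_DEPLOYMENT_NAME",
    ]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError(f"Missing settings: {', '.join(missing)}")
    return OpenAISettings(
        api_key=env["AZURE_OPENAI_API_KEY"],
        endpoint=env["AZURE_OPENAI_ENDPOINT"],
        deployment_name=env["AZURE_OPENAI_DEPLOYMENT_NAME"],
        # Assumed default: the minimum version named in the setup steps.
        api_version=env.get("AZURE_OPENAI_API_VERSION", "2024-12-01-preview"),
        # Only the string "true" (case-insensitive) enables summaries.
        enable_summarization=env.get("ENABLE_SUMMARIZATION", "false").strip().lower() == "true",
    )
```

Failing fast on missing values surfaces a misconfigured .env before any Confluence page is fetched.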
There are two ways to run the application:
Run from the project root directory:
# Activate virtual environment
source .venv/bin/activate
# Process a specific page
python run.py --page-id 123456
# Process all pages in a space
python run.py --space-key MYSPACE
# Process all configured pages
python run.py --process-all
Or run from the src directory:
# Navigate to src directory and activate environment
cd src
source ../.venv/bin/activate
# Process a specific page
python main.py --page-id 123456
# Process all pages in a space
python main.py --space-key MYSPACE
# Process all configured pages
python main.py --process-all
To process a page with AI-generated summaries:
ENABLE_SUMMARIZATION="true" python run.py --page-id 123456
Or set ENABLE_SUMMARIZATION="true" in your .env file.
# List all configured pages and spaces
python run.py --list-pages
# Add a page to configuration
python run.py --add-page 123456 --page-name "My Document"
# Remove a page from configuration
python run.py --remove-page 123456
# Show cache statistics
python run.py --cache-status
# Clear the cache
python run.py --clear-cache
# Force reindex a page (ignores cache)
python run.py --page-id 123456 --force-reindex
Additional options:
- --config-file: Specify an alternate config file path (default: config/pages.json)
- --log-level: Set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- --dry-run: Process content but don't send to Azure Search
- --force-reindex: Force reprocessing even if the page version hasn't changed
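The interaction between the cache and --force-reindex can be illustrated with a small SQLite sketch. Everything here is an assumption made for illustration: the table name `page_cache`, its columns, and the helper names `open_cache`, `should_process`, and `mark_processed` are hypothetical and may not match the module's real cache layout.

```python
import sqlite3


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a cache database with one row per Confluence page."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_cache ("
        "page_id TEXT PRIMARY KEY, version INTEGER NOT NULL)"
    )
    return conn


def should_process(conn: sqlite3.Connection, page_id: str, version: int,
                   force_reindex: bool = False) -> bool:
    """Return True when the page is new, changed, or reindexing is forced."""
    if force_reindex:
        return True  # mirrors --force-reindex: skip the cache lookup
    row = conn.execute(
        "SELECT version FROM page_cache WHERE page_id = ?", (page_id,)
    ).fetchone()
    return row is None or row[0] != version


def mark_processed(conn: sqlite3.Connection, page_id: str, version: int) -> None:
    """Record the version that was just indexed."""
    conn.execute(
        "INSERT OR REPLACE INTO page_cache (page_id, version) VALUES (?, ?)",
        (page_id, version),
    )
    conn.commit()
```

In this sketch a page is skipped only when its stored version equals the version reported by Confluence, so any edit to the page triggers re-processing on the next run.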
.
├── config/ # Configuration files
│ └── pages.json # Page configuration
├── src/
│ ├── config/ # Configuration loading
│ ├── connectors/ # API clients for external services
│ │ ├── confluence/ # Confluence API client
│ │ └── azure_search/ # Azure Search client
│ ├── core/ # Core processing logic
│ ├── models/ # Data models
│ ├── services/ # External service integrations
│ └── utils/ # Utility functions
├── tests/ # Unit and integration tests
├── docs/ # Documentation
├── run.py # Convenience script to run from project root
├── env.template # Template for environment variables
├── requirements.txt # Python dependencies
└── README.md # This file
The application expects an Azure Search index with the following fields:
- id (String): Unique identifier for the document (includes version info)
- content (String): The main content of the document or section
- source_page_id (String): The ID of the source page in Confluence
- source_page_title (String): The title of the source page
- source_url (String): URL to the source page
- is_section (Boolean): Whether this is a section or full document
- section_id (String): ID of the section (if applicable)
- section_title (String): Title of the section (if applicable)
- section_level (Integer): Heading level of the section
- section_number (String): Section number (e.g., "2.1.3")
- document_type (String): Type of document (ABRD, FBRD, UNKNOWN)
- project_code (String): Project code extracted from document ID
- document_id (String): Document ID from metadata
- document_version (String): Document version
- document_status (String): Document status (DRAFT, APPROVED, etc.)
- created_date (String): When the document was created
- last_updated_date (String): When the document was last updated
- document_owner (String): Owner of the document
- summary (String): AI-generated summary (if enabled)
- requirement_ids (Collection(String)): Requirements IDs found in the section
- vector (Collection(float)): Vector embedding of the content for vector search
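To make the schema concrete, here is a hedged sketch of how a section record with a version-aware id might be assembled before upload. The id format (`<page_id>_v<version>_<section_id>`) and the helper names `make_document_id` and `build_section_document` are assumptions, not the module's actual scheme. Only a subset of the fields is populated. Characters outside letters, digits, underscore, dash, and equals sign are replaced because Azure Search restricts document keys to that set.

```python
import re
from typing import Iterable, Optional


def make_document_id(page_id: str, version: str,
                     section_id: Optional[str] = None) -> str:
    """Build a key that changes whenever the page version changes (assumed format)."""
    parts = [page_id, f"v{version}"]
    if section_id:
        parts.append(section_id)
    # Replace characters that are not valid in an Azure Search document key.
    return re.sub(r"[^A-Za-z0-9_\-=]", "-", "_".join(parts))


def build_section_document(page_id: str, page_title: str, source_url: str,
                           version: str, section_id: str, section_title: str,
                           section_level: int, content: str,
                           requirement_ids: Iterable[str]) -> dict:
    """Return a dict whose keys follow the index fields listed above."""
    return {
        "id": make_document_id(page_id, version, section_id),
        "content": content,
        "source_page_id": page_id,
        "source_page_title": page_title,
        "source_url": source_url,
        "is_section": True,
        "section_id": section_id,
        "section_title": section_title,
        "section_level": section_level,
        "document_version": version,
        "requirement_ids": list(requirement_ids),
    }
```

Embedding the version in the key means a new page version produces new index entries instead of silently overwriting the old ones; whether stale entries are then deleted is a separate policy decision this sketch does not cover.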
The application uses a JSON configuration file to manage pages and spaces to process. Example:
{
  "pages": {
    "<PAGE_ID>": {
      "name": "Name of the Business requirements document",
      "enabled": true,
      "type": "ABRD",
      "project": "<PROJECT_CODE>"
    },
    "<CONFLUENCE_PAGE_ID>": {
      "name": "Name of the feature requirements document",
      "enabled": true,
      "type": "FBRD",
      "project": "<PROJECT_CODE>"
    }
  },
  "spaces": {
    "<CONFLUENCE_SPACE_ID>": {
      "name": "Name of the Confluence space",
      "enabled": true,
      "description": "Main documentation space"
    }
  }
}
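A configuration in this shape could be consumed along the following lines. This is a sketch under assumptions: `load_config` and `enabled_ids` are hypothetical helper names, and treating an entry with no "enabled" key as disabled is a guess about the intended default.

```python
import json


def load_config(path: str = "config/pages.json") -> dict:
    """Parse the page configuration file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def enabled_ids(config: dict, kind: str) -> list[str]:
    """Return the IDs under "pages" or "spaces" whose entry has enabled set to true."""
    entries = config.get(kind, {})
    # Assumption: a missing "enabled" key means the entry is skipped.
    return [key for key, entry in entries.items() if entry.get("enabled", False)]
```

For example, `enabled_ids(load_config(), "pages")` would give the page IDs to process when running with --process-all, and `enabled_ids(load_config(), "spaces")` the corresponding space keys.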