A microservices-based application that enables AI-powered querying of Serbian documents using NLP processing, vector embeddings, and retrieval-augmented generation (RAG). It is optimized for Serbian Cyrillic text, with Serbian-aware language models and preprocessing pipelines.
This system is specifically designed and optimized for Serbian language content, with comprehensive support for:
- 📝 Cyrillic Script Processing: Full Serbian Cyrillic alphabet support (а-ш, including ђ, ј, љ, њ, ћ, џ), plus the Serbian Latin diacritics (ž, š, đ, č, ć)
- 🔤 Script Detection: Automatic detection and optimization for Cyrillic, Latin, and mixed script content
- 🧠 Advanced Serbian NLP: CLASSLA-powered morphological analysis, named entity recognition, and syntactic parsing
- 🎯 Content-Aware Chunking: Dynamic text segmentation optimized for Serbian morphology and sentence structure
- ⚡ Optimized Embeddings: Serbian-specific preprocessing pipeline for better vector representations
- 🔍 Intelligent Retrieval: Script-type-aware similarity search with adaptive thresholds
| Feature | Description | Benefits |
|---|---|---|
| Script Analysis | Automatic Cyrillic/Latin detection with ratio calculation | Optimal processing for mixed-script documents |
| Text Normalization | Unicode NFC normalization, Serbian punctuation handling | Consistent text representation |
| Morphological Processing | CLASSLA-powered analysis of Serbian grammar | Better understanding of complex morphology |
| Dynamic Chunking | Content-adaptive splitting (20% larger for Cyrillic) | Preserves semantic meaning |
| Query Optimization | Script-aware retrieval parameters | More relevant results for Serbian queries |
| Embedding Enhancement | Cyrillic-optimized preprocessing pipeline | Higher quality vector representations |
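The script analysis row above can be illustrated with a short sketch. This is not the project's implementation: the function names and the 0.8 / 0.2 cut-offs are assumptions chosen for illustration, since the README does not state the actual thresholds.

```python
import unicodedata


def cyrillic_ratio(text: str) -> float:
    """Share of alphabetic characters that are Cyrillic (0.0 if there are no letters)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "CYRILLIC" in unicodedata.name(ch, ""))
    return cyrillic / len(letters)


def classify_script(text: str, high: float = 0.8, low: float = 0.2) -> str:
    """Classify text as 'cyrillic', 'latin' or 'mixed'.

    The `high` and `low` cut-offs are hypothetical, not taken from the project.
    """
    ratio = cyrillic_ratio(text)
    if ratio >= high:
        return "cyrillic"
    if ratio <= low:
        return "latin"
    return "mixed"


print(classify_script("Које су главне теме у документу?"))
print(classify_script("Београд Beograd"))
```

Text without any letters gets a ratio of 0.0 and is therefore classified as `latin`; a real service would probably treat that case separately.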
The system is composed of microservices that combine Serbian NLP processing, vector embeddings, and retrieval-augmented generation (RAG) to provide document search and question answering for Serbian content, primarily in Cyrillic script.
graph TB
User["User - Serbian Queries"] --> Frontend
Frontend["Angular Frontend - Port 4200"] --> Backend["Spring Boot Backend - Port 8088"]
Backend --> OpenAI["OpenAI GPT-4o-mini"]
Backend --> Ollama["Ollama + Gemma3 - Port 11434"]
Backend --> CLASSLA["CLASSLA Service - Port 5001"]
CLASSLA --> LlamaIndex["LlamaIndex App - Port 5002"]
LlamaIndex --> Embedic["Embedic Service - Port 5003"]
Backend --> PostgreSQL["PostgreSQL + PgVector - Port 5435"]
LlamaIndex --> PostgreSQL
Embedic --> PostgreSQL
Backend -.-> CLASSLA
CLASSLA -.-> LlamaIndex
LlamaIndex -.-> Embedic
Embedic -.-> PostgreSQL
Frontend -.-> Backend
Backend -.-> LlamaIndex
LlamaIndex -.-> Backend
Backend -.-> Frontend
classDef user fill:#e1f5fe,stroke:#01579b
classDef frontend fill:#f3e5f5,stroke:#4a148c
classDef backend fill:#e8f5e8,stroke:#1b5e20
classDef ai fill:#fff3e0,stroke:#e65100
classDef serbian fill:#ffebee,stroke:#c62828
classDef database fill:#fce4ec,stroke:#880e4f
class User user
class Frontend frontend
class Backend backend
class OpenAI,Ollama ai
class CLASSLA,LlamaIndex,Embedic serbian
class PostgreSQL database
graph TB
Input["PDF Document"]
Input --> Extract["PagePdfDocumentReader"]
Extract --> Split["TokenTextSplitter"]
Split --> Store["Spring AI VectorStore"]
Store --> Retrieve["QuestionAnswerAdvisor"]
Retrieve --> Output["AI Response"]
classDef input fill:#e8f4f8,stroke:#1565c0
classDef processing fill:#e3f2fd,stroke:#1976d2
classDef output fill:#f3e5f5,stroke:#7b1fa2
class Input input
class Extract,Split,Store,Retrieve processing
class Output output
graph TB
PDFIn["Serbian PDF"] --> TikaExt["TikaDocumentReader"]
TikaExt --> Validate["Serbian Text Validation"]
Validate --> NLP["CLASSLA NLP Service"]
NLP --> Chunk["LlamaIndex Service"]
Chunk --> Embed["Embedic Service"]
Embed --> DBStore["PostgreSQL + PgVector"]
QueryIn["User Query"] --> Analyze["Script Analysis"]
Analyze --> Search["Intelligent Retrieval"]
Search --> Context["Context Processing"]
Context --> AIGen["Serbian-Optimized AI"]
AIGen --> ResponseOut["Formatted Response"]
DBStore -.-> Search
classDef serbian fill:#ffebee,stroke:#c62828
classDef service fill:#e8f5e8,stroke:#2e7d32
classDef nlp fill:#fff3e0,stroke:#ef6c00
classDef query fill:#f1f8e9,stroke:#558b2f
classDef ai fill:#fce4ec,stroke:#ad1457
class PDFIn,TikaExt,Validate serbian
class NLP,Chunk,Embed,DBStore service
class QueryIn,Analyze query
class Search,Context nlp
class AIGen,ResponseOut ai
graph LR
UserChoice["Choose Your Path"] --> BasicFlow["Basic Approach<br/>Simple 5-Step Process"]
UserChoice --> AdvancedFlow["Serbian-Optimized<br/>Complex 11-Step Process"]
BasicFlow --> BasicBenefits["Quick Setup<br/>Language Agnostic"]
AdvancedFlow --> AdvancedBenefits["Cyrillic Support<br/>Advanced Serbian NLP"]
classDef choice fill:#fff2cc,stroke:#d6b656
classDef basic fill:#e1f5fe,stroke:#0277bd
classDef advanced fill:#ffebee,stroke:#c62828
classDef benefits fill:#f9f9f9,stroke:#666666
class UserChoice choice
class BasicFlow basic
class AdvancedFlow advanced
class BasicBenefits,AdvancedBenefits benefits
graph TB
UI["Angular Frontend<br/>Port: 4200"] --> API["Spring Boot Backend<br/>Port: 8088"]
API --> BasicService["SpringAIService"]
API --> AdvancedService["PdfAiQueryService"]
BasicService --> OpenAI["OpenAI GPT-4o-mini"]
AdvancedService --> Ollama["Ollama + Gemma3<br/>Port: 11434"]
AdvancedService --> CLASSLA["CLASSLA NLP Service<br/>Port: 5001"]
AdvancedService --> LlamaIdx["LlamaIndex Service<br/>Port: 5002"]
AdvancedService --> Embedic["Embedic Service<br/>Port: 5003"]
LlamaIdx --> PostgreSQL["PostgreSQL + PgVector<br/>Port: 5435"]
Embedic --> PostgreSQL
classDef frontend fill:#e3f2fd,stroke:#1976d2
classDef backend fill:#f3e5f5,stroke:#7b1fa2
classDef ai fill:#fff3e0,stroke:#ef6c00
classDef serbian fill:#ffebee,stroke:#c62828
classDef data fill:#e8f5e8,stroke:#2e7d32
class UI frontend
class API,BasicService,AdvancedService backend
class OpenAI,Ollama ai
class CLASSLA,LlamaIdx,Embedic serbian
class PostgreSQL data
- 📄 PDF Upload → Backend extracts text using Apache Tika
- 🔍 Cyrillic Analysis → Script detection, validation, and preprocessing
- 🧠 Serbian NLP → CLASSLA performs advanced Serbian language analysis with Cyrillic support
- 📦 Smart Chunking → LlamaIndex creates content-aware text chunks optimized for Serbian morphology
- 🔢 Optimized Vectors → Embedic generates high-quality embeddings with Serbian preprocessing
- 💾 Enhanced Storage → Vectors and Serbian metadata stored in PostgreSQL with PgVector
- 💬 Serbian Query → Frontend sends question (Cyrillic/Latin/Mixed)
- 🔍 Script Analysis → Query script detection and optimization
- 📚 Intelligent Retrieval → Script-aware similarity search with adaptive thresholds
- 🤖 Serbian AI → Spring AI generates response using Serbian-optimized prompts
- 📱 Response Display → Formatted answer preserving Serbian language characteristics
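The "Intelligent Retrieval" step above can be sketched in memory as follows. In the described system the similarity search runs in PostgreSQL with PgVector; this sketch only shows the idea of a script-dependent relevance cut-off. The threshold values and the `retrieve` function are hypothetical.

```python
import math

# Hypothetical per-script similarity cut-offs; the real values are not given in this README.
THRESHOLDS = {"cyrillic": 0.70, "mixed": 0.65, "latin": 0.55}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, query_script, chunks, top_k=3):
    """Return up to top_k (text, score) pairs above the threshold for the query's script.

    `chunks` is a list of (text, vector) pairs; results are sorted best first.
    """
    threshold = THRESHOLDS.get(query_script, 0.65)
    scored = [(text, cosine(query_vec, vec)) for text, vec in chunks]
    kept = [item for item in scored if item[1] >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


chunks = [("a", [1.0, 0.0]), ("b", [0.8, 0.6]), ("c", [0.0, 1.0]), ("d", [0.6, 0.8])]
print([text for text, _ in retrieve([1.0, 0.0], "cyrillic", chunks)])
print([text for text, _ in retrieve([1.0, 0.0], "latin", chunks)])
```

With these illustrative thresholds, a Cyrillic query keeps only the two closest chunks, while a Latin query also admits the weaker match.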
This system provides two distinct approaches for document processing and querying, allowing you to choose based on your specific needs:
| Feature | SpringAIService (Basic) | PdfAiQueryService (Serbian-Optimized) |
|---|---|---|
| Complexity | Simple, 83 lines of code | Advanced, 417 lines with specialized processing |
| Target Language | Language-agnostic | Serbian Cyrillic optimized |
| PDF Processing | Standard PagePdfDocumentReader | TikaDocumentReader + Cyrillic validation |
| Text Splitting | Basic TokenTextSplitter | Script-aware dynamic chunking via microservices |
| NLP Processing | None | Advanced Serbian NLP (CLASSLA service) |
| Text Analysis | Basic | Cyrillic content analysis, script detection |
| Preprocessing | None | Unicode normalization, Serbian punctuation handling |
| Embeddings | Standard Spring AI | Serbian-optimized embedic-large (1024-dim) |
| Context Optimization | Basic | Intelligent truncation preserving sentence boundaries |
| Microservices | None | 3 specialized services (CLASSLA, LlamaIndex, Embedic) |
| Metadata Tracking | Basic | Enhanced with Cyrillic analysis and processing stats |
| Error Handling | Basic | Comprehensive validation and quality checks |
| Performance Monitoring | None | Detailed logging with Serbian processing metrics |
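The "Context Optimization" row mentions truncation that preserves sentence boundaries. A minimal sketch of that idea, assuming sentences end with `.`, `!` or `?` followed by a space; the function name and fallback behaviour are illustrative and not taken from `PdfAiQueryService`:

```python
def truncate_at_sentence(text: str, max_chars: int) -> str:
    """Cut text to at most max_chars, ending on a sentence boundary when one exists."""
    if len(text) <= max_chars:
        return text
    # Look one character past the limit so a sentence ending exactly at the limit is found.
    window = text[: max_chars + 1]
    cut = max(window.rfind(end) for end in (". ", "! ", "? "))
    if cut != -1:
        return window[: cut + 1]
    # No sentence boundary in range: fall back to the last whole word.
    head = text[:max_chars]
    space = head.rfind(" ")
    return head[:space] if space > 0 else head


context = "Прва реченица. Друга реченица је дужа. Трећа."
print(truncate_at_sentence(context, 30))
```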
Choose SpringAIService (Basic) when you:
- Process non-Serbian documents or general multilingual content
- Need quick setup with minimal configuration
- Work with standard Latin-script documents
- Require simple, straightforward document Q&A
- Want lightweight processing without the additional NLP microservices
- Are prototyping or need only basic document search

Choose PdfAiQueryService (Serbian-Optimized) when you:
- Process Serbian Cyrillic documents specifically
- Need advanced language processing with morphological analysis
- Require script detection and mixed Cyrillic/Latin handling
- Want optimized embeddings for Serbian content
- Need enhanced context retrieval with adaptive thresholds
- Require comprehensive text validation and quality assurance
- Work with complex Serbian documents that need specialized NLP

The system currently provides the following endpoints in ChatController:
# Ingest using standard Spring AI processing
GET /api/ingest-pdf-openai

# Ingest using advanced Serbian Cyrillic processing
GET /api/ingest-pdf-local

# Query using basic SpringAIService approach
POST /api/chat
{
"question": "Your question here"
}

# 1. Ingest Serbian document with Cyrillic optimization
curl -X GET http://localhost:8088/api/ingest-pdf-local
# 2. Query the processed document
curl -X POST http://localhost:8088/api/chat \
-H "Content-Type: application/json" \
-d '{"question": "Које су главне теме у документу?"}'
# 3. Alternative: Ingest with basic processing
curl -X GET http://localhost:8088/api/ingest-pdf-openai

💡 Note: Currently, the /api/chat endpoint uses the basic SpringAIService approach for queries. To fully utilize the Serbian-optimized processing pipeline, you could extend the ChatController with an additional endpoint:
@PostMapping("/serbian-chat")
public ResponseEntity<Map<String, String>> serbianChat(@RequestBody Map<String, String> payload) {
String question = payload.get("question");
String answer = pdfAiQueryService.query(question); // Use Serbian-optimized query
return ResponseEntity.ok(Map.of("answer", answer));
}

| Layer | Technology | Serbian Optimization |
|---|---|---|
| Frontend | Angular 17, TypeScript, SCSS | Cyrillic character support, Serbian UI |
| Backend | Spring Boot 3, Spring AI, Java 17 | Two processing approaches: Basic + Serbian-optimized |
| AI Models | OpenAI GPT-4o-mini, Ollama Gemma3 | Serbian system prompts, multilingual support |
| Serbian NLP | CLASSLA, Stanza, Python | Native Serbian processing, Cyrillic analysis |
| Vector Search | LlamaIndex, PgVector | Script-aware chunking, adaptive retrieval |
| Serbian Embeddings | djovak/embedic-large (1024-dim) | Serbian-optimized preprocessing |
| Database | PostgreSQL 16, PgVector | Serbian metadata, script statistics |
| Infrastructure | Docker, Docker Compose | Cyrillic locale support |
- Technology: Angular 17 + TypeScript + SCSS
- Port: 4200
- Serbian Features:
- Full Cyrillic character input support
- Serbian language interface elements
- Mixed script query handling
- Cyrillic-aware text rendering
- Serbian typography optimization
- Technology: Spring Boot 3 + Spring AI + Apache Tika
- Port: 8088
- Serbian Optimizations:
- Cyrillic Text Validation: Unicode normalization, encoding verification
- Script Analysis: Automatic Cyrillic/Latin/Mixed detection
- Text Preprocessing: Serbian punctuation normalization
- Enhanced Prompts: Serbian-specific AI instruction templates
- Quality Assurance: Text validation, character encoding checks
- Performance Monitoring: Serbian processing statistics
- Technology: Python + Flask + CLASSLA NLP Library
- Port: 5001
- Serbian Capabilities:
- Script Detection: Cyrillic ratio calculation and classification
- Text Normalization: Unicode NFC, Serbian punctuation handling
- Morphological Analysis: Serbian-specific grammatical features
- Named Entity Recognition: Serbian entity detection with transliteration
- Tokenization: Cyrillic-aware word segmentation
- Lemmatization: Serbian word root extraction
- Transliteration: Cyrillic ↔ Latin conversion support
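As an illustration of the transliteration capability listed above, here is a minimal Cyrillic to Latin sketch based on the standard one-to-one mapping between the two Serbian alphabets. It is a standalone example, not the CLASSLA service's code; the function name is hypothetical. Only one direction is shown, because Latin to Cyrillic is ambiguous for digraphs (for example, `nj` can be `њ` or `н` + `ј`).

```python
# Serbian Cyrillic -> Latin, lower-case letters only; upper case is derived below.
_CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def cyr_to_lat(text: str) -> str:
    """Transliterate Serbian Cyrillic to Latin; other characters pass through unchanged."""
    out = []
    for ch in text:
        lower = ch.lower()
        latin = _CYR_TO_LAT.get(lower)
        if latin is None:
            out.append(ch)
        elif ch != lower:
            # Upper-case letter: digraphs become title case (Љ -> Lj).
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


print(cyr_to_lat("Које су главне теме у документу?"))
```

Words written entirely in capitals would come out as `LjUBAV` rather than `LJUBAV` with this simplification; handling that case needs a look at neighbouring letters.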
- Technology: Python + Flask + LlamaIndex + PostgreSQL
- Port: 5002
- Serbian Features:
- Dynamic Chunking: Content-adaptive splitting based on script type
  - Cyrillic Text: 20% larger chunks (4800 chars) with enhanced overlap
  - Mixed Script: 10% larger chunks (4400 chars) with balanced overlap
  - Latin Text: Standard chunking with optimized boundaries
- Script-Aware Boundaries: Serbian sentence pattern recognition
- Metadata Enhancement: Script type and ratio tracking
- Query Optimization: Script-type-aware retrieval parameters
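The chunk sizes above can be sketched as follows. The 4000-character base is inferred from the stated figures (4800 is 20% above 4000, and 4400 is 10% above). The sentence-based splitter is a simplified stand-in for the LlamaIndex service, which this README does not show, and it omits the overlap between chunks. All names are hypothetical.

```python
import re
from typing import List

BASE_CHUNK_CHARS = 4000  # inferred from the 4800 / 4400 figures, not stated directly
SCALE_PERCENT = {"cyrillic": 120, "mixed": 110, "latin": 100}


def chunk_size_for(script: str, base: int = BASE_CHUNK_CHARS) -> int:
    """Character limit for one chunk, scaled by script type."""
    return base * SCALE_PERCENT.get(script, 100) // 100


def chunk_text(text: str, script: str, base: int = BASE_CHUNK_CHARS) -> List[str]:
    """Greedily pack whole sentences into chunks no longer than the script's limit.

    A single sentence longer than the limit is kept intact as its own chunk.
    """
    limit = chunk_size_for(script, base)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_size_for("cyrillic"), chunk_size_for("mixed"), chunk_size_for("latin"))
```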
- Technology: Python + Flask + SentenceTransformers
- Port: 5003
- Serbian Optimizations:
- Model: djovak/embedic-large (1024-dimensional, Serbian-optimized)
- Cyrillic Preprocessing: Unicode normalization, text cleaning
- Content Analysis: Script detection, quality validation
- Encoding Validation: UTF-8 normalization, replacement character detection
- Statistics Tracking: Cyrillic content metrics, processing warnings
- Batch Processing: Optimized for Serbian document collections
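The preprocessing and encoding-validation steps listed above might look roughly like this before text is passed to the embedding model. This is a sketch under assumptions: the function name, the warning text, and the choice to drop replacement characters are illustrative, not the Embedic service's actual behaviour.

```python
import re
import unicodedata

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, produced when bytes could not be decoded


def preprocess_for_embedding(text: str):
    """Return (cleaned_text, warnings) for one piece of text."""
    warnings = []
    if REPLACEMENT_CHAR in text:
        warnings.append("replacement characters found: source encoding may be damaged")
        text = text.replace(REPLACEMENT_CHAR, "")
    # NFC composes base letter + combining mark into a single code point (c + caron -> č).
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace left over from PDF extraction.
    text = re.sub(r"\s+", " ", text).strip()
    return text, warnings


cleaned, notes = preprocess_for_embedding("c\u030cas   je\ufffd")
print(cleaned, notes)
```

The cleaned text could then be encoded with the `sentence-transformers` library, for example `SentenceTransformer("djovak/embedic-large").encode(cleaned)`, which requires downloading the model.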
- Technology: Ollama + Gemma3 Model
- Port: 11434
- Serbian Enhancements:
- Model Configuration: Gemma3 with Serbian system prompts
- Temperature Optimization: Lower values (0.1) for factual Serbian responses
- Token Limits: Increased to 2000 for comprehensive Serbian answers
- Serbian Context: Specialized instructions for Serbian document analysis
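For reference, the settings above map onto Ollama's `/api/generate` request roughly as shown below. The backend is described as calling the model through Spring AI, so this direct request is only an illustration. The system prompt text, the prompt layout, and the `gemma3` model tag are assumptions.

```python
import json

# Hypothetical system prompt; the project's actual Serbian prompt is not shown in this README.
SYSTEM_PROMPT = (
    "You answer questions about Serbian documents. "
    "Reply in Serbian, in the same script as the question, using only the given context."
)


def build_ollama_request(question: str, context: str, model: str = "gemma3") -> dict:
    """Build the JSON body for POST http://localhost:11434/api/generate."""
    return {
        "model": model,
        "system": SYSTEM_PROMPT,
        "prompt": f"Context:\n{context}\n\nQuestion: {question}",
        "stream": False,
        "options": {
            "temperature": 0.1,   # low temperature for factual answers
            "num_predict": 2000,  # upper bound on generated tokens
        },
    }


payload = build_ollama_request("Које су главне теме у документу?", "…")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```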
- Technology: PostgreSQL 16 + PgVector Extension
- Port: 5435
- Serbian Features:
- 1024-dim Vectors: Optimized for djovak/embedic-large
- Script Metadata: Cyrillic ratio, script type tracking
- Serbian Statistics: Processing metrics, content analysis
- Enhanced Indexing: HNSW optimized for Serbian content similarity
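A schema along these lines would match the features listed above. The table and column names are hypothetical; the `vector(1024)` type and the `hnsw` index with `vector_cosine_ops` are standard pgvector syntax. The statements are generated from Python only to keep the examples in one language.

```python
def schema_statements(table: str = "serbian_chunks", dim: int = 1024):
    """Return SQL statements for a pgvector table with script metadata and an HNSW index.

    The table and column names are illustrative, not the project's actual schema.
    """
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        (
            f"CREATE TABLE IF NOT EXISTS {table} ("
            "id BIGSERIAL PRIMARY KEY, "
            "content TEXT NOT NULL, "
            "script_type TEXT, "
            "cyrillic_ratio REAL, "
            f"embedding vector({dim}));"
        ),
        (
            f"CREATE INDEX IF NOT EXISTS {table}_embedding_hnsw "
            f"ON {table} USING hnsw (embedding vector_cosine_ops);"
        ),
    ]


for statement in schema_statements():
    print(statement)
```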
- Docker & Docker Compose
- Serbian PDF documents (Cyrillic or Latin script)
- Node.js 18+ (for frontend development)
- Java 17+ (for backend development)
- Python 3.8+ (for service development)
# Start the entire Serbian-optimized stack
docker-compose up -d
# Verify all services with Serbian support
docker-compose ps
# Check Serbian NLP service
curl http://localhost:5001/health
# Verify Serbian embedding service
curl http://localhost:5003/health

# Ingest the Serbian PDF for processing
curl -X GET http://localhost:8088/api/ingest-pdf-local
# The system will:
# 1. Detect Cyrillic content
# 2. Apply Serbian preprocessing
# 3. Perform CLASSLA NLP analysis
# 4. Create optimized chunks
# 5. Generate Serbian-optimized embeddings

- Script Detection: Automatic Cyrillic/Latin identification
- Morphological Analysis: Complex Serbian grammar processing
- Entity Recognition: Serbian person/location/organization detection
- Transliteration: Seamless Cyrillic ↔ Latin conversion
- Normalization: Unicode and punctuation standardization
- Script-Aware Search: Query script detection and optimization
- Adaptive Thresholds: Different relevance scores for script types
- Context Optimization: Serbian morphology-aware chunking
- Metadata Enhancement: Script statistics and content analysis
- Cyrillic Support: Full Serbian alphabet input and display
- Mixed Script Handling: Seamless Latin/Cyrillic processing
- Language Preservation: Response maintains query language
- Serbian Typography: Optimized text rendering
- Quality Validation: Text encoding and content verification
- Preprocessing Optimization: Serbian-specific text cleaning
- Enhanced Embeddings: Cyrillic-optimized vector generation
- Performance Monitoring: Serbian processing metrics
Optimized for the Serbian language and the Cyrillic script 🇷🇸
