🤖 Spring AI Examples - Intelligent Serbian Document Query System

A microservices-based application for AI-powered querying of Serbian documents. It combines Serbian NLP processing (CLASSLA), vector embeddings, and retrieval-augmented generation (RAG), and is optimized for Serbian Cyrillic text through script-aware preprocessing.

🇷🇸 Serbian Cyrillic Optimization

This system is specifically designed and optimized for Serbian language content, with comprehensive support for:

📝 Cyrillic Script Processing: Full Serbian Cyrillic alphabet support (а-ш, including ђ, ј, љ, њ, ћ, џ), plus the Serbian Latin diacritics (ž, š, đ, č, ć)
🔤 Script Detection: Automatic detection and optimization for Cyrillic, Latin, and mixed script content
🧠 Advanced Serbian NLP: CLASSLA-powered morphological analysis, named entity recognition, and syntactic parsing
🎯 Content-Aware Chunking: Dynamic text segmentation optimized for Serbian morphology and sentence structure
⚡ Optimized Embeddings: Serbian-specific preprocessing pipeline for better vector representations
🔍 Intelligent Retrieval: Script-type-aware similarity search with adaptive thresholds

🌟 Key Serbian Processing Features

| Feature | Description | Benefits |
| --- | --- | --- |
| Script Analysis | Automatic Cyrillic/Latin detection with ratio calculation | Optimal processing for mixed-script documents |
| Text Normalization | Unicode NFC normalization, Serbian punctuation handling | Consistent text representation |
| Morphological Processing | CLASSLA-powered analysis of Serbian grammar | Better understanding of complex morphology |
| Dynamic Chunking | Content-adaptive splitting (20% larger for Cyrillic) | Preserves semantic meaning |
| Query Optimization | Script-aware retrieval parameters | More relevant results for Serbian queries |
| Embedding Enhancement | Cyrillic-optimized preprocessing pipeline | Higher quality vector representations |
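
The script analysis row above can be illustrated with a short sketch. It is not the project's implementation: the function names and the 0.7 / 0.3 thresholds are assumptions chosen for the example, and only the Python standard library is used.

```python
import unicodedata


def cyrillic_ratio(text: str) -> float:
    """Share of alphabetic characters that are Cyrillic (0.0 if there are no letters)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "CYRILLIC" in unicodedata.name(ch, ""))
    return cyrillic / len(letters)


def classify_script(text: str, high: float = 0.7, low: float = 0.3) -> str:
    """Classify text as 'cyrillic', 'latin', or 'mixed'.

    The thresholds are hypothetical; text without letters falls into 'latin'.
    """
    ratio = cyrillic_ratio(text)
    if ratio >= high:
        return "cyrillic"
    if ratio <= low:
        return "latin"
    return "mixed"
```

For example, `classify_script("Београд Beograd")` yields `"mixed"` because exactly half of the letters are Cyrillic.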

🏗️ Architecture Overview

The application consists of an Angular frontend, a Spring Boot backend, and three Python services. Documents pass through Serbian NLP processing, chunking, and embedding before being stored in PostgreSQL with PgVector; questions are then answered through retrieval-augmented generation (RAG) over that store.

📊 Enhanced System Architecture for Serbian Processing

graph TB
    User["User - Serbian Queries"] --> Frontend
    
    Frontend["Angular Frontend - Port 4200"] --> Backend["Spring Boot Backend - Port 8088"]
    
    Backend --> OpenAI["OpenAI GPT-4o-mini"]
    Backend --> Ollama["Ollama + Gemma3 - Port 11434"]
    
    Backend --> CLASSLA["CLASSLA Service - Port 5001"]
    
    CLASSLA --> LlamaIndex["LlamaIndex App - Port 5002"]
    
    LlamaIndex --> Embedic["Embedic Service - Port 5003"]
    
    Backend --> PostgreSQL["PostgreSQL + PgVector - Port 5435"]
    LlamaIndex --> PostgreSQL
    Embedic --> PostgreSQL
    
    Backend -.-> CLASSLA
    CLASSLA -.-> LlamaIndex
    LlamaIndex -.-> Embedic
    Embedic -.-> PostgreSQL
    
    Frontend -.-> Backend
    Backend -.-> LlamaIndex
    LlamaIndex -.-> Backend
    Backend -.-> Frontend
    
    classDef user fill:#e1f5fe,stroke:#01579b
    classDef frontend fill:#f3e5f5,stroke:#4a148c
    classDef backend fill:#e8f5e8,stroke:#1b5e20
    classDef ai fill:#fff3e0,stroke:#e65100
    classDef serbian fill:#ffebee,stroke:#c62828
    classDef database fill:#fce4ec,stroke:#880e4f
    
    class User user
    class Frontend frontend
    class Backend backend
    class OpenAI,Ollama ai
    class CLASSLA,LlamaIndex,Embedic serbian
    class PostgreSQL database

🏗️ Dual Architecture: Two Processing Pipelines

🔹 Basic Pipeline (SpringAIService) - Simple & Fast

graph TB
    Input["PDF Document"] 
    Input --> Extract["PagePdfDocumentReader"]
    Extract --> Split["TokenTextSplitter"]
    Split --> Store["Spring AI VectorStore"]
    Store --> Retrieve["QuestionAnswerAdvisor"]
    Retrieve --> Output["AI Response"]
    
    classDef input fill:#e8f4f8,stroke:#1565c0
    classDef processing fill:#e3f2fd,stroke:#1976d2
    classDef output fill:#f3e5f5,stroke:#7b1fa2
    
    class Input input
    class Extract,Split,Store,Retrieve processing
    class Output output

🔹 Serbian-Optimized Pipeline (PdfAiQueryService) - Advanced & Specialized

graph TB
    PDFIn["Serbian PDF"] --> TikaExt["TikaDocumentReader"]
    TikaExt --> Validate["Serbian Text Validation"]
    Validate --> NLP["CLASSLA NLP Service"]
    NLP --> Chunk["LlamaIndex Service"]
    Chunk --> Embed["Embedic Service"]
    Embed --> DBStore["PostgreSQL + PgVector"]
    
    QueryIn["User Query"] --> Analyze["Script Analysis"]
    Analyze --> Search["Intelligent Retrieval"]
    Search --> Context["Context Processing"]
    Context --> AIGen["Serbian-Optimized AI"]
    AIGen --> ResponseOut["Formatted Response"]
    
    DBStore -.-> Search
    
    classDef serbian fill:#ffebee,stroke:#c62828
    classDef service fill:#e8f5e8,stroke:#2e7d32
    classDef nlp fill:#fff3e0,stroke:#ef6c00
    classDef query fill:#f1f8e9,stroke:#558b2f
    classDef ai fill:#fce4ec,stroke:#ad1457
    
    class PDFIn,TikaExt,Validate serbian
    class NLP,Chunk,Embed,DBStore service
    class QueryIn,Analyze query
    class Search,Context nlp
    class AIGen,ResponseOut ai

🎯 Architecture Comparison Overview

graph LR
    UserChoice["Choose Your Path"] --> BasicFlow["Basic Approach<br/>Simple 5-Step Process"]
    UserChoice --> AdvancedFlow["Serbian-Optimized<br/>Complex 11-Step Process"]
    
    BasicFlow --> BasicBenefits["Quick Setup<br/>Language Agnostic"]
    AdvancedFlow --> AdvancedBenefits["Cyrillic Support<br/>Advanced Serbian NLP"]
    
    classDef choice fill:#fff2cc,stroke:#d6b656
    classDef basic fill:#e1f5fe,stroke:#0277bd
    classDef advanced fill:#ffebee,stroke:#c62828
    classDef benefits fill:#f9f9f9,stroke:#666666
    
    class UserChoice choice
    class BasicFlow basic
    class AdvancedFlow advanced
    class BasicBenefits,AdvancedBenefits benefits

🏗️ Complete Microservices Architecture

graph TB
    UI["Angular Frontend<br/>Port: 4200"] --> API["Spring Boot Backend<br/>Port: 8088"]
    
    API --> BasicService["SpringAIService"]
    API --> AdvancedService["PdfAiQueryService"]
    
    BasicService --> OpenAI["OpenAI GPT-4o-mini"]
    AdvancedService --> Ollama["Ollama + Gemma3<br/>Port: 11434"]
    
    AdvancedService --> CLASSLA["CLASSLA NLP Service<br/>Port: 5001"]
    AdvancedService --> LlamaIdx["LlamaIndex Service<br/>Port: 5002"]
    AdvancedService --> Embedic["Embedic Service<br/>Port: 5003"]
    
    LlamaIdx --> PostgreSQL["PostgreSQL + PgVector<br/>Port: 5435"]
    Embedic --> PostgreSQL
    
    classDef frontend fill:#e3f2fd,stroke:#1976d2
    classDef backend fill:#f3e5f5,stroke:#7b1fa2
    classDef ai fill:#fff3e0,stroke:#ef6c00
    classDef serbian fill:#ffebee,stroke:#c62828
    classDef data fill:#e8f5e8,stroke:#2e7d32
    
    class UI frontend
    class API,BasicService,AdvancedService backend
    class OpenAI,Ollama ai
    class CLASSLA,LlamaIdx,Embedic serbian
    class PostgreSQL data

🔄 Enhanced Serbian Processing Pipeline

Document Ingestion Pipeline (Serbian Optimized)

📄 PDF Upload → Backend extracts text using Apache Tika
🔍 Cyrillic Analysis → Script detection, validation, and preprocessing
🧠 Serbian NLP → CLASSLA performs advanced Serbian language analysis with Cyrillic support
📦 Smart Chunking → LlamaIndex creates content-aware text chunks optimized for Serbian morphology
🔢 Optimized Vectors → Embedic generates high-quality embeddings with Serbian preprocessing
💾 Enhanced Storage → Vectors and Serbian metadata stored in PostgreSQL with PgVector

Query Processing Pipeline (Serbian Optimized)

💬 Serbian Query → Frontend sends question (Cyrillic/Latin/Mixed)
🔍 Script Analysis → Query script detection and optimization
📚 Intelligent Retrieval → Script-aware similarity search with adaptive thresholds
🤖 Serbian AI → Spring AI generates response using Serbian-optimized prompts
📱 Response Display → Formatted answer preserving Serbian language characteristics
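
A minimal sketch of what "script-aware similarity search with adaptive thresholds" could look like once similarity scores are available. The parameter values and the names `RetrievalParams`, `PARAMS_BY_SCRIPT`, and `select_hits` are hypothetical; the README does not state the actual thresholds.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RetrievalParams:
    top_k: int
    min_similarity: float


# Hypothetical values for illustration only.
PARAMS_BY_SCRIPT = {
    "cyrillic": RetrievalParams(top_k=5, min_similarity=0.70),
    "mixed": RetrievalParams(top_k=6, min_similarity=0.65),
    "latin": RetrievalParams(top_k=5, min_similarity=0.75),
}


def select_hits(hits: List[Tuple[str, float]], script_type: str) -> List[Tuple[str, float]]:
    """Keep (chunk, similarity) pairs that pass the threshold for the query's script type."""
    params = PARAMS_BY_SCRIPT.get(script_type, PARAMS_BY_SCRIPT["latin"])
    kept = [hit for hit in hits if hit[1] >= params.min_similarity]
    kept.sort(key=lambda hit: hit[1], reverse=True)  # best match first
    return kept[: params.top_k]
```

Keeping the parameters in one table makes it easy to tune each script type separately without touching the retrieval code.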

🔄 Two Processing Approaches: Basic vs Serbian-Optimized

This system provides two distinct approaches for document processing and querying, allowing you to choose based on your specific needs:

📋 Approach Comparison

| Feature | SpringAIService (Basic) | PdfAiQueryService (Serbian-Optimized) |
| --- | --- | --- |
| Complexity | Simple, 83 lines of code | Advanced, 417 lines with specialized processing |
| Target Language | Language-agnostic | Serbian Cyrillic optimized |
| PDF Processing | Standard PagePdfDocumentReader | TikaDocumentReader + Cyrillic validation |
| Text Splitting | Basic TokenTextSplitter | Script-aware dynamic chunking via microservices |
| NLP Processing | None | Advanced Serbian NLP (CLASSLA service) |
| Text Analysis | Basic | Cyrillic content analysis, script detection |
| Preprocessing | None | Unicode normalization, Serbian punctuation handling |
| Embeddings | Standard Spring AI | Serbian-optimized embedic-large (1024-dim) |
| Context Optimization | Basic | Intelligent truncation preserving sentence boundaries |
| Microservices | None | 3 specialized services (CLASSLA, LlamaIndex, Embedic) |
| Metadata Tracking | Basic | Enhanced with Cyrillic analysis and processing stats |
| Error Handling | Basic | Comprehensive validation and quality checks |
| Performance Monitoring | None | Detailed logging with Serbian processing metrics |
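
The "intelligent truncation preserving sentence boundaries" entry can be sketched as follows. This is an illustrative approach under simple assumptions (sentences end with `.`, `!`, `?`, or `…` followed by whitespace), not the code in PdfAiQueryService; the function name is hypothetical.

```python
import re

# Split after sentence-ending punctuation that is followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def truncate_at_sentence(context: str, max_chars: int) -> str:
    """Cut context to at most max_chars, ending on a sentence boundary when possible."""
    if len(context) <= max_chars:
        return context
    kept, used = [], 0
    for sentence in _SENTENCE_END.split(context):
        extra = len(sentence) + (1 if kept else 0)  # +1 for the joining space
        if used + extra > max_chars:
            break
        kept.append(sentence)
        used += extra
    # Fall back to a hard cut when even the first sentence does not fit.
    return " ".join(kept) if kept else context[:max_chars]
```

Whole sentences are kept in order until the budget is exhausted, so the model never sees a context that stops mid-sentence unless a single sentence alone exceeds the limit.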

🎯 When to Use Each Approach

🔹 Use SpringAIService (Basic) when:

- You are processing non-Serbian documents or general multilingual content
- You need a quick setup with minimal configuration
- You are working with standard Latin-script documents
- You need simple, straightforward document Q&A
- You want lightweight processing without the additional Python microservices
- You are prototyping or building basic document search

🔹 Use PdfAiQueryService (Serbian-Optimized) when:

- You are processing Serbian Cyrillic documents specifically
- You need advanced language processing with morphological analysis
- You need script detection and mixed Cyrillic/Latin handling
- You want embeddings optimized for Serbian content
- You need enhanced context retrieval with adaptive thresholds
- You need comprehensive text validation and quality assurance
- You are working with complex Serbian documents that require specialized NLP

🚀 Available API Endpoints

The system currently provides the following endpoints in ChatController:

📁 Document Ingestion Endpoints

Basic Approach (SpringAIService)

# Ingest using standard Spring AI processing
GET /api/ingest-pdf-openai

Serbian-Optimized Approach (PdfAiQueryService)

# Ingest using advanced Serbian Cyrillic processing
GET /api/ingest-pdf-local

💬 Query Endpoints

Current Implementation

# Query using basic SpringAIService approach
POST /api/chat
{
  "question": "Your question here"
}

🔧 Example Usage

# 1. Ingest Serbian document with Cyrillic optimization
curl -X GET http://localhost:8088/api/ingest-pdf-local

# 2. Query the processed document
curl -X POST http://localhost:8088/api/chat \
  -H "Content-Type: application/json" \
  -d '{"question": "Које су главне теме у документу?"}'

# 3. Alternative: Ingest with basic processing
curl -X GET http://localhost:8088/api/ingest-pdf-openai
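
The same query can be scripted from Python. The sketch below uses only the standard library and the `/api/chat` endpoint and port shown above; the helper names are hypothetical, and because the response format is not documented here, the raw response body is returned as text.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8088"  # backend port from this README


def build_chat_request(question: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /api/chat request without sending it."""
    # ensure_ascii=False keeps Cyrillic readable in the payload; it is sent as UTF-8.
    body = json.dumps({"question": question}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def ask(question: str) -> str:
    """Send the question and return the raw response body (requires a running backend)."""
    with urllib.request.urlopen(build_chat_request(question), timeout=60) as resp:
        return resp.read().decode("utf-8")
```

With the stack running, `print(ask("Које су главне теме у документу?"))` would print whatever the backend returns.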

💡 Note: Currently, the /api/chat endpoint uses the basic SpringAIService approach for queries. To fully utilize the Serbian-optimized processing pipeline, you could extend the ChatController with an additional endpoint:

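// Hypothetical addition to ChatController; assumes pdfAiQueryService is an injected PdfAiQueryService.
@PostMapping("/serbian-chat")
public ResponseEntity<Map<String, String>> serbianChat(@RequestBody Map<String, String> payload) {
    String question = payload.get("question");
    if (question == null || question.isBlank()) {
        // Reject empty requests before calling the Serbian pipeline
        return ResponseEntity.badRequest().body(Map.of("error", "question is required"));
    }
    String answer = pdfAiQueryService.query(question);  // Use Serbian-optimized query
    return ResponseEntity.ok(Map.of("answer", answer));
}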

🛠️ Enhanced Technology Stack

| Layer | Technology | Serbian Optimization |
| --- | --- | --- |
| Frontend | Angular 17, TypeScript, SCSS | Cyrillic character support, Serbian UI |
| Backend | Spring Boot 3, Spring AI, Java 17 | Two processing approaches: Basic + Serbian-optimized |
| AI Models | OpenAI GPT-4o-mini, Ollama Gemma3 | Serbian system prompts, multilingual support |
| Serbian NLP | CLASSLA, Stanza, Python | Native Serbian processing, Cyrillic analysis |
| Vector Search | LlamaIndex, PgVector | Script-aware chunking, adaptive retrieval |
| Serbian Embeddings | djovak/embedic-large (1024-dim) | Serbian-optimized preprocessing |
| Database | PostgreSQL 16, PgVector | Serbian metadata, script statistics |
| Infrastructure | Docker, Docker Compose | Cyrillic locale support |

📦 Enhanced Service Modules

🌟 Frontend (`/frontend`) - Serbian UI Support

Technology: Angular 17 + TypeScript + SCSS
Port: 4200
Serbian Features:
- Full Cyrillic character input support
- Serbian language interface elements
- Mixed script query handling
- Cyrillic-aware text rendering
- Serbian typography optimization

🚀 Backend (`/backend`) - Serbian Processing Core

Technology: Spring Boot 3 + Spring AI + Apache Tika
Port: 8088
Serbian Optimizations:
- Cyrillic Text Validation: Unicode normalization, encoding verification
- Script Analysis: Automatic Cyrillic/Latin/Mixed detection
- Text Preprocessing: Serbian punctuation normalization
- Enhanced Prompts: Serbian-specific AI instruction templates
- Quality Assurance: Text validation, character encoding checks
- Performance Monitoring: Serbian processing statistics

🧠 CLASSLA Service (`/stanza_classla_service`) - Serbian NLP Excellence

Technology: Python + Flask + CLASSLA NLP Library
Port: 5001
Serbian Capabilities:
- Script Detection: Cyrillic ratio calculation and classification
- Text Normalization: Unicode NFC, Serbian punctuation handling
- Morphological Analysis: Serbian-specific grammatical features
- Named Entity Recognition: Serbian entity detection with transliteration
- Tokenization: Cyrillic-aware word segmentation
- Lemmatization: Serbian word root extraction
- Transliteration: Cyrillic ↔ Latin conversion support

📚 LlamaIndex App (`/llamaindex_app`) - Intelligent Serbian Chunking

Technology: Python + Flask + LlamaIndex + PostgreSQL
Port: 5002
Serbian Features:
- Dynamic Chunking: Content-adaptive splitting based on script type
  - Cyrillic Text: 20% larger chunks (4800 chars) with enhanced overlap
  - Mixed Script: 10% larger chunks (4400 chars) with balanced overlap
  - Latin Text: Standard chunking with optimized boundaries
- Script-Aware Boundaries: Serbian sentence pattern recognition
- Metadata Enhancement: Script type and ratio tracking
- Query Optimization: Script-type-aware retrieval parameters
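
The chunk sizes above imply a base size of 4000 characters (4800 is 20% more, 4400 is 10% more). The sketch below applies that rule with a simple sentence-based splitter. The base size is inferred from those figures, and the splitter itself is an assumption: it does not use LlamaIndex and it omits overlap, whose values are not given here.

```python
import re
from typing import List, Optional

BASE_CHUNK_CHARS = 4000  # inferred from the 4800 / 4400 figures above
SIZE_BONUS_PCT = {"cyrillic": 20, "mixed": 10, "latin": 0}


def chunk_size_for(script_type: str) -> int:
    """Chunk size in characters for a script type (unknown types get the base size)."""
    return BASE_CHUNK_CHARS * (100 + SIZE_BONUS_PCT.get(script_type, 0)) // 100


def chunk_text(text: str, script_type: str, max_chars: Optional[int] = None) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than the limit becomes its own oversized chunk.
    """
    limit = max_chars or chunk_size_for(script_type)
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The `max_chars` override exists only to make the behaviour easy to try on short texts.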

🔢 Embedic Large Service (`/embedic_large_service`) - Serbian Embedding Excellence

Technology: Python + Flask + SentenceTransformers
Port: 5003
Serbian Optimizations:
- Model: djovak/embedic-large (1024-dimensional, Serbian-optimized)
- Cyrillic Preprocessing: Unicode normalization, text cleaning
- Content Analysis: Script detection, quality validation
- Encoding Validation: UTF-8 normalization, replacement character detection
- Statistics Tracking: Cyrillic content metrics, processing warnings
- Batch Processing: Optimized for Serbian document collections
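
The encoding validation step can be sketched with the standard library alone. The function below is a hypothetical illustration of UTF-8 decoding, NFC normalization, and replacement-character detection; the returned dictionary layout is an assumption, not the service's response format.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"  # inserted by the decoder for bytes that are not valid UTF-8


def validate_text(raw: bytes) -> dict:
    """Decode bytes as UTF-8, normalize to NFC, and report signs of encoding damage."""
    text = unicodedata.normalize("NFC", raw.decode("utf-8", errors="replace"))
    warnings = []
    bad = text.count(REPLACEMENT_CHAR)
    if bad:
        warnings.append(f"{bad} replacement character(s) found; source may not be UTF-8")
    if not text.strip():
        warnings.append("text is empty after decoding")
    return {"text": text, "replacement_chars": bad, "warnings": warnings}
```

A typical failure case is a document saved as Windows-1251: decoding it as UTF-8 produces replacement characters, which this check surfaces before any embedding is computed.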

🦙 Ollama Service - Serbian AI Configuration

Technology: Ollama + Gemma3 Model
Port: 11434
Serbian Enhancements:
- Model Configuration: Gemma3 with Serbian system prompts
- Temperature Optimization: Lower values (0.1) for factual Serbian responses
- Token Limits: Increased to 2000 for comprehensive Serbian answers
- Serbian Context: Specialized instructions for Serbian document analysis
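
For reference, the settings above map onto a raw Ollama `/api/generate` payload roughly as shown below. In this project the backend talks to Ollama through Spring AI, so this sketch is only an equivalent illustration: the `gemma3` model tag, the system prompt wording, the prompt layout, and the mapping of the 2000-token limit to `num_predict` are assumptions.

```python
import json


def build_ollama_request(question: str, context: str, model: str = "gemma3") -> dict:
    """Build a request body for Ollama's /api/generate endpoint (not sent here)."""
    system = (
        "You answer questions about Serbian documents. "
        "Use only the provided context and reply in the language and script of the question."
    )  # hypothetical system prompt
    return {
        "model": model,
        "system": system,
        "prompt": f"Context:\n{context}\n\nQuestion: {question}",
        "stream": False,
        "options": {
            "temperature": 0.1,   # low temperature for factual answers
            "num_predict": 2000,  # upper bound on generated tokens
        },
    }


payload = build_ollama_request("Које су главне теме?", "Текст документа.")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

The payload would be POSTed to `http://localhost:11434/api/generate` on the port listed above.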

🐘 PostgreSQL + PgVector - Serbian Vector Storage

Technology: PostgreSQL 16 + PgVector Extension
Port: 5435
Serbian Features:
- 1024-dim Vectors: Optimized for djovak/embedic-large
- Script Metadata: Cyrillic ratio, script type tracking
- Serbian Statistics: Processing metrics, content analysis
- Enhanced Indexing: HNSW optimized for Serbian content similarity
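
A possible PgVector schema matching these features is sketched below as SQL strings generated from Python. The table name, column names, and the choice of cosine distance (`vector_cosine_ops`, `<=>`) are assumptions for illustration; the actual schema is defined by the services and is not shown in this README.

```python
from typing import List

EMBEDDING_DIM = 1024  # dimension of djovak/embedic-large vectors


def schema_statements(table: str = "serbian_chunks") -> List[str]:
    """SQL statements for a hypothetical chunk table with script metadata and an HNSW index."""
    return [
        "CREATE EXTENSION IF NOT EXISTS vector",
        (
            f"CREATE TABLE IF NOT EXISTS {table} ("
            "id BIGSERIAL PRIMARY KEY, "
            "content TEXT NOT NULL, "
            "script_type TEXT, "
            "cyrillic_ratio REAL, "
            f"embedding vector({EMBEDDING_DIM}))"
        ),
        (
            f"CREATE INDEX IF NOT EXISTS {table}_embedding_hnsw "
            f"ON {table} USING hnsw (embedding vector_cosine_ops)"
        ),
    ]


def nearest_chunks_sql(table: str = "serbian_chunks") -> str:
    """Parameterized query: the closest chunks to a query vector by cosine distance."""
    return f"SELECT content FROM {table} ORDER BY embedding <=> %s::vector LIMIT %s"
```

The statements can be executed with any PostgreSQL client against the database on port 5435.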

🚀 Enhanced Quick Start for Serbian Content

Prerequisites

Docker & Docker Compose
Serbian PDF documents (Cyrillic or Latin script)
Node.js 18+ (for frontend development)
Java 17+ (for backend development)
Python 3.8+ (for service development)

🇷🇸 Launch Serbian Processing System

# Start the entire Serbian-optimized stack
docker-compose up -d

# Verify all services with Serbian support
docker-compose ps

# Check Serbian NLP service
curl http://localhost:5001/health

# Verify Serbian embedding service
curl http://localhost:5003/health
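
The two health checks can also be run from a small script, for example while waiting for containers to start. This is a sketch using only the standard library; it covers just the two `/health` URLs shown above, and the function names are hypothetical.

```python
import urllib.request
from typing import Callable, Dict

HEALTH_URLS = {
    "classla": "http://localhost:5001/health",
    "embedic": "http://localhost:5003/health",
}


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any network or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError, and timeouts are all OSError subclasses
        return False


def check_services(probe: Callable[[str], bool] = http_ok) -> Dict[str, bool]:
    """Probe every known health endpoint and report its status by service name."""
    return {name: probe(url) for name, url in HEALTH_URLS.items()}
```

Calling `check_services()` returns a dictionary such as `{"classla": True, "embedic": False}` depending on which services are reachable.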

📄 Ingest Serbian PDF

# Trigger ingestion of the Serbian PDF with the Serbian-optimized pipeline
curl -X GET http://localhost:8088/api/ingest-pdf-local

# The system will:
# 1. Detect Cyrillic content
# 2. Apply Serbian preprocessing
# 3. Perform CLASSLA NLP analysis
# 4. Create optimized chunks
# 5. Generate Serbian-optimized embeddings

📊 Serbian Processing Features

🎯 Advanced Serbian NLP

Script Detection: Automatic Cyrillic/Latin identification
Morphological Analysis: Complex Serbian grammar processing
Entity Recognition: Serbian person/location/organization detection
Transliteration: Seamless Cyrillic ↔ Latin conversion
Normalization: Unicode and punctuation standardization

🔍 Intelligent Serbian Retrieval

Script-Aware Search: Query script detection and optimization
Adaptive Thresholds: Different relevance scores for script types
Context Optimization: Serbian morphology-aware chunking
Metadata Enhancement: Script statistics and content analysis

💬 Serbian Chat Interface

Cyrillic Support: Full Serbian alphabet input and display
Mixed Script Handling: Seamless Latin/Cyrillic processing
Language Preservation: Response maintains query language
Serbian Typography: Optimized text rendering

🔄 Serbian Processing Pipeline

Quality Validation: Text encoding and content verification
Preprocessing Optimization: Serbian-specific text cleaning
Enhanced Embeddings: Cyrillic-optimized vector generation
Performance Monitoring: Serbian processing metrics

Оптимизован за српски језик и ћириличко писмо (Optimized for the Serbian language and the Cyrillic script) 🇷🇸

vukmanovicmilos/spring-ai-examples
