⚙️ Advanced Spring AI RAG system with Serbian/Cyrillic optimization, full-stack chat interface, pgvector, LlamaIndex, and self-hosted LLMs — Microservices + Docker-ready

vukmanovicmilos/spring-ai-examples

🤖 Spring AI Examples - Intelligent Serbian Document Query System

Serbian AI Chatbot Example

A microservices-based application that enables AI-powered querying of Serbian documents using NLP processing, vector embeddings, and retrieval-augmented generation (RAG). It is optimized for Serbian Cyrillic text through CLASSLA-based language analysis, the djovak/embedic-large embedding model, and script-aware preprocessing.

🇷🇸 Serbian Cyrillic Optimization

This system is specifically designed and optimized for Serbian language content, with comprehensive support for:

  • 📝 Cyrillic Script Processing: Full Serbian Cyrillic alphabet support (а-ш, including ђ and ћ), alongside the Serbian Latin diacritics (ž, š, đ, č, ć)
  • 🔤 Script Detection: Automatic detection and optimization for Cyrillic, Latin, and mixed script content
  • 🧠 Advanced Serbian NLP: CLASSLA-powered morphological analysis, named entity recognition, and syntactic parsing
  • 🎯 Content-Aware Chunking: Dynamic text segmentation optimized for Serbian morphology and sentence structure
  • ⚡ Optimized Embeddings: Serbian-specific preprocessing pipeline for better vector representations
  • 🔍 Intelligent Retrieval: Script-type-aware similarity search with adaptive thresholds

🌟 Key Serbian Processing Features

| Feature | Description | Benefits |
|---|---|---|
| Script Analysis | Automatic Cyrillic/Latin detection with ratio calculation | Optimal processing for mixed-script documents |
| Text Normalization | Unicode NFC normalization, Serbian punctuation handling | Consistent text representation |
| Morphological Processing | CLASSLA-powered analysis of Serbian grammar | Better understanding of complex morphology |
| Dynamic Chunking | Content-adaptive splitting (20% larger for Cyrillic) | Preserves semantic meaning |
| Query Optimization | Script-aware retrieval parameters | More relevant results for Serbian queries |
| Embedding Enhancement | Cyrillic-optimized preprocessing pipeline | Higher quality vector representations |
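
The "ratio calculation" behind script analysis can be illustrated with a minimal Python sketch. The function names and the 0.8 / 0.2 cut-offs are assumptions chosen for illustration; the values used by the project's services are not stated in this README.

```python
def cyrillic_ratio(text: str) -> float:
    """Share of alphabetic characters that fall in the basic Cyrillic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04FF")
    return cyrillic / len(letters)


def classify_script(text: str, high: float = 0.8, low: float = 0.2) -> str:
    """Label text as 'cyrillic', 'latin', or 'mixed'.

    The thresholds are assumed values, not the project's configuration.
    """
    ratio = cyrillic_ratio(text)
    if ratio >= high:
        return "cyrillic"
    if ratio <= low:
        return "latin"
    return "mixed"
```

Digits, punctuation, and whitespace are ignored, so a query such as "Које су главне теме?" is classified from its letters alone.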

🏗️ Architecture Overview

The system consists of an Angular frontend, a Spring Boot backend, three Python services (CLASSLA, LlamaIndex, Embedic), Ollama, and PostgreSQL with PgVector. Together they combine Serbian NLP processing, vector embeddings, and retrieval-augmented generation (RAG) to provide document search and question answering for Serbian Cyrillic content.

📊 Enhanced System Architecture for Serbian Processing

graph TB
    User["User - Serbian Queries"] --> Frontend
    
    Frontend["Angular Frontend - Port 4200"] --> Backend["Spring Boot Backend - Port 8088"]
    
    Backend --> OpenAI["OpenAI GPT-4o-mini"]
    Backend --> Ollama["Ollama + Gemma3 - Port 11434"]
    
    Backend --> CLASSLA["CLASSLA Service - Port 5001"]
    
    CLASSLA --> LlamaIndex["LlamaIndex App - Port 5002"]
    
    LlamaIndex --> Embedic["Embedic Service - Port 5003"]
    
    Backend --> PostgreSQL["PostgreSQL + PgVector - Port 5435"]
    LlamaIndex --> PostgreSQL
    Embedic --> PostgreSQL
    
    Backend -.-> CLASSLA
    CLASSLA -.-> LlamaIndex
    LlamaIndex -.-> Embedic
    Embedic -.-> PostgreSQL
    
    Frontend -.-> Backend
    Backend -.-> LlamaIndex
    LlamaIndex -.-> Backend
    Backend -.-> Frontend
    
    classDef user fill:#e1f5fe,stroke:#01579b
    classDef frontend fill:#f3e5f5,stroke:#4a148c
    classDef backend fill:#e8f5e8,stroke:#1b5e20
    classDef ai fill:#fff3e0,stroke:#e65100
    classDef serbian fill:#ffebee,stroke:#c62828
    classDef database fill:#fce4ec,stroke:#880e4f
    
    class User user
    class Frontend frontend
    class Backend backend
    class OpenAI,Ollama ai
    class CLASSLA,LlamaIndex,Embedic serbian
    class PostgreSQL database

🏗️ Dual Architecture: Two Processing Pipelines

🔹 Basic Pipeline (SpringAIService) - Simple & Fast

graph TB
    Input["PDF Document"] 
    Input --> Extract["PagePdfDocumentReader"]
    Extract --> Split["TokenTextSplitter"]
    Split --> Store["Spring AI VectorStore"]
    Store --> Retrieve["QuestionAnswerAdvisor"]
    Retrieve --> Output["AI Response"]
    
    classDef input fill:#e8f4f8,stroke:#1565c0
    classDef processing fill:#e3f2fd,stroke:#1976d2
    classDef output fill:#f3e5f5,stroke:#7b1fa2
    
    class Input input
    class Extract,Split,Store,Retrieve processing
    class Output output

🔹 Serbian-Optimized Pipeline (PdfAiQueryService) - Advanced & Specialized

graph TB
    PDFIn["Serbian PDF"] --> TikaExt["TikaDocumentReader"]
    TikaExt --> Validate["Serbian Text Validation"]
    Validate --> NLP["CLASSLA NLP Service"]
    NLP --> Chunk["LlamaIndex Service"]
    Chunk --> Embed["Embedic Service"]
    Embed --> DBStore["PostgreSQL + PgVector"]
    
    QueryIn["User Query"] --> Analyze["Script Analysis"]
    Analyze --> Search["Intelligent Retrieval"]
    Search --> Context["Context Processing"]
    Context --> AIGen["Serbian-Optimized AI"]
    AIGen --> ResponseOut["Formatted Response"]
    
    DBStore -.-> Search
    
    classDef serbian fill:#ffebee,stroke:#c62828
    classDef service fill:#e8f5e8,stroke:#2e7d32
    classDef nlp fill:#fff3e0,stroke:#ef6c00
    classDef query fill:#f1f8e9,stroke:#558b2f
    classDef ai fill:#fce4ec,stroke:#ad1457
    
    class PDFIn,TikaExt,Validate serbian
    class NLP,Chunk,Embed,DBStore service
    class QueryIn,Analyze query
    class Search,Context nlp
    class AIGen,ResponseOut ai

🎯 Architecture Comparison Overview

graph LR
    UserChoice["Choose Your Path"] --> BasicFlow["Basic Approach<br/>Simple 5-Step Process"]
    UserChoice --> AdvancedFlow["Serbian-Optimized<br/>Complex 11-Step Process"]
    
    BasicFlow --> BasicBenefits["Quick Setup<br/>Language Agnostic"]
    AdvancedFlow --> AdvancedBenefits["Cyrillic Support<br/>Advanced Serbian NLP"]
    
    classDef choice fill:#fff2cc,stroke:#d6b656
    classDef basic fill:#e1f5fe,stroke:#0277bd
    classDef advanced fill:#ffebee,stroke:#c62828
    classDef benefits fill:#f9f9f9,stroke:#666666
    
    class UserChoice choice
    class BasicFlow basic
    class AdvancedFlow advanced
    class BasicBenefits,AdvancedBenefits benefits

🏗️ Complete Microservices Architecture

graph TB
    UI["Angular Frontend<br/>Port: 4200"] --> API["Spring Boot Backend<br/>Port: 8088"]
    
    API --> BasicService["SpringAIService"]
    API --> AdvancedService["PdfAiQueryService"]
    
    BasicService --> OpenAI["OpenAI GPT-4o-mini"]
    AdvancedService --> Ollama["Ollama + Gemma3<br/>Port: 11434"]
    
    AdvancedService --> CLASSLA["CLASSLA NLP Service<br/>Port: 5001"]
    AdvancedService --> LlamaIdx["LlamaIndex Service<br/>Port: 5002"]
    AdvancedService --> Embedic["Embedic Service<br/>Port: 5003"]
    
    LlamaIdx --> PostgreSQL["PostgreSQL + PgVector<br/>Port: 5435"]
    Embedic --> PostgreSQL
    
    classDef frontend fill:#e3f2fd,stroke:#1976d2
    classDef backend fill:#f3e5f5,stroke:#7b1fa2
    classDef ai fill:#fff3e0,stroke:#ef6c00
    classDef serbian fill:#ffebee,stroke:#c62828
    classDef data fill:#e8f5e8,stroke:#2e7d32
    
    class UI frontend
    class API,BasicService,AdvancedService backend
    class OpenAI,Ollama ai
    class CLASSLA,LlamaIdx,Embedic serbian
    class PostgreSQL data

🔄 Enhanced Serbian Processing Pipeline

Document Ingestion Pipeline (Serbian Optimized)

  1. 📄 PDF Upload → Backend extracts text using Apache Tika
  2. 🔍 Cyrillic Analysis → Script detection, validation, and preprocessing
  3. 🧠 Serbian NLP → CLASSLA performs advanced Serbian language analysis with Cyrillic support
  4. 📦 Smart Chunking → LlamaIndex creates content-aware text chunks optimized for Serbian morphology
  5. 🔢 Optimized Vectors → Embedic generates high-quality embeddings with Serbian preprocessing
  6. 💾 Enhanced Storage → Vectors and Serbian metadata stored in PostgreSQL with PgVector

Query Processing Pipeline (Serbian Optimized)

  1. 💬 Serbian Query → Frontend sends question (Cyrillic/Latin/Mixed)
  2. 🔍 Script Analysis → Query script detection and optimization
  3. 📚 Intelligent Retrieval → Script-aware similarity search with adaptive thresholds
  4. 🤖 Serbian AI → Spring AI generates response using Serbian-optimized prompts
  5. 📱 Response Display → Formatted answer preserving Serbian language characteristics
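
Step 3 of the query pipeline (script-aware retrieval with adaptive thresholds) might look roughly like the sketch below. The parameter table is entirely hypothetical: this README does not give the actual `top_k` or score cut-offs, so the numbers only show the mechanism.

```python
from typing import Dict, List, Tuple

# Hypothetical per-script settings; the project's real values are not documented here.
RETRIEVAL_PARAMS: Dict[str, Dict[str, float]] = {
    "cyrillic": {"top_k": 5, "min_score": 0.70},
    "mixed": {"top_k": 6, "min_score": 0.65},
    "latin": {"top_k": 5, "min_score": 0.75},
}


def select_hits(
    script_type: str, scored_chunks: List[Tuple[str, float]]
) -> List[Tuple[str, float]]:
    """Keep chunks whose similarity clears the threshold for the query's script."""
    params = RETRIEVAL_PARAMS.get(script_type, RETRIEVAL_PARAMS["mixed"])
    kept = [(chunk, score) for chunk, score in scored_chunks
            if score >= params["min_score"]]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return kept[: int(params["top_k"])]
```

The same list of scored chunks can therefore yield different contexts depending on whether the query was written in Cyrillic, Latin, or mixed script.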

🔄 Two Processing Approaches: Basic vs Serbian-Optimized

This system provides two distinct approaches for document processing and querying, allowing you to choose based on your specific needs:

📋 Approach Comparison

| Feature | SpringAIService (Basic) | PdfAiQueryService (Serbian-Optimized) |
|---|---|---|
| Complexity | Simple, 83 lines of code | Advanced, 417 lines with specialized processing |
| Target Language | Language-agnostic | Serbian Cyrillic optimized |
| PDF Processing | Standard PagePdfDocumentReader | TikaDocumentReader + Cyrillic validation |
| Text Splitting | Basic TokenTextSplitter | Script-aware dynamic chunking via microservices |
| NLP Processing | None | Advanced Serbian NLP (CLASSLA service) |
| Text Analysis | Basic | Cyrillic content analysis, script detection |
| Preprocessing | None | Unicode normalization, Serbian punctuation handling |
| Embeddings | Standard Spring AI | Serbian-optimized embedic-large (1024-dim) |
| Context Optimization | Basic | Intelligent truncation preserving sentence boundaries |
| Microservices | None | 3 specialized services (CLASSLA, LlamaIndex, Embedic) |
| Metadata Tracking | Basic | Enhanced with Cyrillic analysis and processing stats |
| Error Handling | Basic | Comprehensive validation and quality checks |
| Performance Monitoring | None | Detailed logging with Serbian processing metrics |
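
The "intelligent truncation preserving sentence boundaries" row can be sketched as follows. This is an illustrative Python version under assumed rules (a sentence ends at `.`, `!`, `?`, or `…` followed by whitespace or end of text); the backend's actual Java logic is not shown in this README.

```python
import re

# A sentence end is terminal punctuation followed by whitespace or end of text.
_SENTENCE_END = re.compile(r"[.!?…](?=\s|$)")


def truncate_at_sentence(text: str, max_chars: int) -> str:
    """Cut text to at most max_chars, ending on a full sentence when possible."""
    if len(text) <= max_chars:
        return text
    last_end = 0
    for match in _SENTENCE_END.finditer(text):
        if match.end() > max_chars:
            break
        last_end = match.end()
    if last_end:
        return text[:last_end]
    # No complete sentence fits: fall back to the last word boundary.
    cut = text.rfind(" ", 0, max_chars)
    return text[:cut] if cut > 0 else text[:max_chars]
```

Searching the full text rather than the cut window avoids treating a period that happens to sit at the cut point (for example inside a number) as a sentence end.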

🎯 When to Use Each Approach

🔹 Use SpringAIService (Basic) when:

  • Processing non-Serbian documents or general multilingual content
  • Need quick setup with minimal configuration
  • Working with standard Latin-script documents
  • Require simple, straightforward document Q&A
  • Want lightweight processing without external dependencies
  • Prototyping or basic document search functionality

🔹 Use PdfAiQueryService (Serbian-Optimized) when:

  • Processing Serbian Cyrillic documents specifically
  • Need advanced language processing with morphological analysis
  • Require script detection and mixed Cyrillic/Latin handling
  • Want optimized embeddings for Serbian content
  • Need enhanced context retrieval with adaptive thresholds
  • Require comprehensive text validation and quality assurance
  • Working with complex Serbian documents requiring specialized NLP

🚀 Available API Endpoints

The system currently provides the following endpoints in ChatController:

📁 Document Ingestion Endpoints

Basic Approach (SpringAIService)

# Ingest using standard Spring AI processing
GET /api/ingest-pdf-openai

Serbian-Optimized Approach (PdfAiQueryService)

# Ingest using advanced Serbian Cyrillic processing
GET /api/ingest-pdf-local

💬 Query Endpoints

Current Implementation

# Query using basic SpringAIService approach
POST /api/chat
{
  "question": "Your question here"
}

🔧 Example Usage

# 1. Ingest Serbian document with Cyrillic optimization
curl -X GET http://localhost:8088/api/ingest-pdf-local

# 2. Query the processed document
curl -X POST http://localhost:8088/api/chat \
  -H "Content-Type: application/json" \
  -d '{"question": "Које су главне теме у документу?"}'

# 3. Alternative: Ingest with basic processing
curl -X GET http://localhost:8088/api/ingest-pdf-openai
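
For scripted use, the same query can be issued from Python with only the standard library. This is a sketch: the URL, port, and request body come from the examples above, while the helper names are made up here and the response is returned as raw text because its JSON shape is not documented in this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8088"  # backend port as listed in this README


def build_chat_request(question: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /api/chat request; UTF-8 keeps Cyrillic text intact."""
    body = json.dumps({"question": question}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def ask(question: str) -> str:
    """Send the question and return the raw response body as text."""
    with urllib.request.urlopen(build_chat_request(question), timeout=60) as resp:
        return resp.read().decode("utf-8")
```

Calling `ask("Које су главне теме у документу?")` requires the stack to be running; `build_chat_request` can be inspected without a network connection.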

💡 Note: Currently, the /api/chat endpoint uses the basic SpringAIService approach for queries. To fully utilize the Serbian-optimized processing pipeline, you could extend the ChatController with an additional endpoint:

@PostMapping("/serbian-chat")
public ResponseEntity<Map<String, String>> serbianChat(@RequestBody Map<String, String> payload) {
    String question = payload.get("question");
    String answer = pdfAiQueryService.query(question);  // Use Serbian-optimized query
    return ResponseEntity.ok(Map.of("answer", answer));
}

🛠️ Enhanced Technology Stack

| Layer | Technology | Serbian Optimization |
|---|---|---|
| Frontend | Angular 17, TypeScript, SCSS | Cyrillic character support, Serbian UI |
| Backend | Spring Boot 3, Spring AI, Java 17 | Two processing approaches: Basic + Serbian-optimized |
| AI Models | OpenAI GPT-4o-mini, Ollama Gemma3 | Serbian system prompts, multilingual support |
| Serbian NLP | CLASSLA, Stanza, Python | Native Serbian processing, Cyrillic analysis |
| Vector Search | LlamaIndex, PgVector | Script-aware chunking, adaptive retrieval |
| Serbian Embeddings | djovak/embedic-large (1024-dim) | Serbian-optimized preprocessing |
| Database | PostgreSQL 16, PgVector | Serbian metadata, script statistics |
| Infrastructure | Docker, Docker Compose | Cyrillic locale support |

📦 Enhanced Service Modules

🌟 Frontend (/frontend) - Serbian UI Support

  • Technology: Angular 17 + TypeScript + SCSS
  • Port: 4200
  • Serbian Features:
    • Full Cyrillic character input support
    • Serbian language interface elements
    • Mixed script query handling
    • Cyrillic-aware text rendering
    • Serbian typography optimization

🚀 Backend (/backend) - Serbian Processing Core

  • Technology: Spring Boot 3 + Spring AI + Apache Tika
  • Port: 8088
  • Serbian Optimizations:
    • Cyrillic Text Validation: Unicode normalization, encoding verification
    • Script Analysis: Automatic Cyrillic/Latin/Mixed detection
    • Text Preprocessing: Serbian punctuation normalization
    • Enhanced Prompts: Serbian-specific AI instruction templates
    • Quality Assurance: Text validation, character encoding checks
    • Performance Monitoring: Serbian processing statistics
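
The Unicode and punctuation normalization steps can be illustrated in Python (the backend itself is Java, so this is a language-neutral sketch rather than its code). The punctuation mapping is an assumption about which characters are worth unifying; only the NFC step is named explicitly in this README.

```python
import unicodedata

# Assumed mapping: typographic quotes, dashes, and no-break spaces to plain ASCII.
_PUNCT_MAP = str.maketrans({
    "\u201e": '"',  # low opening quote, common in Serbian typography
    "\u201c": '"',
    "\u201d": '"',
    "\u2018": "'",
    "\u2019": "'",
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u00a0": " ",  # no-break space
})


def normalize_serbian(text: str) -> str:
    """Apply NFC composition, unify punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_PUNCT_MAP)
    return " ".join(text.split())
```

NFC matters for Latin-script Serbian in particular: a `c` followed by a combining acute accent becomes the single code point `ć`, so identical words compare and embed identically.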

🧠 CLASSLA Service (/stanza_classla_service) - Serbian NLP Excellence

  • Technology: Python + Flask + CLASSLA NLP Library
  • Port: 5001
  • Serbian Capabilities:
    • Script Detection: Cyrillic ratio calculation and classification
    • Text Normalization: Unicode NFC, Serbian punctuation handling
    • Morphological Analysis: Serbian-specific grammatical features
    • Named Entity Recognition: Serbian entity detection with transliteration
    • Tokenization: Cyrillic-aware word segmentation
    • Lemmatization: Serbian word root extraction
    • Transliteration: Cyrillic ↔ Latin conversion support

📚 LlamaIndex App (/llamaindex_app) - Intelligent Serbian Chunking

  • Technology: Python + Flask + LlamaIndex + PostgreSQL
  • Port: 5002
  • Serbian Features:
    • Dynamic Chunking: Content-adaptive splitting based on script type
      • Cyrillic Text: 20% larger chunks (4800 chars) with enhanced overlap
      • Mixed Script: 10% larger chunks (4400 chars) with balanced overlap
      • Latin Text: Standard chunking with optimized boundaries
    • Script-Aware Boundaries: Serbian sentence pattern recognition
    • Metadata Enhancement: Script type and ratio tracking
    • Query Optimization: Script-type-aware retrieval parameters
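
The chunk sizes above imply a base of 4000 characters (4800 is 120% of it, 4400 is 110%). A minimal sketch of that arithmetic, with a simple character-based overlap splitter, is shown below; the base value is inferred rather than quoted, and the real service is described as splitting on sentence boundaries, which this sketch does not attempt.

```python
from typing import List

BASE_CHUNK_CHARS = 4000  # inferred from 4800 = 120% and 4400 = 110%
SCALE_PERCENT = {"cyrillic": 120, "mixed": 110, "latin": 100}


def chunk_size_for(script_type: str) -> int:
    """Chunk size in characters for a detected script type."""
    return BASE_CHUNK_CHARS * SCALE_PERCENT.get(script_type, 100) // 100


def split_with_overlap(text: str, chunk_chars: int, overlap_chars: int) -> List[str]:
    """Split text into fixed-size windows that share overlap_chars characters."""
    if not 0 <= overlap_chars < chunk_chars:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")
    if not text:
        return []
    step = chunk_chars - overlap_chars
    stop = max(len(text) - overlap_chars, 1)
    return [text[i:i + chunk_chars] for i in range(0, stop, step)]
```

A Cyrillic document would then be split with `split_with_overlap(text, chunk_size_for("cyrillic"), overlap)`, where the overlap amount is a tuning choice this README does not specify.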

🔢 Embedic Large Service (/embedic_large_service) - Serbian Embedding Excellence

  • Technology: Python + Flask + SentenceTransformers
  • Port: 5003
  • Serbian Optimizations:
    • Model: djovak/embedic-large (1024-dimensional, Serbian-optimized)
    • Cyrillic Preprocessing: Unicode normalization, text cleaning
    • Content Analysis: Script detection, quality validation
    • Encoding Validation: UTF-8 normalization, replacement character detection
    • Statistics Tracking: Cyrillic content metrics, processing warnings
    • Batch Processing: Optimized for Serbian document collections
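
Replacement character detection, as mentioned under encoding validation, is a cheap guard worth running before text is embedded. The sketch below is a hypothetical helper, not the service's code: it counts U+FFFD characters, which appear when bytes were decoded with the wrong encoding or a multi-byte Cyrillic character was cut in half.

```python
from typing import Any, Dict, List


def validate_for_embedding(text: str) -> Dict[str, Any]:
    """Report empty input and U+FFFD replacement characters before embedding."""
    replacement_count = text.count("\ufffd")
    warnings: List[str] = []
    if not text.strip():
        warnings.append("empty text")
    if replacement_count:
        warnings.append(f"{replacement_count} replacement character(s) found")
    return {
        "ok": not warnings,
        "replacement_chars": replacement_count,
        "warnings": warnings,
    }
```

For example, dropping the final byte of the UTF-8 encoding of "Београд" and decoding with `errors="replace"` yields a string that ends in U+FFFD, which this check flags.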

🦙 Ollama Service - Serbian AI Configuration

  • Technology: Ollama + Gemma3 Model
  • Port: 11434
  • Serbian Enhancements:
    • Model Configuration: Gemma3 with Serbian system prompts
    • Temperature Optimization: Lower values (0.1) for factual Serbian responses
    • Token Limits: Increased to 2000 for comprehensive Serbian answers
    • Serbian Context: Specialized instructions for Serbian document analysis
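
These settings correspond to fields of Ollama's `/api/generate` request: `options.temperature` and `options.num_predict` (the maximum number of generated tokens). The sketch below only builds such a payload. The model tag `gemma3`, the system prompt wording, and the helper name are assumptions, and in this project the backend talks to Ollama through Spring AI rather than raw HTTP.

```python
from typing import Any, Dict

OLLAMA_URL = "http://localhost:11434/api/generate"  # port as listed above

# Assumed wording; the project's actual Serbian system prompt is not shown here.
SYSTEM_PROMPT = "Одговарај на српском језику, искључиво на основу датог контекста."


def build_generate_payload(question: str, context: str, model: str = "gemma3") -> Dict[str, Any]:
    """Build a non-streaming /api/generate request body with the settings above."""
    return {
        "model": model,
        "system": SYSTEM_PROMPT,
        "prompt": f"Контекст:\n{context}\n\nПитање: {question}",
        "stream": False,
        "options": {
            "temperature": 0.1,   # low temperature for factual answers
            "num_predict": 2000,  # token limit for the response
        },
    }
```

The resulting dictionary can be serialized with `json.dumps(payload, ensure_ascii=False)` and posted to `OLLAMA_URL`.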

🐘 PostgreSQL + PgVector - Serbian Vector Storage

  • Technology: PostgreSQL 16 + PgVector Extension
  • Port: 5435
  • Serbian Features:
    • 1024-dim Vectors: Optimized for djovak/embedic-large
    • Script Metadata: Cyrillic ratio, script type tracking
    • Serbian Statistics: Processing metrics, content analysis
    • Enhanced Indexing: HNSW optimized for Serbian content similarity
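
A schema matching these points could look like the sketch below, written as Python constants so it can be run with any PostgreSQL driver. The table and column names are hypothetical, since the project's actual schema is not shown here. The pgvector pieces themselves are standard: the `vector(n)` type, an `hnsw` index with `vector_cosine_ops`, the `<=>` cosine-distance operator, and the `[x,y,...]` text literal.

```python
from typing import Sequence

EMBEDDING_DIM = 1024  # dimension of djovak/embedic-large per this README

# Hypothetical table and column names, for illustration only.
SCHEMA_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS serbian_chunks (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    script_type TEXT,
    cyrillic_ratio REAL,
    embedding vector({EMBEDDING_DIM})
);
CREATE INDEX IF NOT EXISTS serbian_chunks_embedding_hnsw
    ON serbian_chunks USING hnsw (embedding vector_cosine_ops);
"""

# Cosine similarity is 1 minus the cosine distance returned by <=>.
SEARCH_SQL = """
SELECT content, 1 - (embedding <=> %s::vector) AS similarity
FROM serbian_chunks
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_pgvector_literal(vector: Sequence[float]) -> str:
    """Format an embedding as pgvector's text literal, e.g. '[0.1,0.2]'."""
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(vector)}")
    return "[" + ",".join(str(float(x)) for x in vector) + "]"
```

With a driver that uses `%s` placeholders, the literal would be passed twice to `SEARCH_SQL`, followed by the row limit.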

🚀 Enhanced Quick Start for Serbian Content

Prerequisites

  • Docker & Docker Compose
  • Serbian PDF documents (Cyrillic or Latin script)
  • Node.js 18+ (for frontend development)
  • Java 17+ (for backend development)
  • Python 3.8+ (for service development)

🇷🇸 Launch Serbian Processing System

# Start the entire Serbian-optimized stack
docker-compose up -d

# Verify all services with Serbian support
docker-compose ps

# Check Serbian NLP service
curl http://localhost:5001/health

# Verify Serbian embedding service
curl http://localhost:5003/health

📄 Ingest Serbian PDF

# Trigger ingestion of the Serbian PDF through the Serbian-optimized pipeline
curl -X GET http://localhost:8088/api/ingest-pdf-local

# The system will:
# 1. Detect Cyrillic content
# 2. Apply Serbian preprocessing
# 3. Perform CLASSLA NLP analysis
# 4. Create optimized chunks
# 5. Generate Serbian-optimized embeddings

📊 Serbian Processing Features

🎯 Advanced Serbian NLP

  • Script Detection: Automatic Cyrillic/Latin identification
  • Morphological Analysis: Complex Serbian grammar processing
  • Entity Recognition: Serbian person/location/organization detection
  • Transliteration: Seamless Cyrillic ↔ Latin conversion
  • Normalization: Unicode and punctuation standardization

🔍 Intelligent Serbian Retrieval

  • Script-Aware Search: Query script detection and optimization
  • Adaptive Thresholds: Different relevance scores for script types
  • Context Optimization: Serbian morphology-aware chunking
  • Metadata Enhancement: Script statistics and content analysis

💬 Serbian Chat Interface

  • Cyrillic Support: Full Serbian alphabet input and display
  • Mixed Script Handling: Seamless Latin/Cyrillic processing
  • Language Preservation: Response maintains query language
  • Serbian Typography: Optimized text rendering

🔄 Serbian Processing Pipeline

  • Quality Validation: Text encoding and content verification
  • Preprocessing Optimization: Serbian-specific text cleaning
  • Enhanced Embeddings: Cyrillic-optimized vector generation
  • Performance Monitoring: Serbian processing metrics

Optimized for the Serbian language and Cyrillic script 🇷🇸
