Topicer

Topicer is a Python-based software framework for topic discovery and semantic tag proposal in large collections of textual documents.
It provides a unified Python API, pre-trained models, and REST services for easy deployment and integration.

Topicer is developed within the semANT – Semantic Explorer of Textual Cultural Heritage project and is designed with modularity, extensibility, and configurability in mind.

Key Features

Topic discovery
- Unsupervised discovery of topics in text collections
- Automatic topic naming and description generation using LLMs
- Dense and sparse topic–document assignment
Tag proposal
- Zero-shot and few-shot tagging
- Span-level localization of tags in text
- Multiple tagging backends (LLM-based, neural models)
Unified architecture
- Single factory-based initialization from YAML config
- Shared abstractions for LLMs, embeddings, and databases
Deployment options
- Python package API
- REST API server
- Docker-based deployment
Multilingual focus
- Optimized for highly inflective languages (e.g. Czech)

Installation

The latest version can be installed directly from GitHub:

pip install git+https://github.com/DCGM/topicer

Install a specific version:

pip install git+https://github.com/DCGM/topicer@TAG

Available releases are listed at: https://github.com/DCGM/topicer/tags

Usage Overview

Topicer is optimized for ease of use through a single factory function that constructs a fully configured method instance from a YAML file.

Basic Example

from topicer import factory

method = factory("method_config.yaml")

# Propose tags in a text chunk
result = await method.propose_tags(text_chunk, tags)

# Discover topics
topics = await method.discover_topics_dense(text_chunks)

Each method instance exposes a unified API and is fully self-contained.

Models

You can download the trained models here:

Model	Topicer	URL
Base	CrossBertTopicer	Download

Supported Functionality

Topic Discovery

Discovers topics in text collections
Generates:
- Topic names
- Topic descriptions
- Topic–text assignments
Supports:
- Dense representation (score for every topic–text pair)
- Sparse representation (only strongly associated pairs)

Implemented using:

FASTopic
SentenceTransformers (Gemma2-based embeddings)
Lemmatization via Morphodita
LLM-based topic naming and description (e.g. gpt-5-mini)

Tag Proposal

Topicer supports multiple tag proposal methods:

LLM-based Tagging (`LLMTopicer`)

Uses external LLMs (OpenAI, Ollama, etc.)
Identifies and localizes tag spans using a robust matching strategy
Suitable for zero-shot and few-shot tagging

GLiNER-based Tagging (`GlinerTopicer`)

Uses pretrained multilingual GLiNER models
Token-span based predictions with configurable thresholds
Supports single-label and multi-label modes

Cross-Encoder BERT Tagging (`CrossBertTopicer`)

Fine-tuned BERT cross-encoder models
Token-level scoring with span merging
Optimized for Czech-language tagging
Pretrained models distributed with the package

API Reference

All topicer methods implement a subset of the following async API:

async def discover_topics_sparse(texts, n=None)
async def discover_topics_dense(texts, n=None)

async def discover_topics_in_db_sparse(db_request, n=None)
async def discover_topics_in_db_dense(db_request, n=None)

async def propose_tags(text_chunk, tags)
async def propose_tags_in_db(tag, db_request)

If a method does not implement a requested function, a NotImplemented exception is raised.

Configuration

Topicer uses YAML-based configuration powered by classconfig.

Configuration Structure

topicer:
  cls: ClassNameImplementingATopicerMethod
  config:
    # method-specific parameters

llm_service:
  cls: ClassNameImplementingALLMService
  config:
    # LLM parameters

embedding_service:
  cls: ClassNameImplementingAnEmbeddingService
  config:
    # embedding parameters

db_connection:
  cls: ClassNameImplementingADatabaseService
  config:
    # database parameters

The configuration is parsed and validated automatically by the factory.

REST API

Topicer includes a REST API built with FastAPI.

Available Endpoints

/v1/configs
/v1/topics/discover/texts/sparse
/v1/topics/discover/texts/dense
/v1/topics/discover/db/sparse
/v1/topics/discover/db/dense
/v1/tags/propose/texts
/v1/tags/propose/db

Swagger documentation is available at:

http://localhost:8000/docs

Running the API Server

Local Execution

python -m venv venv
source venv/bin/activate
pip install -r topicer_api/requirements.txt
cd topicer_api
python run.py

Environment variables:

APP_HOST (default: 127.0.0.1)
APP_PORT (default: 8000)
TOPICER_API_CONFIGS_DIR
TOPICER_API_CONFIGS_EXTENSION

Docker Deployment

cd deploy
./update.sh build
./update.sh up

The API will be available at:

http://localhost:8080

GPU acceleration requires NVIDIA Container Toolkit. For CPU-only deployment, comment out the GPU section in compose.yaml.

Repository Structure

deploy/              Docker deployment
docs/                Method documentation
examples/            Usage examples
tests/               Unit tests
embedding_service/   Embedding REST API
topicer_api/         REST API server
topicer/
  ├─ database/       Database abstractions
  ├─ embedding/      Embedding services
  ├─ llm/            LLM service integrations
  ├─ tagging/        Tag proposal methods
  ├─ topic_discovery/Topic discovery methods
  ├─ utils/
  ├─ base.py         Public API & factory
  ├─ schemas.py      Pydantic schemas
  └─ __init__.py

Extending Topicer

Topicer is designed to be easily extensible:

Add new topicers in tagging/ or topic_discovery/
Implement new shared services in:
- llm/
- embedding/
- database/
All new classes must be imported in topicer/__init__.py

For API-level changes, please open a GitHub issue first.

Requirements

Python 3.12 recommended
Optional GPU for local inference
OpenAI API key or local Ollama server for LLM-based methods
Docker & Docker Compose for containerized deployment

License

BSD 3-Clause License

Acknowledgements

This software was developed with financial support from the Ministry of Culture of the Czech Republic under the NAKI III program (Project ID: DH23P03OVV060).

Authors: Martin Dočekal, Martin Kostelník, Marin Kišš, Richard Juřica, Michal Hradiš Brno University of Technology (VUT Brno)

Name		Name	Last commit message	Last commit date
Latest commit History 152 Commits
deploy		deploy
docs		docs
examples		examples
ssh_tunnel_setup		ssh_tunnel_setup
tests		tests
topicer		topicer
topicer_api		topicer_api
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
config.yaml		config.yaml
download.sh		download.sh
pyproject.toml		pyproject.toml
requirements.bak		requirements.bak
requirements.txt		requirements.txt
run.py		run.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Topicer

Key Features

Installation

Usage Overview

Basic Example

Models

Supported Functionality

Topic Discovery

Tag Proposal

LLM-based Tagging (`LLMTopicer`)

GLiNER-based Tagging (`GlinerTopicer`)

Cross-Encoder BERT Tagging (`CrossBertTopicer`)

API Reference

Configuration

Configuration Structure

REST API

Available Endpoints

Running the API Server

Local Execution

Docker Deployment

Repository Structure

Extending Topicer

Requirements

License

Acknowledgements

About

Uh oh!

Releases

Packages

Contributors 6

Languages

License

DCGM/topicer

Folders and files

Latest commit

History

Repository files navigation

Topicer

Key Features

Installation

Usage Overview

Basic Example

Models

Supported Functionality

Topic Discovery

Tag Proposal

LLM-based Tagging (LLMTopicer)

GLiNER-based Tagging (GlinerTopicer)

Cross-Encoder BERT Tagging (CrossBertTopicer)

API Reference

Configuration

Configuration Structure

REST API

Available Endpoints

Running the API Server

Local Execution

Docker Deployment

Repository Structure

Extending Topicer

Requirements

License

Acknowledgements

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 6

Languages

LLM-based Tagging (`LLMTopicer`)

GLiNER-based Tagging (`GlinerTopicer`)

Cross-Encoder BERT Tagging (`CrossBertTopicer`)

Packages