
⚠️ Work in Progress — EchOnyx is under active development and not yet ready for general use. Expect breaking changes, incomplete features, and rough edges.

EchOnyx

EchOnyx: Local, privacy-first video and presentation intelligence that runs entirely on your hardware. Designed for long-form meetings, demos, and reviews where details matter.

Features

  • Speech-to-text transcription (high-accuracy models, configurable)
  • Speaker diarization (who spoke when)
  • Scene/keyframe extraction and visual analysis (slides + screen content)
  • Structured summaries (executive summary, key points, actions, decisions)
  • Semantic search + question answering across all content
  • Ask mode keeps a local follow-up chat thread and reuses prior turns for grounded follow-up questions
  • User labels (tags) and label-scoped search/ask
  • First-class todos/action items with a dedicated list view, video-linked add/remove flows, and label-aware filtering
  • Retry (resume) and Reset (reprocess) pipeline controls
  • Duplicate detection with configurable suppression thresholds
  • Settings model management with verify/add flow for built-in entries and Hugging Face model ids

Quick Start (Docker)

Prereqs

  • Docker + Docker Compose
  • 32GB RAM minimum (128GB recommended for Strix Halo)
  • Hugging Face account + accepted pyannote terms if you want pyannote diarization

1) Configure

git clone <repository-url>
cd EchOnyx
cp .env.example .env

Update .env if you want diarization:

HF_TOKEN=hf_your_token_here

2) Run

AMD Strix Halo / ROCm:

docker compose -f docker-compose.yml -f docker-compose.amd.yml up -d

The managed AMD vision and summarization runtimes are internal-only now; they are no longer published on host ports by default.

NVIDIA:

docker compose -f docker-compose.yml -f docker-compose.nvidia.yml up -d

If you are building on a host without a visible GPU during docker build, set CUDA_ARCHITECTURES for your target cards. On the live ai-server, 86;120 was validated for RTX 3090 + RTX PRO 6000 Blackwell.

Current NVIDIA runtime notes:

  • The NVIDIA override now uses gpus: all, so normal Docker Compose exposes every visible NVIDIA GPU to backend and worker containers.
  • The NVIDIA worker currently runs Celery with --pool=solo for stability while local CUDA llama.cpp vision and summarization loads are being hardened.
  • The audio-event path reads extracted WAV audio directly instead of depending on torchaudio file I/O, so a bad torchcodec runtime no longer blocks summarization.
  • The CUDA image builds llama-cpp-python against its bundled vendored llama.cpp by default; only opt into an external llama.cpp checkout if you are intentionally testing an upstream override.
  • The NVIDIA endpoint services self-place from live nvidia-smi free-memory data when explicit device pins are unset. On a single smaller GPU, they automatically switch to stage-by-stage endpoint loading instead of trying to keep both vision and summarization hot.
  • On the live ai-server, the current mixed NVIDIA path is: summarization on a pinned 3090 via bundled-vendor CUDA llama.cpp, and vision on the RTX PRO 6000 via official vLLM.
  • The default NVIDIA vision image tracks vLLM v0.17.1 so newer Qwen families like Qwen3.5 are recognized.
  • Summaries and ask answers strip <think>...</think> reasoning blocks before they are stored or returned.

Apple Silicon / Metal: Run backend and worker on the host, not Docker. The initial Metal bring-up now auto-selects smaller defaults on unified-memory Macs:

  • WHISPER_MODEL=small
  • EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5
  • VISION_MODEL=Qwen2.5-VL-3B-Instruct.Q4_K_M.gguf
  • SUMMARIZATION_MODEL=Qwen2.5-3B-Instruct.Q4_K_M.gguf

On Apple host runs, the worker should use Celery --pool=solo, and the default Apple path now uses local project data/ directories instead of /data/.... Follow backend/README.md for the host-run commands. This path is meant to prove functionality on a 16 GB Apple Silicon machine, not the full high-capacity default stack yet.

3) Access

  • Frontend: http://localhost:3000
  • API: http://localhost:8000
  • FastAPI docs: http://localhost:8000/docs
  • PostgreSQL and Redis stay internal to the Compose network by default; they are no longer published on host ports unless you add your own override.

Configuration (Key Env Vars)

Set these in .env as needed:

  • GPU_BACKEND: cuda | metal | vulkan | rocm | cpu
  • MODEL_LOADING: sequential (low memory) or parallel
  • CORS_ALLOWED_ORIGINS: optional comma-separated explicit browser origins to trust
  • CORS_ALLOW_ORIGIN_REGEX: optional override for the default local/private-network browser-origin regex
  • AUTH_REQUIRED: keep true unless you intentionally want an unauthenticated local dev instance
  • AUTH_PASSWORD_HASH: optional preseeded local admin password hash
  • TRUST_PROXY_HEADERS, TRUSTED_PROXY_CIDRS: only enable these when EchOnyx sits behind a trusted reverse proxy that sets X-Forwarded-*
  • ALLOW_INSECURE_AUTH_HTTP: emergency/dev-only override; leave false in any real deployment
  • OIDC_ENABLED: enable external OIDC login, including Authentik
  • OIDC_PROVIDER_NAME, OIDC_ISSUER_URL, OIDC_CLIENT_ID, OIDC_CLIENT_SECRET: base OIDC provider config
  • OIDC_ALLOWED_EMAILS, OIDC_ALLOWED_GROUPS: optional allowlists for OIDC logins
  • OIDC_REDIRECT_URI, OIDC_FRONTEND_REDIRECT_URL: optional overrides when the default :8000 callback and :3000 frontend redirect are not correct
  • HF_TOKEN: optional overall, but required if you want pyannote diarization
  • VISION_ENDPOINT_URL, VISION_ENDPOINT_MODEL: use an external VL server
  • SUMMARIZATION_ENDPOINT_URL, SUMMARIZATION_ENDPOINT_MODEL: use an external LLM server
  • AUDIO_EVENT_CALIBRATION_PATH: optional JSON profile that overrides CLAP prompts and support thresholds
  • ROCM_LLM_RUNTIME: llama_server (managed idle teardown) or vllm
  • ROCM_LLM_IDLE_TIMEOUT_S: idle shutdown for ROCm llama_server endpoints
  • INSTALL_VLLM=1: opt-in build flag for the heavier ROCm vLLM image path
  • VLLM_INSTALL_METHOD: wheel (official ROCm wheel index) or source
  • CUDA_WHL_URL, CUDA_TORCH_VERSION, CUDA_TORCHAUDIO_VERSION, CUDA_TORCHVISION_VERSION: CUDA PyTorch image build controls
  • CUDA_ARCHITECTURES: optional CUDA arch list for llama.cpp image builds; use target SMs such as 86;120 for 3090 + RTX PRO 6000 Blackwell
  • CUDA_VISIBLE_DEVICES: leave it unset by default; setting it to an empty string hides all CUDA devices. When it is unset, the planner narrows local llama.cpp loads to the devices it selects automatically before the first import.
  • NVIDIA_VISION_VISIBLE_DEVICES, NVIDIA_SUMMARIZATION_VISIBLE_DEVICES: optional role-specific NVIDIA endpoint-service pinning overrides; when they are unset, the managed NVIDIA endpoints auto-pick the emptiest GPU that can fit the requested model instead of stealing a busier larger card
  • NVIDIA_ENDPOINT_IDLE_TIMEOUT_SECONDS: idle teardown for managed NVIDIA endpoint runtimes
  • LLAMA_BUILD_CUDA=1: enable CUDA llama.cpp builds in the NVIDIA backend image
  • INSTALL_NEMO=1: include NeMo so Canary ASR works in the NVIDIA image
  • Note: audio-event classification is treated as supporting context only; if that stage fails, summarization continues without audio hints
  • VISION_VLLM_MODEL_ID, SUMMARIZATION_VLLM_MODEL_ID: Hugging Face model ids for the vLLM runtime
  • EMBEDDING_MODEL: embedding model id (HF)
  • UPLOAD_DIR, MODEL_CACHE_DIR: storage locations
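As an example, offloading both model stages to external OpenAI-compatible servers might look like this in .env. The URLs and model names below are placeholders for illustration, not shipped defaults:

```env
# Offload vision analysis to an external VL server (placeholder values)
VISION_ENDPOINT_URL=http://192.168.1.50:8001/v1
VISION_ENDPOINT_MODEL=qwen2.5-vl-7b-instruct

# Offload summarization to an external LLM server (placeholder values)
SUMMARIZATION_ENDPOINT_URL=http://192.168.1.50:8002/v1
SUMMARIZATION_ENDPOINT_MODEL=qwen2.5-14b-instruct
```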

Models

Defaults are configured in .env.example. All models are swappable:

  • Transcription: WHISPER_MODEL (explicit ASR selector; no silent fallback)
  • Diarization: DIARIZATION_MODEL (pyannote, optional if HF_TOKEN is unset)
  • Vision: VISION_MODEL (GGUF) or VISION_ENDPOINT_*
  • Summarization: SUMMARIZATION_MODEL (GGUF) or SUMMARIZATION_ENDPOINT_*
  • Embeddings: EMBEDDING_MODEL (HF)
  • Audio Hints: AUDIO_EVENT_MODEL (defaults to CLAP for raw-audio source cues)

On Apple Silicon bring-up, the repo now defaults to smaller local models so a 16 GB Mac can process videos sequentially on Metal.

GGUF models can be downloaded automatically via the built-in model downloader when needed.

Audio Calibration

Use the fixture-driven calibration command to generate an AUDIO_EVENT_CALIBRATION_PATH profile from labeled media fixtures:

python -m app.core.audio_calibration \
  --manifest /path/to/audio-calibration-manifest.json \
  --output /data/models/audio_event_calibration.json

Manifest shape:

{
  "fixtures": [
    {
      "media_path": "/abs/path/to/demo-with-music.mp4",
      "expected_primary_key": "podcast_voiceover",
      "expected_supporting_keys": ["music_heavy"],
      "label": "demo_with_music",
      "use_for_calibration": true
    }
  ]
}

Relative media_path values are resolved from the manifest directory. The command accepts audio or video files and will extract temporary audio automatically when needed. Set use_for_calibration to false for exploratory fixtures you want to keep in the pack without letting them tune the default profile yet.
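As an illustrative sketch of that resolution behavior (not the actual app.core.audio_calibration loader; the function name here is hypothetical):

```python
import json
from pathlib import Path

def load_calibration_fixtures(manifest_path):
    """Load a calibration manifest, resolving relative media_path values
    against the manifest's own directory and keeping only fixtures
    flagged with use_for_calibration. Illustrative sketch only."""
    manifest_path = Path(manifest_path)
    data = json.loads(manifest_path.read_text())
    fixtures = []
    for fx in data["fixtures"]:
        media = Path(fx["media_path"])
        if not media.is_absolute():
            # Relative media_path values resolve from the manifest directory.
            media = manifest_path.parent / media
        if fx.get("use_for_calibration", True):
            fixtures.append({**fx, "media_path": str(media)})
    return fixtures
```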

The repo now ships:

  • a checked-in fixture pack in backend/tests/fixtures/audio_calibration/ with both validated and exploratory clips
  • a conservative packaged baseline profile at backend/app/assets/audio_event_calibration.json

That packaged baseline loads automatically when /data/models/audio_event_calibration.json is absent, and a custom AUDIO_EVENT_CALIBRATION_PATH still overrides it.
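That precedence can be sketched as follows. This is an illustrative sketch of the documented override order (explicit AUDIO_EVENT_CALIBRATION_PATH first, then the /data/models profile, then the packaged baseline); the helper name is hypothetical:

```python
from pathlib import Path

def resolve_calibration_path(
    env_path,
    data_path="/data/models/audio_event_calibration.json",
    packaged="backend/app/assets/audio_event_calibration.json",
):
    """Hypothetical helper showing the documented precedence:
    custom env path > /data/models profile > packaged baseline."""
    if env_path:
        return env_path
    if Path(data_path).exists():
        return data_path
    return packaged
```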

Current note: the default packaged profile intentionally remains conservative. Live Strix Halo validation on March 10, 2026 confirmed that the real weather-radio and applause fixtures are audio-separable enough to keep in the active calibration path, while the current real meeting and software-demo fixtures remain exploratory because raw CLAP audio-only classification still collapses them toward produced narration. Primary prompt calibration now scores the real primary prompt variants instead of reusing one score per class, and the packaged baseline has been regenerated from the four validated fixtures. Exploratory fixtures are now also used as negative contrast during calibration, so they can help reject over-broad prompt choices without being promoted into the validated calibration set.

Regenerate the packaged baseline with:

python -m app.core.audio_calibration \
  --manifest backend/tests/fixtures/audio_calibration/manifest.json \
  --output backend/app/assets/audio_event_calibration.json

Operations Guide (Operator Focus)

Upload + Processing

  • Upload a video in the UI and processing starts immediately.
  • Processing steps: audio extraction → transcription → diarization → transcript merge → frame extraction → vision analysis → summarization → embedding.
  • If HF_TOKEN is not configured, diarization is skipped and the rest of the pipeline continues with transcript-only speaker data.

Retry vs Reset

  • Retry: resumes from the last successful step (idempotent).
  • Reset: restarts the entire pipeline from scratch.
  • Completed videos do not rerun by accident. A full rerun now requires an explicit forced reset or reprocess action.
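A minimal sketch of those semantics (illustrative only, not the actual Celery task code): Retry skips steps already recorded as successful, while Reset clears that record first so everything reruns.

```python
PIPELINE = ["audio_extraction", "transcription", "diarization", "transcript_merge",
            "frame_extraction", "vision_analysis", "summarization", "embedding"]

def run_pipeline(completed, run_step, reset=False):
    """Run remaining pipeline steps. `completed` is the set of steps that
    already succeeded; reset=True empties it so the whole pipeline reruns."""
    if reset:
        completed.clear()
    for step in PIPELINE:
        if step in completed:
            continue  # Retry resumes past steps that already succeeded
        run_step(step)
        completed.add(step)
    return completed
```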

Duplicate Handling

  • Duplicate policy is configurable in Settings.
  • Default behavior collapses exact duplicates out of default search results while keeping one representative indexed.
  • Suppressed duplicates can still be targeted directly with explicit video_id / video_ids search and ask requests.
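For example, a search request body that explicitly targets suppressed duplicates might look like this (the exact request schema is not documented here; the values are placeholders):

```json
{
  "query": "budget review",
  "video_ids": ["<suppressed-duplicate-id>", "<another-video-id>"]
}
```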

Labels (Tags)

  • Add labels on a video’s detail page.
  • Use labels as filters in Search/Ask to target only those videos.

Todos / Action Items

  • Summary action items can be added into a real todo list instead of acting like one-off checkboxes.
  • The video detail page lets you add summary-derived todos, add manual todos, complete them, and remove them.
  • The dedicated /todos view lets you filter by completion state, text, and video labels.
  • /api/action-items exposes the same data for automation and later external sync work.

Model Management

  • Settings now exposes selectors for ASR, diarization, vision, summarization, embeddings, and audio events.
  • You can verify a built-in registry name or Hugging Face model id, then add it into the selector before saving.

Authentication

  • The UI/API now use a single local admin session.
  • First use requires creating the admin password through the sign-in gate or POST /api/auth/setup.
  • First-run password setup is now localhost-only by default. For remote first-run installs, preseed AUTH_PASSWORD_HASH or configure OIDC first.
  • After setup, access uses password login and Settings can rotate the password.
  • You can also enable OIDC for providers like Authentik. When OIDC_ENABLED=true, the sign-in gate adds a provider login button and the backend exchanges the auth code into the same local session cookies used by the rest of the app.
  • If you want OIDC-only login, leave AUTH_PASSWORD_HASH unset and configure the OIDC env vars instead.
  • Remote auth should run behind HTTPS. Non-loopback HTTP auth is now blocked by default unless ALLOW_INSECURE_AUTH_HTTP=true is set explicitly.

Delete Videos

  • Use the Delete button on a video detail page to remove the video and all associated data (artifacts + embeddings).

Acceptance Script

Use scripts/acceptance.sh for repeatable end-to-end checks.

It now verifies health, /api/settings, /api/settings/hardware, model status, upload/batch flow, summary/search/ask/similar, and the action-items CRUD/filter path.

Examples:

# Local Mac mini functional pass
ECHONYX_PASSWORD='<admin-password>' \
scripts/acceptance.sh \
  --base-url http://127.0.0.1:8000 \
  --primary-fixture /Users/vac/EchOnyx/tmp/mac-smoke/budget.mp4 \
  --secondary-fixture /Users/vac/EchOnyx/tmp/mac-smoke/probe.mp4 \
  --search-query "budget review" \
  --ask-question "When is the budget review due?" \
  --ask-expects "Friday" \
  --run-batch

# ai-server mixed NVIDIA pass
ECHONYX_PASSWORD='<admin-password>' \
scripts/acceptance.sh \
  --base-url http://192.168.1.147:8000 \
  --primary-fixture /Users/vac/EchOnyx/tmp/live-fixtures/probe1.mp4 \
  --secondary-fixture /Users/vac/EchOnyx/tmp/live-fixtures/probe2.mp4 \
  --search-query "budget review due Friday" \
  --ask-question "When is the budget review due?" \
  --ask-expects "Friday" \
  --run-batch

# Strix Halo non-disruptive health/models check
scripts/acceptance.sh --base-url http://192.168.1.178:8000 --read-only

Search & Q/A

Use the Search page to:

  • Search transcripts and summaries
  • Ask natural-language questions with follow-up chat in the same thread
  • Apply label filters to narrow the scope
  • Similar-video ranking now favors transcript and key-point overlap more heavily than generic narration style

Hardware Support

  • Strix Halo (AMD APU with 128GB unified memory): Sequential model loading (current default)
  • RTX 5090 (single high-VRAM NVIDIA GPU, 32GB+): Parallel model loading
  • Multi-GPU (multiple NVIDIA GPUs): Parallel model loading
  • CPU Only (fallback for systems without GPU): Sequential model loading

Current accelerator sizing guidance for the shipped model set:

  • Plan against free accelerator memory, not only installed VRAM or unified memory.
  • The Settings runtime panel now shows installed accelerator memory separately from the active free-memory budget.
  • Rough floor is about 24 GB free to run the largest current stage sequentially.
  • Practical single-accelerator target is about 32 GB free.
  • Keeping the worker-side models warm needs about 26.5 GB of budget.
  • Keeping worker-side models warm plus one local endpoint at a time needs about 50.5 GB of budget.
  • Keeping the whole current stack resident on one accelerator needs about 74.5 GB of budget, which is about 100 GB free at the default GPU_MEMORY_FRACTION=0.75.
  • On multi-GPU systems, the planner now prefers the emptiest accelerator that can fit the requested model set, then falls back to topology-aware spread.
  • CUDA worker-side models now honor the planner's preferred device selection, and the NVIDIA Compose override now defaults vision/summarization to dedicated CUDA llama_cpp.server containers instead of in-process worker loads.
  • The CUDA backend image now smoke-builds successfully on the live ai-server, and the mixed 3090 + RTX PRO 6000 runtime has now passed a live end-to-end acceptance run.
  • Embedding indexing now sanitizes Chroma metadata to scalar-safe values before insert so malformed slide/topic payloads do not fail the whole job near the end.
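The fraction arithmetic in the full-residency figure above works out as follows, assuming GPU_MEMORY_FRACTION caps the usable share of free accelerator memory:

```python
# Budgets quoted above, in GB.
full_stack_budget = 74.5
gpu_memory_fraction = 0.75  # default GPU_MEMORY_FRACTION

# Free memory needed so the usable fraction still covers the budget.
required_free = full_stack_budget / gpu_memory_fraction
print(round(required_free, 1))  # ~99.3 GB, i.e. about 100 GB free
```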

Current AMD note:

  • Strix Halo is treated as a ROCm-only profile; Vulkan and CPU fallbacks are rejected.
  • The AMD Docker override now supports two ROCm LLM endpoint paths behind the same OpenAI-compatible URLs:
    • llama_server: AMD ROCm llama.cpp, managed with idle teardown
    • vllm: vLLM OpenAI server for ROCm (opt-in image build)
  • The vllm path can load Hugging Face model ids directly while still serving the existing endpoint model names expected by the backend.
  • The AMD Docker override still uses AMD's published ROCm llama.cpp server artifact for gfx115X and fails closed if ROCm cannot enumerate a supported device.
  • Current AMD defaults target ROCm 7.2 for both the backend wheels and the dedicated GGUF server image.

Active Requirements

  • Strix Halo and other AMD systems must be fully functional on ROCm unless CPU execution is proven to be equally fast for the same stage/model.
  • Model residency must become dynamic instead of hard-coded:
    • detect available GPUs, VRAM/unified memory, and topology automatically
    • use current free memory, not only total VRAM, when deciding placement
    • determine whether all models can stay resident without unloading
    • support a configurable memory ceiling so the runtime keeps itself under a user-defined budget
    • decide whether models should be isolated, shared, or split across GPUs when the hardware supports it
  • Cold-start penalties for large embedding models need to be reduced on AMD so batch tails do not stall behind first-load startup costs.
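The "emptiest accelerator that fits" placement rule above can be sketched as a toy planner (illustrative only; the real planner also handles topology-aware spread, and the function name is hypothetical):

```python
def place_models(free_gb_by_gpu, required_gb, ceiling_gb=None):
    """Pick the GPU with the most free memory that can still fit the
    requested model set, optionally respecting a user-defined memory
    ceiling. Returns the GPU id, or None if nothing fits."""
    candidates = []
    for gpu, free in free_gb_by_gpu.items():
        budget = free if ceiling_gb is None else min(free, ceiling_gb)
        if budget >= required_gb:
            candidates.append((free, gpu))
    if not candidates:
        return None  # caller would fall back to splitting across GPUs
    # Prefer the emptiest accelerator that can fit the whole set.
    return max(candidates)[1]
```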

Architecture (Brief)

Frontend (Next.js)
        |
Backend (FastAPI)
        |
Redis (queue) + Postgres (metadata) + ChromaDB (embeddings)
        |
Worker (Celery)
        |
Models (ASR, diarization, vision, LLM, embeddings)

Troubleshooting (Short)

  • Diarization missing: set HF_TOKEN and accept pyannote terms if you want speaker labels. Without it, uploads still process but diarization is skipped.
  • Out of memory: use MODEL_LOADING=sequential, smaller models, or external endpoints.
  • Model download errors: verify model IDs/filenames in .env and registry.
  • Jobs stuck: restart workers; stale job recovery will requeue.

Security Notes

  • Browser access is no longer wildcard-open by default. CORS now trusts explicit origins plus local/private-network browser origins, and job WebSockets apply the same origin check.
  • Uploads now enforce the size limit while streaming and reject files that do not probe as valid video media.
  • Summary responses now strip absolute slide image filesystem paths down to filenames before returning them to clients.
  • API/UI access now use a single-admin session with bootstrap setup, login/logout, and password rotation.
  • Protected routes require session auth; mutating routes require a matching CSRF token too.
  • Auth attempts, uploads, and mutating API operations are rate-limited and written to audit logs with retention cleanup.
  • JSON write routes now have request-size ceilings, and settings-side custom model/endpoint updates reject unsafe public HTTP endpoints and path-like model names.

Development (Minimal)

For local development, you can still run via Docker. If you run services directly:

  • Backend: backend/ (FastAPI + Celery)
  • Frontend: frontend/ (Next.js)

Tests:

pytest backend/tests

License

MIT License - see LICENSE.

Acknowledgments

  • faster-whisper
  • pyannote-audio
  • llama.cpp / llama-cpp-python
  • Qwen models
