- Strix Halo must run ROCm-only for all inference stages; CPU fallback is not an acceptable normal path there.
- The live Strix Halo default should remain `llama_server` for now.
- The `vllm` runtime stays in the repo as an opt-in ROCm backend for future work and for later NVIDIA comparison.
- Live validation on the Strix Halo remains part of the acceptance bar for runtime changes.
- Apple Silicon now has an initial Metal host-run bring-up path with small defaults, and the active live validation target is the local 16 GB Mac mini.
- `scripts/acceptance.sh` now exists as the repeatable API-level acceptance entry point for local Mac runs, `ai-server`, and read-only Strix Halo checks, including runtime settings/hardware and action-items CRUD/filter coverage.
- Action items now have a first-class todo system with `/api/action-items`, a dedicated `/todos` page, and a Settings toggle for feature visibility.
- The March 16, 2026 security pass now adds:
- single-admin auth with bootstrap setup, login/logout, and password rotation
  - baseline OIDC support for providers such as Authentik, including auth-code login, callback handling, allowlists, and reuse of the existing local session cookies
- CSRF on mutating routes
- auth/write/upload rate limits
  - JSON request-size ceilings on non-upload API writes (sketched after this list)
- settings-side custom endpoint/model validation
- audit logging with retention cleanup
  - forwarded-header trust is now opt-in, localhost bootstrap no longer trusts spoofed `X-Forwarded-For`, cross-origin public auth posts are rejected, and non-loopback HTTP auth is blocked by default
  - the AMD managed vision/summarization runtimes are no longer published on host ports by default
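One item from this pass, the JSON request-size ceiling, as a minimal sketch assuming a Starlette/FastAPI middleware; the path prefix, limit, and method set are illustrative, not the repo's real values:

```python
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import JSONResponse

MAX_JSON_BODY = 1_000_000  # illustrative ceiling, not the repo's real limit

class JsonSizeCeiling(BaseHTTPMiddleware):
    """Reject oversized bodies on non-upload API writes before they are read."""
    async def dispatch(self, request: Request, call_next):
        if request.method in {"POST", "PUT", "PATCH"} \
                and not request.url.path.startswith("/api/upload"):
            length = int(request.headers.get("content-length") or 0)
            if length > MAX_JSON_BODY:
                return JSONResponse({"detail": "request body too large"},
                                    status_code=413)
        return await call_next(request)
```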
- The earlier March 15, 2026 security pass fixed the low-friction network/upload leaks:
- browser CORS was narrowed from wildcard-open to explicit origins plus local/private-network browser origins
- job WebSockets now apply the same browser-origin check
- uploads now enforce size while streaming and reject files that fail media probing
- summary responses no longer leak absolute slide image filesystem paths
- Strix Halo currently behaves best with the managed ROCm `llama_server` path plus idle teardown.
- GPU idle must be validated with clocks and power, not only `rocm-smi --showuse`.
- The original "GPU pinned at 100%" issue on Strix Halo was real with persistent ROCm model-server processes; managed teardown resolved it for the `llama_server` path.
- `vllm` on Strix Halo `gfx1151` is not a good default today:
  - the FP8 model path is not viable on this GPU family
  - the BF16 path can boot on ROCm, but startup is much slower
  - large vision and summarization engines contend heavily for memory
  - child engine processes must be torn down as a process group or VRAM remains pinned (see the teardown sketch after this list)
- On Strix Halo, `vllm` currently looks more like an experimental path than a production default.
- For this hardware, the practical tradeoff today is:
  - `llama_server`: slower cold loads, but operationally reliable
  - `vllm`: promising future path, but currently too heavy and fragile for default use
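A minimal sketch of the process-group teardown point above, assuming a POSIX host; the engine command and timeout are illustrative, not the repo's actual runner:

```python
import os
import signal
import subprocess

# Illustrative engine command; the real managed runtime command differs.
ENGINE_CMD = ["vllm", "serve", "Qwen/Qwen2.5-3B-Instruct"]

# start_new_session=True puts the engine in its own process group, so the
# whole tree (parent plus spawned engine workers) can be signaled together.
proc = subprocess.Popen(ENGINE_CMD, start_new_session=True)

def teardown(p: subprocess.Popen, timeout: float = 30.0) -> None:
    """Signal the whole process group; killing only the parent can leave
    child engine processes alive with VRAM still pinned."""
    pgid = os.getpgid(p.pid)
    os.killpg(pgid, signal.SIGTERM)      # polite shutdown first
    try:
        p.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)  # then force-kill stragglers
        p.wait()
```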
- The CLAP audio-event path is now live and produces a single structured primary context for summarization instead of a noisy flat hint list.
- Live Strix Halo benchmark on March 10, 2026:
  - CLAP classified a narrated demo clip as `produced narration or voice-over` with high confidence
  - the summary with audio context added explicit narration context and one extra evidence-based key point compared with the no-audio version
  - the same summary call took about `27.3s` with audio context versus `13.7s` without it on a warm summarization endpoint
- Current CLAP limitation from that benchmark:
- a light synthetic music bed did not clear the supporting-cue threshold, so soundtrack sensitivity still needs calibration
- Follow-up live Strix Halo benchmark on March 10, 2026 after the automatic support-scoring pass:
  - the same narrated benchmark clip now emits both `produced narration or voice-over` and `noticeable music bed or soundtrack`
  - the saved audio-event artifact shows soundtrack support scoring around `0.94` for that clip
  - the post-run GPU idle check still returned to `0%` use at about `607-609 MHz` (see the idle-check sketch below)
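A minimal sketch of the clocks-plus-use idle check referenced above, assuming standard `rocm-smi` flags; output parsing is best-effort since formats vary across ROCm versions:

```python
import re
import subprocess

def gpu_is_idle(max_use_pct: int = 1, max_sclk_mhz: int = 700) -> bool:
    """Rough idle probe: require low GPU use AND low sclk, since utilization
    alone can read 0% while clocks stay pinned high. The ~700 MHz bound
    reflects the ~608 MHz idle clocks observed above."""
    out = subprocess.run(
        ["rocm-smi", "--showuse", "--showclocks", "--showpower"],
        capture_output=True, text=True, check=True,
    ).stdout
    use = re.search(r"GPU use \(%\):\s*(\d+)", out)
    sclk = re.search(r"sclk.*?\((\d+)\s*Mhz\)", out, re.IGNORECASE)
    low_use = use is not None and int(use.group(1)) <= max_use_pct
    low_clk = sclk is not None and int(sclk.group(1)) <= max_sclk_mhz
    return low_use and low_clk
```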
- The repo now includes a checked-in CLAP baseline fixture pack and packaged baseline profile:
  - fixture manifest: `backend/tests/fixtures/audio_calibration/manifest.json`
  - packaged runtime baseline: `backend/app/assets/audio_event_calibration.json`
  - a custom `AUDIO_EVENT_CALIBRATION_PATH` still overrides the packaged baseline when present
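A sketch of that resolution order, assuming the packaged asset lives in an importable `app.assets` module as the path above suggests:

```python
import json
import os
from importlib import resources

def load_audio_event_calibration() -> dict:
    """Resolution order described above: an explicit
    AUDIO_EVENT_CALIBRATION_PATH wins, otherwise the packaged baseline."""
    custom = os.environ.get("AUDIO_EVENT_CALIBRATION_PATH")
    if custom and os.path.isfile(custom):
        with open(custom, encoding="utf-8") as fh:
            return json.load(fh)
    packaged = resources.files("app.assets") / "audio_event_calibration.json"
    return json.loads(packaged.read_text(encoding="utf-8"))
```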
- The checked-in CLAP fixture pack now distinguishes validated calibration fixtures from exploratory real clips:
  - validated calibration path:
    - `voiceover_no_music`
    - `voiceover_with_music`
    - `broadcast_weather_radio`
    - `applause_real`
  - exploratory fixtures kept out of default calibration for now:
    - `meeting_room_real`
    - `software_demo_real`
- Live Strix Halo validation on March 10, 2026 established:
  - the real NOAA weather-radio clip separated correctly as `broadcast playback`
  - the real applause clip separated correctly as a `crowd_applause` supporting cue
  - the current real meeting and software-demo clips still collapsed toward produced narration in raw CLAP audio-only classification
  - those exploratory real clips remain useful for benchmarking and future model or prompt work, but should not tune the default calibration path yet
- CLAP calibration had a real primary-prompt-selection bug:
- primary prompt variants were not actually being rescored during calibration
- that bug is now fixed
- the packaged baseline profile has been regenerated from the four validated fixtures
- exploratory CLAP fixtures now participate as negative contrast during calibration:
- they can help reject over-broad prompt choices
- they still do not count as validated positives for the packaged baseline
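A toy illustration of the negative-contrast idea, not the repo's actual scorer: a candidate prompt is ranked by the margin between its weakest validated match and its strongest exploratory match, so over-broad prompts rank lower:

```python
def contrast_score(scores: dict[str, float],
                   validated: list[str],
                   exploratory: list[str]) -> float:
    """scores maps fixture name -> CLAP similarity for one candidate prompt.
    Over-broad prompts also score high on exploratory clips, shrinking the
    margin and pushing them below tighter prompt choices."""
    weakest_positive = min(scores[name] for name in validated)
    strongest_negative = max(scores[name] for name in exploratory)
    return weakest_positive - strongest_negative
```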
- Current Mac-side exploratory CLAP measurements still show the same core limit:
  - `meeting_room_real` still collapses to `podcast_voiceover`
  - `software_demo_real` still leans toward `podcast_voiceover`
  - newer screencast candidates can flip toward `meeting_room_speech`, but they still do not separate cleanly enough to promote `software_demo`
- Post-benchmark idle validation remained clean on Strix Halo:
  - after processing and two direct summary comparisons, the GPU returned to `0%` use at roughly `608-609 MHz`
- Live Strix Halo duplicate handling is now in place:
  - duplicate policy and thresholds are configurable through `/api/settings` and the Settings UI
  - completed videos reject accidental reruns unless `force=true` is used explicitly
  - exact duplicate uploads are marked in `videos.duplicate_info`, skip standalone indexing, and are suppressed from default search results
  - explicit `video_id` or `video_ids` search and ask requests still allow targeting suppressed duplicates directly
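A condensed sketch of that admission policy; the field names, return labels, and `1.0` threshold are illustrative, with the real policy and thresholds coming from `/api/settings`:

```python
from dataclasses import dataclass

@dataclass
class Video:
    status: str  # "queued" | "processing" | "completed"

def admission_decision(video: Video, similarity: float,
                       force: bool = False,
                       exact_threshold: float = 1.0) -> str:
    """Returns "reject_rerun", "mark_duplicate", or "process"."""
    if video.status == "completed" and not force:
        return "reject_rerun"      # completed videos refuse accidental reruns
    if similarity >= exact_threshold:
        # recorded in videos.duplicate_info: skip standalone indexing and
        # suppress from default search results
        return "mark_duplicate"
    return "process"
```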
- Live Strix Halo validation on March 12, 2026 confirmed:
  - two repeated uploads of the same spoken probe were both classified as `exact_duplicate` with `score: 1.0`
  - default search for that probe content returned only the representative video after suppression
  - `/api/videos/{id}` now reports an active rerun as `queued` or `processing` instead of incorrectly preferring an older completed job
- Live `ai-server` planner validation on March 12, 2026 confirmed:
  - the machine exposes `1x RTX PRO 6000 Blackwell Workstation Edition` plus `6x RTX 3090`
  - the 6000 had about `97 GB` free and is the correct first-choice placement when the current model set fits on one GPU
  - the planner now uses current free memory and `nvidia-smi topo -m` data, not only static total VRAM
  - CUDA worker-side models now honor the planner-selected device index, and local `llama.cpp` models can use the planner-selected CUDA main GPU or tensor split
  - the NVIDIA Docker path now has a dedicated CUDA backend image with CUDA PyTorch wheels, CUDA `llama.cpp`, and NeMo enabled by default
  - the CUDA image now keeps the `llama-cpp-python` vendored `llama.cpp` by default instead of force-swapping in a separate checkout; upstream overrides are now explicit
  - the NVIDIA Compose override now uses `gpus: all`; the earlier swarm-style reservation alone did not expose GPUs under normal Docker Compose
  - the default Compose stack now keeps PostgreSQL and Redis internal-only to avoid host port collisions during deployment
  - `CUDA_VISIBLE_DEVICES` must stay unset by default on NVIDIA; setting it to an empty string hides all CUDA devices from PyTorch and CTranslate2
  - the CUDA backend image now smoke-builds successfully on that host after switching the image to a venv install, linking against CUDA driver stubs during `docker build`, and setting `CUDA_ARCHITECTURES=86;120`
  - the three NVLink-connected 3090 pairs are now recorded as fallback placement candidates when the 6000 cannot fit the active set
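A sketch of the placement rule those findings imply: prefer the emptiest single GPU that fits, then fall back to an NVLink pair; the pair indices are illustrative, not the live host's recorded topology:

```python
import subprocess

def free_mem_mib() -> list[int]:
    """Per-GPU free memory via nvidia-smi's stable CSV query interface."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.split()]

def place_model(required_mib: int,
                nvlink_pairs: tuple[tuple[int, int], ...] = ((1, 2), (3, 4), (5, 6)),
                ) -> list[int]:
    free = free_mem_mib()
    fits = [i for i, f in enumerate(free) if f >= required_mib]
    if fits:
        # prefer the emptiest card that fits, not merely the largest one
        return [max(fits, key=lambda i: free[i])]
    for a, b in nvlink_pairs:
        if free[a] + free[b] >= required_mib:
            return [a, b]  # split across an NVLink-connected pair
    raise RuntimeError("no single GPU or NVLink pair fits the requested model")
```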
- live `ai-server` validation on March 13, 2026 confirmed the CUDA Whisper path now loads `Systran/faster-whisper-large-v3` correctly instead of the incompatible raw OpenAI snapshot layout
- the next live blocker after that was pyannote credentials: when `HF_TOKEN` is missing, diarization now skips cleanly and the rest of the pipeline can continue
- that same live CUDA pass also exposed a stale default vision-model config: the repo default now points at `Qwen3VL-32B-Instruct-Q4_K_M.gguf` instead of the old placeholder `qwen3-omni` GGUF path
- live CUDA vision loads were not stable enough under Celery prefork; the NVIDIA worker now uses `--pool=solo` until the local `llama.cpp` path is hardened further
- local CUDA `llama.cpp` now narrows `CUDA_VISIBLE_DEVICES` to the planner-selected GPUs before first import so device numbering matches the planner on heterogeneous multi-GPU hosts
- the mixed NVIDIA path now behaves best with summarization pinned to a `3090` and vision pinned to the `RTX PRO 6000`
- direct CUDA `llama.cpp` on the `RTX PRO 6000` was stable enough for text summarization with the vendored `llama.cpp`, but not for `Qwen3VL`; the NVIDIA override now routes vision through official `vLLM`
- live NVIDIA vision now launches the correct `vLLM` process on the `RTX PRO 6000` and is healthy after first-start model load
- the next live CUDA blocker after endpoint startup was `torchaudio` pulling in an incompatible `torchcodec` runtime during the optional CLAP audio-event step inside summarization
- the audio-event path now reads extracted WAV files directly, and the worker now treats audio-event classification as fail-soft so summary generation can continue without audio hints when that optional stage breaks
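A minimal sketch of that fail-soft wrapper behavior, with a stand-in for the real classifier call:

```python
import logging

logger = logging.getLogger(__name__)

def run_clap_classifier(wav_path: str) -> dict:
    raise NotImplementedError("stand-in for the real CLAP classifier call")

def audio_events_fail_soft(wav_path: str) -> dict | None:
    """The optional stage may raise (for example on a broken
    torchaudio/torchcodec combination); summarization must continue without
    audio hints instead of failing the whole job."""
    try:
        return run_clap_classifier(wav_path)
    except Exception:
        logger.warning("audio-event classification failed; "
                       "continuing without audio hints", exc_info=True)
        return None
```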
- current accelerator sizing guidance for the shipped model set is:
  - rough floor: about `24 GB` free
  - practical single-accelerator target: about `32 GB` free
  - warm worker models plus one local endpoint: about `50.5 GB` of budget
  - single-accelerator fully resident target: about `100 GB` free at the default memory fraction
- live `ai-server` acceptance on March 14, 2026 now covers single upload, batch upload, summary retrieval, search, ask, and similar on the mixed `3090 + RTX PRO 6000` split
- the runtime planner budget on large multi-GPU hosts is intentionally computed from currently free accelerator memory, while the Settings UI now shows installed accelerator memory separately so the two numbers are not conflated
- the managed NVIDIA endpoint runtime must treat `NVIDIA_VISION_VISIBLE_DEVICES` and `NVIDIA_SUMMARIZATION_VISIBLE_DEVICES` as role-specific; reading the wrong one can pin summarization onto the vision GPU and deadlock the pipeline
- endpoint startup retries are now bounded so a dead `503 Loading model` loop fails instead of hanging jobs for many minutes
- embedding indexing now sanitizes Chroma metadata values before insert so malformed nested slide/topic metadata cannot fail the whole job at the final embedding step (a sanitization sketch follows below)
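A sketch of that sanitization step, assuming Chroma's documented scalar-only metadata values (str/int/float/bool):

```python
import json

def sanitize_chroma_metadata(meta: dict) -> dict:
    """Flatten nested slide/topic structures to JSON strings and drop None
    values, so a single malformed entry cannot fail the whole job at the
    final embedding step."""
    clean: dict = {}
    for key, value in meta.items():
        if isinstance(value, (str, int, float, bool)):
            clean[key] = value
        elif value is None:
            continue  # Chroma rejects None metadata values
        else:
            clean[key] = json.dumps(value, ensure_ascii=False, default=str)
    return clean
```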
- both `ask` responses and generated summaries now strip `<think>...</think>` reasoning blocks before returning or persisting model output (a minimal strip sketch follows below)
- the runtime plan now exposes explicit worker execution mode, endpoint loading mode, per-model placement, and whether endpoint services should unload after each request
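A minimal sketch of the reasoning-block strip; the real code may handle unbalanced tags differently:

```python
import re

# Remove reasoning blocks before returning or persisting model output.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(text: str) -> str:
    return THINK_RE.sub("", text)

assert strip_think("<think>chain of thought</think>The answer is 4.") == "The answer is 4."
```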
- managed NVIDIA endpoints now auto-pick GPUs from current `nvidia-smi` free-memory data when explicit pins are unset
- that NVIDIA auto-pick path now prefers the emptiest GPU that can fit the requested model instead of drifting onto a busier larger card with only slightly more free VRAM
- older `vLLM 0.11.2` vision images were too old to recognize `qwen3_5`; the default NVIDIA vision image now tracks `v0.17.1`
- on a single smaller NVIDIA GPU, managed endpoints now switch to stage-by-stage loading instead of trying to keep both large endpoint models resident
- Apple Silicon now auto-detects as `hardware_profile=apple_silicon` with `gpu_backend=metal` (a detection sketch follows this group)
- the initial Apple bring-up path uses smaller default models on unified-memory Macs:
  - `small` Whisper
  - `nomic-ai/nomic-embed-text-v1.5`
  - `Qwen2.5-VL-3B-Instruct.Q4_K_M.gguf`
  - `Qwen2.5-3B-Instruct.Q4_K_M.gguf`
- on smaller Apple unified-memory systems, the planner now forces full stage-by-stage loading instead of trying to keep a hybrid resident set hot
- the repo bootstrap scripts now detect Apple Silicon and stop recommending the old Vulkan Strix-Halo path
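A minimal sketch of the kind of auto-detection described above, using the profile and backend names from this document; the real detection likely also probes ROCm/CUDA availability:

```python
import platform
import sys

def detect_hardware_profile() -> tuple[str, str]:
    """Return (hardware_profile, gpu_backend) for the current host."""
    if sys.platform == "darwin" and platform.machine() == "arm64":
        return "apple_silicon", "metal"
    return "generic", "cpu"  # placeholder fallback, not the repo's real default
```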
- live Mac mini validation on March 15, 2026 exposed three host-run requirements:
  - default Apple host paths should be repo-local `data/` directories instead of `/data/...`
  - the Apple worker should use Celery `--pool=solo`
  - the Metal Whisper transformers path needs the `soundfile` dependency installed
- Ask mode now behaves as a real follow-up chat:
  - `/api/search/ask` accepts optional prior turns
  - retrieval expands follow-up queries with recent user turns (sketched below)
- the Search UI keeps a local conversation thread instead of replacing the last answer
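A hedged sketch of that follow-up expansion; the turn shape `{"role": ..., "content": ...}` is an assumption, not the documented API schema:

```python
def expand_followup_query(question: str, prior_turns: list[dict],
                          max_user_turns: int = 2) -> str:
    """Fold the most recent user turns into the retrieval query so a terse
    follow-up ("what about the second demo?") still retrieves against the
    topic of the conversation."""
    recent = [t["content"] for t in prior_turns if t.get("role") == "user"]
    context = " ".join(recent[-max_user_turns:])
    return f"{context} {question}".strip() if context else question
```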
- Summary action items can now be promoted into a dedicated todo list:
  - `/api/action-items` supports create/update/delete/list
  - list filtering can use linked video labels
- the video detail page can add summary suggestions or manual todos directly
  - the `/todos` page is the global check-off view
- Settings now exposes model selectors for all main runtime model slots plus a verify/add flow:
- built-in catalog entries can be reselected directly
- custom candidates can be checked against the built-in GGUF registry or Hugging Face before being added
- Similar-video ranking now weights transcript and key-point overlap more heavily than generic summary tone, which reduces false neighbors for generic narrated videos
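A toy weighting that matches the described behavior; the actual weights are not documented here:

```python
def similar_score(sim_transcript: float, sim_keypoints: float,
                  sim_summary: float) -> float:
    """Transcript and key-point overlap dominate; generic summary tone
    contributes little, which suppresses false neighbors between unrelated
    but similarly narrated videos. Weights are illustrative only."""
    return 0.45 * sim_transcript + 0.40 * sim_keypoints + 0.15 * sim_summary
```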
- Keep AMD/Strix Halo fully functional on ROCm unless CPU execution is benchmarked to be equally fast for the exact stage and model in question.
- Run a full end-to-end security review across API, frontend, model/runtime orchestration, remote endpoints, secrets handling, uploads, search/ask flows, and deployment surfaces so the product is defensible as a secure local-first system.
- Security enhancement ranking after the March 16, 2026 pass:
- High:
- add per-user auth/RBAC if the product needs multi-user or shared-host deployments; the current model is still intentionally single-admin even with OIDC
- add stricter export/search quotas once real production traffic patterns are known
- complete the remaining end-to-end security review against the newly secured live targets and deployment docs
- Medium:
- add stronger reverse-proxy/web-server headers, especially CSP, if the app is fronted by a production ingress
- add optional session/device management views if operators need to revoke other active sessions
- review artifact-retention controls for uploads, transcripts, slides, embeddings, and exports beyond the new audit-log retention
- Low:
- tighten origin defaults further for named-host deployments beyond localhost/private-network use
- add optional MFA only if the product moves beyond the current trusted local-admin model
- Standardize the Strix Halo host baseline around a known-good ROCm/kernel combination and verify it during live acceptance so ROCm regressions are caught outside the app code too.
- Replace the current fixed model-loading behavior with a dynamic residency planner (a minimal sketch follows this list) that:
- detects GPU count, memory, and topology automatically
- determines whether models can remain resident instead of unloading between stages
- respects a configurable memory ceiling so the runtime stays under a user-defined VRAM or unified-memory budget
- decides when models should be pinned to one GPU, shared across GPUs, or split across GPUs
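A minimal sketch of the residency decision under a configurable ceiling; sizes, fraction, and the all-or-nothing rule are illustrative simplifications of what the planner would do:

```python
def plan_residency(model_sizes_gb: dict[str, float],
                   free_gb: float,
                   ceiling_fraction: float = 0.9) -> dict[str, bool]:
    """True means "keep resident"; if the full set does not fit under the
    budget, fall back to stage-by-stage loading (unload between stages).
    A real planner would also weigh topology and per-GPU placement."""
    budget = free_gb * ceiling_fraction
    keep_resident = sum(model_sizes_gb.values()) <= budget
    return {name: keep_resident for name in model_sizes_gb}
```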
- Reduce AMD cold-start overhead for large embedding models so the final embedding stage does not dominate batch tail latency.
- Keep live validation on the Strix Halo as part of the acceptance bar, including small real video fixtures.
- Replace automatic transcription fallback with an explicit ASR model switcher or selector so one chosen ASR path runs deterministically instead of silently falling back to Whisper.
- Add full NVIDIA runtime support after the AMD ROCm path is solid.
- Live-validate the new macOS Metal bring-up path on an actual Apple Silicon machine.
- Keep sourcing better real meeting-room and software-demo fixtures until CLAP can separate them cleanly enough to promote them from exploratory to validated calibration inputs.
- Add external sync/export for the action-item system once the target integration format is decided.
- Let the summarization model reconcile CLAP-derived audio hints with transcript, slides, and vision context, but do not rely on the transcription model alone for audio-scene hints because ASR text drops nonverbal audio information.
- Keep benchmarking whether CLAP-derived audio context materially improves summaries enough to justify the extra prompt and latency cost on Strix Halo and NVIDIA.
- Incremental recomputation with step-level caching and lineage or versioning for transcripts, frames, vision metadata, and summaries.
- Multi-index retrieval routing (transcript, slides, vision) with reranking for better long-video QA.
- Optional live transcription plus audio capture or mixing mode for real-time meetings and demos.
- Log clarity: show the actual ASR backend instead of the generic `whisper` label.
- Ensure UI and logs reflect Canary when the ASR backend is Canary.