Project Status And Backlog

Current Operating Position

  • Strix Halo must run every inference stage on ROCm; CPU fallback is not an acceptable normal path there.
  • The live Strix Halo default should remain llama_server for now.
  • The vllm runtime stays in the repo as an opt-in ROCm backend for future work and for later NVIDIA comparison.
  • Live validation on the Strix Halo remains part of the acceptance bar for runtime changes.
  • Apple Silicon now has an initial Metal host-run bring-up path with small defaults, and the active live validation target is the local 16 GB Mac mini.
  • scripts/acceptance.sh now exists as the repeatable API-level acceptance entry point for local Mac runs, ai-server, and read-only Strix Halo checks, including runtime settings/hardware and action-items CRUD/filter coverage.
  • Action items now have a first-class todo system with /api/action-items, a dedicated /todos page, and a Settings toggle for feature visibility.
  • The March 16, 2026 security pass now adds:
    • single-admin auth with bootstrap setup, login/logout, and password rotation
    • base OIDC support for providers like Authentik, including auth-code login, callback, allowlists, and reuse of the existing local session cookies
    • CSRF on mutating routes
    • auth/write/upload rate limits
    • JSON request-size ceilings on non-upload API writes
    • settings-side custom endpoint/model validation
    • audit logging with retention cleanup
    • forwarded-header trust is now opt-in, and localhost bootstrap no longer trusts spoofed X-Forwarded-For
    • cross-origin public auth posts are rejected
    • non-loopback HTTP auth is blocked by default
    • the AMD managed vision/summarization runtimes are no longer published on host ports by default
  • The earlier March 15, 2026 security pass fixed the low-friction network/upload leaks:
    • browser CORS was narrowed from wildcard-open to explicit origins plus local/private-network browser origins
    • job WebSockets now apply the same browser-origin check (a rough sketch of that shared check appears after this list)
    • uploads now enforce size while streaming and reject files that fail media probing
    • summary responses no longer leak absolute slide image filesystem paths
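
The browser-origin rules from these two security passes reduce to one predicate shared by CORS handling and the job WebSocket upgrade: explicit allowlisted origins plus localhost/private-network browser origins. The sketch below only illustrates that kind of check; the helper name and the allowlist contents are assumptions, not the shipped code.

```python
# Illustrative browser-origin predicate shared by CORS and WebSocket upgrades.
# allowed_origins and is_browser_origin_allowed are assumed names, not repo code.
import ipaddress
from urllib.parse import urlparse

allowed_origins = {"http://localhost:3000", "http://127.0.0.1:3000"}  # explicit allowlist

def is_browser_origin_allowed(origin: str) -> bool:
    """Accept explicit origins plus local/private-network browser origins."""
    if origin in allowed_origins:
        return True
    parsed = urlparse(origin)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    if parsed.hostname == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        return False  # named hosts must be in the explicit allowlist
    return ip.is_loopback or ip.is_private

# The same predicate gates both CORS responses and job WebSocket upgrades.
assert is_browser_origin_allowed("http://192.168.1.20:3000")
assert not is_browser_origin_allowed("https://evil.example.com")
```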

Validated Findings And Constraints

  • Strix Halo currently behaves best with the managed ROCm llama_server path plus idle teardown.
  • GPU idle must be validated with clocks and power, not only rocm-smi --showuse (a quick check of those signals is sketched after this list).
  • The original "GPU pinned at 100%" issue on Strix Halo was real with persistent ROCm model-server processes; managed teardown resolved it for the llama_server path.
  • vllm on Strix Halo gfx1151 is not a good default today:
    • the FP8 model path is not viable on this GPU family
    • the BF16 path can boot on ROCm, but startup is much slower
    • large vision and summarization engines contend heavily for memory
    • child engine processes must be torn down as a process group or VRAM remains pinned (the teardown pattern is sketched after this list)
  • On Strix Halo, vllm currently looks more like an experimental path than a production default.
  • For this hardware, the practical tradeoff today is:
    • llama_server: slower cold loads, but operationally reliable
    • vllm: promising future path, but currently too heavy and fragile for default use
  • The CLAP audio-event path is now live and produces a single structured primary context for summarization instead of a noisy flat hint list.
  • Live Strix Halo benchmark on March 10, 2026:
    • CLAP classified a narrated demo clip as produced narration or voice-over with high confidence
    • the summary with audio context added explicit narration context and one extra evidence-based key point compared with the no-audio version
    • the same summary call took about 27.3s with audio context versus 13.7s without it on a warm summarization endpoint
  • Current CLAP limitation from that benchmark:
    • a light synthetic music bed did not clear the supporting-cue threshold, so soundtrack sensitivity still needs calibration
  • Follow-up live Strix Halo benchmark on March 10, 2026 after the automatic support-scoring pass:
    • the same narrated benchmark clip now emits both produced narration or voice-over and noticeable music bed or soundtrack
    • the saved audio-event artifact shows soundtrack support scoring around 0.94 for that clip
    • the post-run GPU idle check still returned to 0% use at about 607-609 MHz
  • The repo now includes a checked-in CLAP baseline fixture pack and packaged baseline profile:
    • fixture manifest: backend/tests/fixtures/audio_calibration/manifest.json
    • packaged runtime baseline: backend/app/assets/audio_event_calibration.json
    • custom AUDIO_EVENT_CALIBRATION_PATH still overrides the packaged baseline when present
  • The checked-in CLAP fixture pack now distinguishes validated calibration fixtures from exploratory real clips:
    • validated calibration path:
      • voiceover_no_music
      • voiceover_with_music
      • broadcast_weather_radio
      • applause_real
    • exploratory fixtures kept out of default calibration for now:
      • meeting_room_real
      • software_demo_real
  • Live Strix Halo validation on March 10, 2026 established:
    • the real NOAA weather-radio clip separated correctly as broadcast playback
    • the real applause clip separated correctly as a crowd_applause supporting cue
    • the current real meeting and software-demo clips still collapsed toward produced narration in raw CLAP audio-only classification
    • those exploratory real clips remain useful for benchmarking and future model or prompt work, but should not tune the default calibration path yet
  • CLAP calibration had a real primary-prompt-selection bug:
    • primary prompt variants were not actually being rescored during calibration
    • that bug is now fixed
    • the packaged baseline profile has been regenerated from the four validated fixtures
  • exploratory CLAP fixtures now participate as negative contrast during calibration:
    • they can help reject over-broad prompt choices
    • they still do not count as validated positives for the packaged baseline
  • Current Mac-side exploratory CLAP measurements still show the same core limit:
    • meeting_room_real still collapses to podcast_voiceover
    • software_demo_real still leans toward podcast_voiceover
    • newer screencast candidates can flip toward meeting_room_speech, but they still do not separate cleanly enough to promote software_demo
  • Post-benchmark idle validation remained clean on Strix Halo:
    • after processing and two direct summary comparisons, the GPU returned to 0% use at roughly 608-609 MHz
  • Live Strix Halo duplicate handling is now in place:
    • duplicate policy and thresholds are configurable through /api/settings and the Settings UI
    • completed videos reject accidental reruns unless force=true is used explicitly
    • exact duplicate uploads are marked in videos.duplicate_info, skip standalone indexing, and are suppressed from default search results
    • explicit video_id or video_ids search and ask requests still allow targeting suppressed duplicates directly
  • Live Strix Halo validation on March 12, 2026 confirmed:
    • two repeated uploads of the same spoken probe were both classified as exact_duplicate with score: 1.0
    • default search for that probe content returned only the representative video after suppression
    • /api/videos/{id} now reports an active rerun as queued or processing instead of incorrectly preferring an older completed job
  • Live ai-server (NVIDIA) planner and runtime validation across March 12-14, 2026 established:
    • the machine exposes 1x RTX PRO 6000 Blackwell Workstation Edition plus 6x RTX 3090
    • the 6000 had about 97 GB free and is the correct first-choice placement when the current model set fits on one GPU
    • the planner now uses current free memory and nvidia-smi topo -m data, not only static total VRAM
    • CUDA worker-side models now honor the planner-selected device index, and local llama.cpp models can use the planner-selected CUDA main GPU or tensor split
    • the NVIDIA Docker path now has a dedicated CUDA backend image with CUDA PyTorch wheels, CUDA llama.cpp, and NeMo enabled by default
    • the CUDA image now keeps the llama-cpp-python vendored llama.cpp by default instead of force-swapping in a separate checkout; upstream overrides are now explicit
    • the NVIDIA Compose override now uses gpus: all; the earlier swarm-style reservation alone did not expose GPUs under normal Docker Compose
    • the default Compose stack now keeps PostgreSQL and Redis internal-only to avoid host port collisions during deployment
    • CUDA_VISIBLE_DEVICES must stay unset by default on NVIDIA; setting it to an empty string hides all CUDA devices from PyTorch and CTranslate2 (a small env-handling sketch appears after this list)
    • the CUDA backend image now smoke-builds successfully on that host after switching the image to a venv install, linking against CUDA driver stubs during docker build, and setting CUDA_ARCHITECTURES=86;120
    • the three NVLink-connected 3090 pairs are now recorded as fallback placement candidates when the 6000 cannot fit the active set
    • live ai-server validation on March 13, 2026 confirmed the CUDA Whisper path now loads Systran/faster-whisper-large-v3 correctly instead of the incompatible raw OpenAI snapshot layout
    • the next live blocker after that was pyannote credentials: when HF_TOKEN is missing, diarization now skips cleanly and the rest of the pipeline can continue
    • that same live CUDA pass also exposed a stale default vision-model config: the repo default now points at Qwen3VL-32B-Instruct-Q4_K_M.gguf instead of the old placeholder qwen3-omni GGUF path
    • live CUDA vision loads were not stable enough under Celery prefork; the NVIDIA worker now uses --pool=solo until the local llama.cpp path is hardened further
    • local CUDA llama.cpp now narrows CUDA_VISIBLE_DEVICES to the planner-selected GPUs before first import so device numbering matches the planner on heterogeneous multi-GPU hosts
    • the mixed NVIDIA path now behaves best with summarization pinned to a 3090 and vision pinned to the RTX PRO 6000
    • direct CUDA llama.cpp on the RTX PRO 6000 was stable enough for text summarization with the vendored llama.cpp, but not for Qwen3VL; the NVIDIA override now routes vision through official vLLM
    • live NVIDIA vision now launches the correct vLLM process on the RTX PRO 6000 and is healthy after first-start model load
    • the next live CUDA blocker after endpoint startup was torchaudio pulling in an incompatible torchcodec runtime during the optional CLAP audio-event step inside summarization
    • the audio-event path now reads extracted WAV files directly, and the worker now treats audio-event classification as fail-soft so summary generation can continue without audio hints when that optional stage breaks
    • current accelerator sizing guidance for the shipped model set is:
      • rough floor: about 24 GB free
      • practical single-accelerator target: about 32 GB free
      • warm worker models plus one local endpoint: about 50.5 GB of budget
      • single-accelerator fully resident target: about 100 GB free at the default memory fraction
    • live ai-server acceptance on March 14, 2026 now covers single upload, batch upload, summary retrieval, search, ask, and similar on the mixed 3090 + RTX PRO 6000 split
    • the runtime planner budget on large multi-GPU hosts is intentionally computed from currently free accelerator memory, while the Settings UI now shows installed accelerator memory separately so the two numbers are not conflated
    • the managed NVIDIA endpoint runtime must treat NVIDIA_VISION_VISIBLE_DEVICES and NVIDIA_SUMMARIZATION_VISIBLE_DEVICES as role-specific; reading the wrong one can pin summarization onto the vision GPU and deadlock the pipeline
    • endpoint startup retries are now bounded so a dead 503 Loading model loop fails instead of hanging jobs for many minutes
    • embedding indexing now sanitizes Chroma metadata values before insert so malformed nested slide/topic metadata cannot fail the whole job at the final embedding step
    • both ask responses and generated summaries now strip <think>...</think> reasoning blocks before returning or persisting model output (both hygiene steps are sketched after this list)
    • the runtime plan now exposes explicit worker execution mode, endpoint loading mode, per-model placement, and whether endpoint services should unload after each request
    • managed NVIDIA endpoints now auto-pick GPUs from current nvidia-smi free-memory data when explicit pins are unset
    • that NVIDIA auto-pick path now prefers the emptiest GPU that can fit the requested model instead of drifting onto a busier larger card with only slightly more free VRAM (a selection sketch appears after this list)
    • older vLLM 0.11.2 vision images were too old to recognize qwen3_5; the default NVIDIA vision image now tracks v0.17.1
    • on a single smaller NVIDIA GPU, managed endpoints now switch to stage-by-stage loading instead of trying to keep both large endpoint models resident
  • Apple Silicon now auto-detects as hardware_profile=apple_silicon with gpu_backend=metal
  • the initial Apple bring-up path uses smaller default models on unified-memory Macs:
    • small Whisper
    • nomic-ai/nomic-embed-text-v1.5
    • Qwen2.5-VL-3B-Instruct.Q4_K_M.gguf
    • Qwen2.5-3B-Instruct.Q4_K_M.gguf
  • on smaller Apple unified-memory systems, the planner now forces full stage-by-stage loading instead of trying to keep a hybrid resident set hot
  • the repo bootstrap scripts now detect Apple Silicon and stop recommending the old Vulkan Strix-Halo path
  • live Mac mini validation on March 15, 2026 exposed three host-run requirements:
    • default Apple host paths should be repo-local data/ directories instead of /data/...
    • the Apple worker should use Celery --pool=solo
    • the Metal Whisper transformers path needs the soundfile dependency installed
  • Ask mode now behaves as a real follow-up chat:
    • /api/search/ask accepts optional prior turns (a hedged request sketch appears after this list)
    • retrieval expands follow-up queries with recent user turns
    • the Search UI keeps a local conversation thread instead of replacing the last answer
  • Summary action items can now be promoted into a dedicated todo list:
    • /api/action-items supports create/update/delete/list
    • list filtering can use linked video labels
    • the video detail page can add summary suggestions or manual todos directly
    • the /todos page is the global check-off view
  • Settings now exposes model selectors for all main runtime model slots plus a verify/add flow:
    • built-in catalog entries can be reselected directly
    • custom candidates can be checked against the built-in GGUF registry or Hugging Face before being added
  • Similar-video ranking now weights transcript and key-point overlap more heavily than generic summary tone, which reduces false neighbors for generic narrated videos
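
For the idle-validation bar above, a quick check has to look at clock frequencies and power draw as well as the use percentage. A minimal sketch follows; it assumes recent rocm-smi flag names, which can differ between ROCm releases, so it is an illustration rather than the acceptance tooling.

```python
# Rough idle probe: utilization alone can read 0% while clocks or power stay high.
# Flag names assume a recent rocm-smi build and may vary across ROCm versions.
import subprocess

def show_idle_signals() -> None:
    for flags in (["--showuse"], ["--showclocks"], ["--showpower"]):
        result = subprocess.run(["rocm-smi", *flags], capture_output=True, text=True, check=False)
        print(result.stdout)

# On the Strix Halo runs noted above, "idle" meant ~0% use with sclk settling
# near its floor (roughly 607-609 MHz) and power back at idle levels.
if __name__ == "__main__":
    show_idle_signals()
```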
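
The process-group constraint above is the standard POSIX pattern: start each child engine in its own session and signal the whole group on teardown. A minimal sketch, assuming a plain subprocess launcher rather than the repo's actual runtime manager:

```python
# Launch a child engine in its own session so the whole process group
# (engine plus any worker children it forks) can be torn down together.
# The command line passed in is a placeholder, not the repo's real engine invocation.
import os
import signal
import subprocess

def launch_engine(cmd: list[str]) -> subprocess.Popen:
    # start_new_session=True puts the child in a new session / process group.
    return subprocess.Popen(cmd, start_new_session=True)

def teardown_engine(proc: subprocess.Popen, timeout: float = 30.0) -> None:
    # Signal the entire group; killing only the parent can leave VRAM pinned
    # by orphaned engine workers.
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        proc.wait()
```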
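
Two of the CUDA findings above are really one rule about CUDA_VISIBLE_DEVICES: leave it unset rather than empty, and when the planner has picked GPUs, narrow it before the first CUDA-aware import so in-process device numbering matches the plan. A minimal sketch with an assumed helper name:

```python
# CUDA_VISIBLE_DEVICES="" hides every device from PyTorch and CTranslate2, so the
# variable must be either absent or a non-empty index list.
# apply_planner_device_selection is a hypothetical helper, not the repo's API.
import os

def apply_planner_device_selection(planner_gpus: list[int] | None) -> None:
    if planner_gpus:
        # Narrow visibility so device 0 inside the process is the planner's first choice.
        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in planner_gpus)
    else:
        # Never leave an empty string behind: that hides all CUDA devices.
        os.environ.pop("CUDA_VISIBLE_DEVICES", None)

apply_planner_device_selection([1, 3])  # e.g. a planner-selected 3090 pair
# torch / llama_cpp / ctranslate2 should only be imported after this point so that
# device numbering inside the process matches the planner on multi-GPU hosts.
```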
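
The two hygiene fixes above are small enough to sketch together. Chroma only accepts str, int, float, and bool metadata values, so nested slide/topic structures have to be serialized or dropped before insert; the function names and serialization policy here are illustrative, not the repo's implementation.

```python
# Illustrative hygiene helpers; names and exact policies are assumptions.
import json
import re

_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think_blocks(text: str) -> str:
    """Drop <think>...</think> reasoning spans before persisting or returning output."""
    return _THINK_RE.sub("", text).strip()

def sanitize_chroma_metadata(metadata: dict) -> dict:
    """Coerce metadata to the scalar types Chroma accepts (str, int, float, bool)."""
    clean = {}
    for key, value in metadata.items():
        if value is None:
            continue  # Chroma rejects None values
        if isinstance(value, (str, int, float, bool)):
            clean[key] = value
        else:
            # Nested slide/topic structures get serialized instead of failing the insert.
            clean[key] = json.dumps(value, ensure_ascii=False, default=str)
    return clean

print(sanitize_chroma_metadata({"topics": ["intro", "demo"], "slide": {"index": 3}, "title": "clip"}))
print(strip_think_blocks("<think>internal reasoning</think>Final answer."))
```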
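
The auto-pick behavior above can be illustrated with a direct nvidia-smi memory query: among GPUs whose free memory covers the requested model, prefer the least busy card rather than the one with marginally more absolute free VRAM. The helper below is a sketch under that reading; the real planner also folds in topology data and explicit pins.

```python
# Pick the emptiest GPU (by free fraction) that can still fit the requested model.
# pick_gpu_for_model is an illustrative name, not the planner's actual code.
import subprocess

def query_memory_mib() -> dict[int, tuple[int, int]]:
    """Return {gpu_index: (free_mib, total_mib)} from nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.free,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    gpus = {}
    for line in out.strip().splitlines():
        index, free, total = (int(field.strip()) for field in line.split(","))
        gpus[index] = (free, total)
    return gpus

def pick_gpu_for_model(required_mib: int) -> int | None:
    candidates = {i: (free, total) for i, (free, total) in query_memory_mib().items()
                  if free >= required_mib}
    if not candidates:
        return None  # fall back to stage-by-stage loading or multi-GPU placement
    # "Emptiest" here means highest free fraction, so a mostly idle smaller card wins
    # over a busier larger card that has only slightly more absolute free VRAM.
    return max(candidates, key=lambda i: candidates[i][0] / candidates[i][1])
```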
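
For the follow-up chat behavior above, a request with prior turns might look like the sketch below. The payload field names (query, history, role, content) and the response key are assumptions about the shape, not the documented contract of /api/search/ask, and a real deployment would also carry the admin session cookie from login.

```python
# Hedged example of a follow-up ask request; field names are assumptions.
import requests

BASE = "http://localhost:8000"  # assumed local API address
session = requests.Session()    # a real run would log in first and reuse the session cookie

first = session.post(f"{BASE}/api/search/ask", json={"query": "What does the demo cover?"}).json()

followup = session.post(
    f"{BASE}/api/search/ask",
    json={
        "query": "And what were the action items from it?",
        # Prior turns let retrieval expand the follow-up query with recent user context.
        "history": [
            {"role": "user", "content": "What does the demo cover?"},
            {"role": "assistant", "content": first.get("answer", "")},
        ],
    },
).json()
print(followup)
```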

High Priority Requirements

  • Keep AMD/Strix Halo fully functional on ROCm unless CPU execution is benchmarked to be equally fast for the exact stage and model in question.
  • Run a full end-to-end security review across API, frontend, model/runtime orchestration, remote endpoints, secrets handling, uploads, search/ask flows, and deployment surfaces so the product is defensible as a secure local-first system.
  • Security enhancement ranking after the March 16, 2026 pass:
    • High:
      • add per-user auth/RBAC if the product needs multi-user or shared-host deployments; the current model is still intentionally single-admin even with OIDC
      • add stricter export/search quotas once real production traffic patterns are known
      • complete the remaining end-to-end security review against the newly secured live targets and deployment docs
    • Medium:
      • add stronger reverse-proxy/web-server headers, especially CSP, if the app is fronted by a production ingress
      • add optional session/device management views if operators need to revoke other active sessions
      • review artifact-retention controls for uploads, transcripts, slides, embeddings, and exports beyond the new audit-log retention
    • Low:
      • tighten origin defaults further for named-host deployments beyond localhost/private-network use
      • add optional MFA only if the product moves beyond the current trusted local-admin model
  • Standardize the Strix Halo host baseline around a known-good ROCm/kernel combination and verify it during live acceptance so ROCm regressions are caught outside the app code too.
  • Replace the current fixed model-loading behavior with a dynamic residency planner that:
    • detects GPU count, memory, and topology automatically
    • determines whether models can remain resident instead of unloading between stages
    • respects a configurable memory ceiling so the runtime stays under a user-defined VRAM or unified-memory budget
    • decides when models should be pinned to one GPU, shared across GPUs, or split across GPUs (a rough decision sketch follows this list)
  • Reduce AMD cold-start overhead for large embedding models so the final embedding stage does not dominate batch tail latency.
  • Keep live validation on the Strix Halo as part of the acceptance bar, including small real video fixtures.
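
To make the residency-planner requirement above concrete, the core decision is roughly: given the detected accelerators and a configurable memory ceiling, decide which models stay resident and where, and fall back to stage-by-stage loading for anything that does not fit. The sketch below is a deliberately small greedy version under those assumptions; the type and function names are illustrative, not a design commitment.

```python
# Minimal residency decision sketch: keep models resident while the working set fits
# under the configured ceiling, otherwise mark them for stage-by-stage loading.
# GpuSpec, ModelSpec, and plan_residency are illustrative names only.
from dataclasses import dataclass

@dataclass
class GpuSpec:
    index: int
    free_gib: float

@dataclass
class ModelSpec:
    name: str
    size_gib: float

def plan_residency(gpus: list[GpuSpec], models: list[ModelSpec], ceiling_fraction: float = 0.9) -> dict:
    placements: dict[str, int] = {}
    budget = {g.index: g.free_gib * ceiling_fraction for g in gpus}
    for model in sorted(models, key=lambda m: m.size_gib, reverse=True):
        # Place the largest models first on the GPU with the most remaining budget.
        best = max(budget, key=budget.get)
        if budget[best] >= model.size_gib:
            placements[model.name] = best
            budget[best] -= model.size_gib
        else:
            # Nothing fits under the ceiling: this model loads per stage and unloads after.
            placements[model.name] = -1
    resident = all(gpu >= 0 for gpu in placements.values())
    return {"placements": placements, "mode": "resident" if resident else "stage_by_stage"}

print(plan_residency([GpuSpec(0, 96.0)], [ModelSpec("summarization", 20.0), ModelSpec("vision", 24.0)]))
```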

Active Functional Work

  • Replace automatic transcription fallback with an explicit ASR model switcher or selector so one chosen ASR path runs deterministically instead of silently falling back to Whisper.
  • Add full NVIDIA runtime support after the AMD ROCm path is solid.
  • Live-validate the new macOS Metal bring-up path on an actual Apple Silicon machine.

Product And Pipeline Enhancements

  • Keep sourcing better real meeting-room and software-demo fixtures until CLAP can separate them cleanly enough to promote them from exploratory to validated calibration inputs.
  • Add external sync/export for the action-item system once the target integration format is decided.
  • Let the summarization model reconcile CLAP-derived audio hints with transcript, slides, and vision context, but do not rely on the transcription model alone for audio-scene hints because ASR text drops nonverbal audio information.
  • Keep benchmarking whether CLAP-derived audio context materially improves summaries enough to justify the extra prompt and latency cost on Strix Halo and NVIDIA.
  • Incremental recomputation with step-level caching and lineage or versioning for transcripts, frames, vision metadata, and summaries.
  • Multi-index retrieval routing (transcript, slides, vision) with reranking for better long-video QA.
  • Optional live transcription plus audio capture or mixing mode for real-time meetings and demos.
  • Log clarity: show the actual ASR backend instead of the generic whisper label.
  • Ensure UI and logs reflect Canary when the ASR backend is Canary.