Project Status And Backlog

Current Operating Position

  • Strix Halo must run every inference stage on ROCm; CPU fallback is not an acceptable normal path there.
  • The live Strix Halo default should remain llama_server for now.
  • The vllm runtime stays in the repo as an opt-in ROCm backend for future work and for later NVIDIA comparison.
  • Live validation on the Strix Halo remains part of the acceptance bar for runtime changes.
  • Apple Silicon now has an initial Metal host-run bring-up path with small defaults, and the active live validation target is the local 16 GB Mac mini.
  • scripts/acceptance.sh now exists as the repeatable API-level acceptance entry point for local Mac runs, ai-server, and read-only Strix Halo checks, including runtime settings/hardware and action-items CRUD/filter coverage.
  • Action items now have a first-class todo system with /api/action-items, a dedicated /todos page, and a Settings toggle for feature visibility.
  • The March 16, 2026 security pass now adds:
    • single-admin auth with bootstrap setup, login/logout, and password rotation
    • base OIDC support for providers like Authentik, including auth-code login, callback, allowlists, and reuse of the existing local session cookies
    • CSRF on mutating routes
    • auth/write/upload rate limits
    • JSON request-size ceilings on non-upload API writes
    • settings-side custom endpoint/model validation
    • audit logging with retention cleanup
    • forwarded-header trust is now opt-in, and localhost bootstrap no longer trusts spoofed X-Forwarded-For
    • cross-origin public auth posts are rejected
    • non-loopback HTTP auth is blocked by default
    • the AMD managed vision/summarization runtimes are no longer published on host ports by default
  • The earlier March 15, 2026 security pass fixed the low-friction network/upload leaks:
    • browser CORS was narrowed from wildcard-open to explicit origins plus local/private-network browser origins
    • job WebSockets now apply the same browser-origin check (a rough sketch of that shared check appears after this list)
    • uploads now enforce size while streaming and reject files that fail media probing
    • summary responses no longer leak absolute slide image filesystem paths
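
The browser-origin rules from these two security passes reduce to one predicate shared by CORS handling and the job WebSocket upgrade: explicit allowlisted origins plus localhost/private-network browser origins. The sketch below only illustrates that kind of check; the helper name and the allowlist contents are assumptions, not the shipped code.

```python
# Illustrative browser-origin predicate shared by CORS and WebSocket upgrades.
# allowed_origins and is_browser_origin_allowed are assumed names, not repo code.
import ipaddress
from urllib.parse import urlparse

allowed_origins = {"http://localhost:3000", "http://127.0.0.1:3000"}  # explicit allowlist

def is_browser_origin_allowed(origin: str) -> bool:
    """Accept explicit origins plus local/private-network browser origins."""
    if origin in allowed_origins:
        return True
    parsed = urlparse(origin)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    if parsed.hostname == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        return False  # named hosts must be in the explicit allowlist
    return ip.is_loopback or ip.is_private

# The same predicate gates both CORS responses and job WebSocket upgrades.
assert is_browser_origin_allowed("http://192.168.1.20:3000")
assert not is_browser_origin_allowed("https://evil.example.com")
```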

Validated Findings And Constraints

  • Strix Halo currently behaves best with the managed ROCm llama_server path plus idle teardown.
  • GPU idle must be validated with clocks and power, not only rocm-smi --showuse (a quick check of those signals is sketched after this list).
  • The original "GPU pinned at 100%" issue on Strix Halo was real with persistent ROCm model-server processes; managed teardown resolved it for the llama_server path.
  • vllm on Strix Halo gfx1151 is not a good default today:
    • the FP8 model path is not viable on this GPU family
    • the BF16 path can boot on ROCm, but startup is much slower
    • large vision and summarization engines contend heavily for memory
    • child engine processes must be torn down as a process group or VRAM remains pinned (the teardown pattern is sketched after this list)
  • On Strix Halo, vllm currently looks more like an experimental path than a production default.
  • For this hardware, the practical tradeoff today is:
    • llama_server: slower cold loads, but operationally reliable
    • vllm: promising future path, but currently too heavy and fragile for default use
  • The CLAP audio-event path is now live and produces a single structured primary context for summarization instead of a noisy flat hint list.
  • Live Strix Halo benchmark on March 10, 2026:
    • CLAP classified a narrated demo clip as produced narration or voice-over with high confidence
    • the summary with audio context added explicit narration context and one extra evidence-based key point compared with the no-audio version
    • the same summary call took about 27.3s with audio context versus 13.7s without it on a warm summarization endpoint
  • Current CLAP limitation from that benchmark:
    • a light synthetic music bed did not clear the supporting-cue threshold, so soundtrack sensitivity still needs calibration
  • Follow-up live Strix Halo benchmark on March 10, 2026 after the automatic support-scoring pass:
    • the same narrated benchmark clip now emits both produced narration or voice-over and noticeable music bed or soundtrack
    • the saved audio-event artifact shows soundtrack support scoring around 0.94 for that clip
    • the post-run GPU idle check still returned to 0% use at about 607-609 MHz
  • The repo now includes a checked-in CLAP baseline fixture pack and packaged baseline profile:
    • fixture manifest: backend/tests/fixtures/audio_calibration/manifest.json
    • packaged runtime baseline: backend/app/assets/audio_event_calibration.json
    • custom AUDIO_EVENT_CALIBRATION_PATH still overrides the packaged baseline when present
  • The checked-in CLAP fixture pack now distinguishes validated calibration fixtures from exploratory real clips:
    • validated calibration path:
      • voiceover_no_music
      • voiceover_with_music
      • broadcast_weather_radio
      • applause_real
    • exploratory fixtures kept out of default calibration for now:
      • meeting_room_real
      • software_demo_real
  • Live Strix Halo validation on March 10, 2026 established:
    • the real NOAA weather-radio clip separated correctly as broadcast playback
    • the real applause clip separated correctly as a crowd_applause supporting cue
    • the current real meeting and software-demo clips still collapsed toward produced narration in raw CLAP audio-only classification
    • those exploratory real clips remain useful for benchmarking and future model or prompt work, but should not tune the default calibration path yet
  • CLAP calibration had a real primary-prompt-selection bug:
    • primary prompt variants were not actually being rescored during calibration
    • that bug is now fixed
    • the packaged baseline profile has been regenerated from the four validated fixtures
  • exploratory CLAP fixtures now participate as negative contrast during calibration:
    • they can help reject over-broad prompt choices
    • they still do not count as validated positives for the packaged baseline
  • Current Mac-side exploratory CLAP measurements still show the same core limit:
    • meeting_room_real still collapses to podcast_voiceover
    • software_demo_real still leans toward podcast_voiceover
    • newer screencast candidates can flip toward meeting_room_speech, but they still do not separate cleanly enough to promote software_demo
  • Post-benchmark idle validation remained clean on Strix Halo:
    • after processing and two direct summary comparisons, the GPU returned to 0% use at roughly 608-609 MHz
  • Live Strix Halo duplicate handling is now in place:
    • duplicate policy and thresholds are configurable through /api/settings and the Settings UI
    • completed videos reject accidental reruns unless force=true is used explicitly
    • exact duplicate uploads are marked in videos.duplicate_info, skip standalone indexing, and are suppressed from default search results
    • explicit video_id or video_ids search and ask requests still allow targeting suppressed duplicates directly
  • Live Strix Halo validation on March 12, 2026 confirmed:
    • two repeated uploads of the same spoken probe were both classified as exact_duplicate with score: 1.0
    • default search for that probe content returned only the representative video after suppression
    • /api/videos/{id} now reports an active rerun as queued or processing instead of incorrectly preferring an older completed job
  • Live ai-server (NVIDIA) planner and runtime validation across March 12-14, 2026 established:
    • the machine exposes 1x RTX PRO 6000 Blackwell Workstation Edition plus 6x RTX 3090
    • the 6000 had about 97 GB free and is the correct first-choice placement when the current model set fits on one GPU
    • the planner now uses current free memory and nvidia-smi topo -m data, not only static total VRAM
    • CUDA worker-side models now honor the planner-selected device index, and local llama.cpp models can use the planner-selected CUDA main GPU or tensor split
    • the NVIDIA Docker path now has a dedicated CUDA backend image with CUDA PyTorch wheels, CUDA llama.cpp, and NeMo enabled by default
    • the CUDA image now keeps the llama-cpp-python vendored llama.cpp by default instead of force-swapping in a separate checkout; upstream overrides are now explicit
    • the NVIDIA Compose override now uses gpus: all; the earlier swarm-style reservation alone did not expose GPUs under normal Docker Compose
    • the default Compose stack now keeps PostgreSQL and Redis internal-only to avoid host port collisions during deployment
    • CUDA_VISIBLE_DEVICES must stay unset by default on NVIDIA; setting it to an empty string hides all CUDA devices from PyTorch and CTranslate2 (a small env-handling sketch appears after this list)
    • the CUDA backend image now smoke-builds successfully on that host after switching the image to a venv install, linking against CUDA driver stubs during docker build, and setting CUDA_ARCHITECTURES=86;120
    • the three NVLink-connected 3090 pairs are now recorded as fallback placement candidates when the 6000 cannot fit the active set
    • live ai-server validation on March 13, 2026 confirmed the CUDA Whisper path now loads Systran/faster-whisper-large-v3 correctly instead of the incompatible raw OpenAI snapshot layout
    • the next live blocker after that was pyannote credentials: when HF_TOKEN is missing, diarization now skips cleanly and the rest of the pipeline can continue
    • that same live CUDA pass also exposed a stale default vision-model config: the repo default now points at Qwen3VL-32B-Instruct-Q4_K_M.gguf instead of the old placeholder qwen3-omni GGUF path
    • live CUDA vision loads were not stable enough under Celery prefork; the NVIDIA worker now uses --pool=solo until the local llama.cpp path is hardened further
    • local CUDA llama.cpp now narrows CUDA_VISIBLE_DEVICES to the planner-selected GPUs before first import so device numbering matches the planner on heterogeneous multi-GPU hosts
    • the mixed NVIDIA path now behaves best with summarization pinned to a 3090 and vision pinned to the RTX PRO 6000
    • direct CUDA llama.cpp on the RTX PRO 6000 was stable enough for text summarization with the vendored llama.cpp, but not for Qwen3VL; the NVIDIA override now routes vision through official vLLM
    • live NVIDIA vision now launches the correct vLLM process on the RTX PRO 6000 and is healthy after first-start model load
    • the next live CUDA blocker after endpoint startup was torchaudio pulling in an incompatible torchcodec runtime during the optional CLAP audio-event step inside summarization
    • the audio-event path now reads extracted WAV files directly, and the worker now treats audio-event classification as fail-soft so summary generation can continue without audio hints when that optional stage breaks
    • current accelerator sizing guidance for the shipped model set is:
      • rough floor: about 24 GB free
      • practical single-accelerator target: about 32 GB free
      • warm worker models plus one local endpoint: about 50.5 GB of budget
      • single-accelerator fully resident target: about 100 GB free at the default memory fraction
    • live ai-server acceptance on March 14, 2026 now covers single upload, batch upload, summary retrieval, search, ask, and similar on the mixed 3090 + RTX PRO 6000 split
    • the runtime planner budget on large multi-GPU hosts is intentionally computed from currently free accelerator memory, while the Settings UI now shows installed accelerator memory separately so the two numbers are not conflated
    • the managed NVIDIA endpoint runtime must treat NVIDIA_VISION_VISIBLE_DEVICES and NVIDIA_SUMMARIZATION_VISIBLE_DEVICES as role-specific; reading the wrong one can pin summarization onto the vision GPU and deadlock the pipeline
    • endpoint startup retries are now bounded so a dead 503 Loading model loop fails instead of hanging jobs for many minutes
    • embedding indexing now sanitizes Chroma metadata values before insert so malformed nested slide/topic metadata cannot fail the whole job at the final embedding step
    • both ask responses and generated summaries now strip <think>...</think> reasoning blocks before returning or persisting model output (both hygiene steps are sketched after this list)
    • the runtime plan now exposes explicit worker execution mode, endpoint loading mode, per-model placement, and whether endpoint services should unload after each request
    • managed NVIDIA endpoints now auto-pick GPUs from current nvidia-smi free-memory data when explicit pins are unset
    • that NVIDIA auto-pick path now prefers the emptiest GPU that can fit the requested model instead of drifting onto a busier larger card with only slightly more free VRAM (a selection sketch appears after this list)
    • older vLLM 0.11.2 vision images were too old to recognize qwen3_5; the default NVIDIA vision image now tracks v0.17.1
    • on a single smaller NVIDIA GPU, managed endpoints now switch to stage-by-stage loading instead of trying to keep both large endpoint models resident
  • Apple Silicon now auto-detects as hardware_profile=apple_silicon with gpu_backend=metal
  • the initial Apple bring-up path uses smaller default models on unified-memory Macs:
    • small Whisper
    • nomic-ai/nomic-embed-text-v1.5
    • Qwen2.5-VL-3B-Instruct.Q4_K_M.gguf
    • Qwen2.5-3B-Instruct.Q4_K_M.gguf
  • on smaller Apple unified-memory systems, the planner now forces full stage-by-stage loading instead of trying to keep a hybrid resident set hot
  • the repo bootstrap scripts now detect Apple Silicon and stop recommending the old Vulkan Strix-Halo path
  • live Mac mini validation on March 15, 2026 exposed three host-run requirements:
    • default Apple host paths should be repo-local data/ directories instead of /data/...
    • the Apple worker should use Celery --pool=solo
    • the Metal Whisper transformers path needs the soundfile dependency installed
  • Ask mode now behaves as a real follow-up chat:
    • /api/search/ask accepts optional prior turns (a hedged request sketch appears after this list)
    • retrieval expands follow-up queries with recent user turns
    • the Search UI keeps a local conversation thread instead of replacing the last answer
  • Summary action items can now be promoted into a dedicated todo list:
    • /api/action-items supports create/update/delete/list
    • list filtering can use linked video labels
    • the video detail page can add summary suggestions or manual todos directly
    • the /todos page is the global check-off view
  • Settings now exposes model selectors for all main runtime model slots plus a verify/add flow:
    • built-in catalog entries can be reselected directly
    • custom candidates can be checked against the built-in GGUF registry or Hugging Face before being added
  • Similar-video ranking now weights transcript and key-point overlap more heavily than generic summary tone, which reduces false neighbors for generic narrated videos
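
For the idle-validation bar above, a quick check has to look at clock frequencies and power draw as well as the use percentage. A minimal sketch follows; it assumes recent rocm-smi flag names, which can differ between ROCm releases, so it is an illustration rather than the acceptance tooling.

```python
# Rough idle probe: utilization alone can read 0% while clocks or power stay high.
# Flag names assume a recent rocm-smi build and may vary across ROCm versions.
import subprocess

def show_idle_signals() -> None:
    for flags in (["--showuse"], ["--showclocks"], ["--showpower"]):
        result = subprocess.run(["rocm-smi", *flags], capture_output=True, text=True, check=False)
        print(result.stdout)

# On the Strix Halo runs noted above, "idle" meant ~0% use with sclk settling
# near its floor (roughly 607-609 MHz) and power back at idle levels.
if __name__ == "__main__":
    show_idle_signals()
```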
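
The process-group constraint above is the standard POSIX pattern: start each child engine in its own session and signal the whole group on teardown. A minimal sketch, assuming a plain subprocess launcher rather than the repo's actual runtime manager:

```python
# Launch a child engine in its own session so the whole process group
# (engine plus any worker children it forks) can be torn down together.
# The command line passed in is a placeholder, not the repo's real engine invocation.
import os
import signal
import subprocess

def launch_engine(cmd: list[str]) -> subprocess.Popen:
    # start_new_session=True puts the child in a new session / process group.
    return subprocess.Popen(cmd, start_new_session=True)

def teardown_engine(proc: subprocess.Popen, timeout: float = 30.0) -> None:
    # Signal the entire group; killing only the parent can leave VRAM pinned
    # by orphaned engine workers.
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        proc.wait()
```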
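
Two of the CUDA findings above are really one rule about CUDA_VISIBLE_DEVICES: leave it unset rather than empty, and when the planner has picked GPUs, narrow it before the first CUDA-aware import so in-process device numbering matches the plan. A minimal sketch with an assumed helper name:

```python
# CUDA_VISIBLE_DEVICES="" hides every device from PyTorch and CTranslate2, so the
# variable must be either absent or a non-empty index list.
# apply_planner_device_selection is a hypothetical helper, not the repo's API.
import os

def apply_planner_device_selection(planner_gpus: list[int] | None) -> None:
    if planner_gpus:
        # Narrow visibility so device 0 inside the process is the planner's first choice.
        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in planner_gpus)
    else:
        # Never leave an empty string behind: that hides all CUDA devices.
        os.environ.pop("CUDA_VISIBLE_DEVICES", None)

apply_planner_device_selection([1, 3])  # e.g. a planner-selected 3090 pair
# torch / llama_cpp / ctranslate2 should only be imported after this point so that
# device numbering inside the process matches the planner on multi-GPU hosts.
```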
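
The two hygiene fixes above are small enough to sketch together. Chroma only accepts str, int, float, and bool metadata values, so nested slide/topic structures have to be serialized or dropped before insert; the function names and serialization policy here are illustrative, not the repo's implementation.

```python
# Illustrative hygiene helpers; names and exact policies are assumptions.
import json
import re

_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think_blocks(text: str) -> str:
    """Drop <think>...</think> reasoning spans before persisting or returning output."""
    return _THINK_RE.sub("", text).strip()

def sanitize_chroma_metadata(metadata: dict) -> dict:
    """Coerce metadata to the scalar types Chroma accepts (str, int, float, bool)."""
    clean = {}
    for key, value in metadata.items():
        if value is None:
            continue  # Chroma rejects None values
        if isinstance(value, (str, int, float, bool)):
            clean[key] = value
        else:
            # Nested slide/topic structures get serialized instead of failing the insert.
            clean[key] = json.dumps(value, ensure_ascii=False, default=str)
    return clean

print(sanitize_chroma_metadata({"topics": ["intro", "demo"], "slide": {"index": 3}, "title": "clip"}))
print(strip_think_blocks("<think>internal reasoning</think>Final answer."))
```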
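
The auto-pick behavior above can be illustrated with a direct nvidia-smi memory query: among GPUs whose free memory covers the requested model, prefer the least busy card rather than the one with marginally more absolute free VRAM. The helper below is a sketch under that reading; the real planner also folds in topology data and explicit pins.

```python
# Pick the emptiest GPU (by free fraction) that can still fit the requested model.
# pick_gpu_for_model is an illustrative name, not the planner's actual code.
import subprocess

def query_memory_mib() -> dict[int, tuple[int, int]]:
    """Return {gpu_index: (free_mib, total_mib)} from nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.free,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    gpus = {}
    for line in out.strip().splitlines():
        index, free, total = (int(field.strip()) for field in line.split(","))
        gpus[index] = (free, total)
    return gpus

def pick_gpu_for_model(required_mib: int) -> int | None:
    candidates = {i: (free, total) for i, (free, total) in query_memory_mib().items()
                  if free >= required_mib}
    if not candidates:
        return None  # fall back to stage-by-stage loading or multi-GPU placement
    # "Emptiest" here means highest free fraction, so a mostly idle smaller card wins
    # over a busier larger card that has only slightly more absolute free VRAM.
    return max(candidates, key=lambda i: candidates[i][0] / candidates[i][1])
```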
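
For the follow-up chat behavior above, a request with prior turns might look like the sketch below. The payload field names (query, history, role, content) and the response key are assumptions about the shape, not the documented contract of /api/search/ask, and a real deployment would also carry the admin session cookie from login.

```python
# Hedged example of a follow-up ask request; field names are assumptions.
import requests

BASE = "http://localhost:8000"  # assumed local API address
session = requests.Session()    # a real run would log in first and reuse the session cookie

first = session.post(f"{BASE}/api/search/ask", json={"query": "What does the demo cover?"}).json()

followup = session.post(
    f"{BASE}/api/search/ask",
    json={
        "query": "And what were the action items from it?",
        # Prior turns let retrieval expand the follow-up query with recent user context.
        "history": [
            {"role": "user", "content": "What does the demo cover?"},
            {"role": "assistant", "content": first.get("answer", "")},
        ],
    },
).json()
print(followup)
```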

High Priority Requirements

  • Keep AMD/Strix Halo fully functional on ROCm unless CPU execution is benchmarked to be equally fast for the exact stage and model in question.
  • Run a full end-to-end security review across API, frontend, model/runtime orchestration, remote endpoints, secrets handling, uploads, search/ask flows, and deployment surfaces so the product is defensible as a secure local-first system.
  • Security enhancement ranking after the March 16, 2026 pass:
    • High:
      • add per-user auth/RBAC if the product needs multi-user or shared-host deployments; the current model is still intentionally single-admin even with OIDC
      • add stricter export/search quotas once real production traffic patterns are known
      • complete the remaining end-to-end security review against the newly secured live targets and deployment docs
    • Medium:
      • add stronger reverse-proxy/web-server headers, especially CSP, if the app is fronted by a production ingress
      • add optional session/device management views if operators need to revoke other active sessions
      • review artifact-retention controls for uploads, transcripts, slides, embeddings, and exports beyond the new audit-log retention
    • Low:
      • tighten origin defaults further for named-host deployments beyond localhost/private-network use
      • add optional MFA only if the product moves beyond the current trusted local-admin model
  • Standardize the Strix Halo host baseline around a known-good ROCm/kernel combination and verify it during live acceptance so ROCm regressions are caught outside the app code too.
  • Replace the current fixed model-loading behavior with a dynamic residency planner that:
    • detects GPU count, memory, and topology automatically
    • determines whether models can remain resident instead of unloading between stages
    • respects a configurable memory ceiling so the runtime stays under a user-defined VRAM or unified-memory budget
    • decides when models should be pinned to one GPU, shared across GPUs, or split across GPUs (a rough decision sketch follows this list)
  • Reduce AMD cold-start overhead for large embedding models so the final embedding stage does not dominate batch tail latency.
  • Keep live validation on the Strix Halo as part of the acceptance bar, including small real video fixtures.
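
To make the residency-planner requirement above concrete, the core decision is roughly: given the detected accelerators and a configurable memory ceiling, decide which models stay resident and where, and fall back to stage-by-stage loading for anything that does not fit. The sketch below is a deliberately small greedy version under those assumptions; the type and function names are illustrative, not a design commitment.

```python
# Minimal residency decision sketch: keep models resident while the working set fits
# under the configured ceiling, otherwise mark them for stage-by-stage loading.
# GpuSpec, ModelSpec, and plan_residency are illustrative names only.
from dataclasses import dataclass

@dataclass
class GpuSpec:
    index: int
    free_gib: float

@dataclass
class ModelSpec:
    name: str
    size_gib: float

def plan_residency(gpus: list[GpuSpec], models: list[ModelSpec], ceiling_fraction: float = 0.9) -> dict:
    placements: dict[str, int] = {}
    budget = {g.index: g.free_gib * ceiling_fraction for g in gpus}
    for model in sorted(models, key=lambda m: m.size_gib, reverse=True):
        # Place the largest models first on the GPU with the most remaining budget.
        best = max(budget, key=budget.get)
        if budget[best] >= model.size_gib:
            placements[model.name] = best
            budget[best] -= model.size_gib
        else:
            # Nothing fits under the ceiling: this model loads per stage and unloads after.
            placements[model.name] = -1
    resident = all(gpu >= 0 for gpu in placements.values())
    return {"placements": placements, "mode": "resident" if resident else "stage_by_stage"}

print(plan_residency([GpuSpec(0, 96.0)], [ModelSpec("summarization", 20.0), ModelSpec("vision", 24.0)]))
```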

Active Functional Work

  • Replace automatic transcription fallback with an explicit ASR model switcher or selector so one chosen ASR path runs deterministically instead of silently falling back to Whisper.
  • Add full NVIDIA runtime support after the AMD ROCm path is solid.
  • Live-validate the new macOS Metal bring-up path on an actual Apple Silicon machine.

Product And Pipeline Enhancements

  • Keep sourcing better real meeting-room and software-demo fixtures until CLAP can separate them cleanly enough to promote them from exploratory to validated calibration inputs.
  • Add external sync/export for the action-item system once the target integration format is decided.
  • Let the summarization model reconcile CLAP-derived audio hints with transcript, slides, and vision context, but do not rely on the transcription model alone for audio-scene hints because ASR text drops nonverbal audio information.
  • Keep benchmarking whether CLAP-derived audio context materially improves summaries enough to justify the extra prompt and latency cost on Strix Halo and NVIDIA.
  • Incremental recomputation with step-level caching and lineage or versioning for transcripts, frames, vision metadata, and summaries.
  • Multi-index retrieval routing (transcript, slides, vision) with reranking for better long-video QA.
  • Optional live transcription plus audio capture or mixing mode for real-time meetings and demos.
  • Log clarity: show the actual ASR backend instead of the generic whisper label.
  • Ensure UI and logs reflect Canary when the ASR backend is Canary.