Commit fc09b4c

Merge pull request #13 from msitarzewski/question-refinement
Question refinement, native web search, citations, streaming
2 parents 3a6dfcf + c9154bb commit fc09b4c

53 files changed

Lines changed: 2651 additions & 221 deletions

memory-bank/activeContext.md

Lines changed: 54 additions & 19 deletions
@@ -1,37 +1,72 @@
11
# Active Context
22

3-
**Last Updated**: 2026-03-07
4-
**Current Phase**: Post v0.6.0: z-index fix + GPT-5.4 + .env improvements
5-
**Next Action**: PR ready for review
3+
**Last Updated**: 2026-03-08
4+
**Current Phase**: `question-refinement` branch: pre-consensus question refinement, native web search, citations, tools-by-default
5+
**Next Action**: Branch in progress, uncommitted changes staged
66

7-
## Latest Work (2026-03-07)
7+
## Latest Work (2026-03-08)
88

9-
### Z-index stacking context fix
10-
- **Problem**: Nested stacking contexts (`z-10` on main content, `z-20` on TopBar header) trapped dropdowns inside containers. Account menu's `fixed inset-0 z-40` backdrop was meaningless outside its container.
11-
- **Fix**: Removed unnecessary z-index values creating stacking contexts, added `isolate` to Shell root, defined z-index tokens in CSS (`--z-background`, `--z-dropdown`, `--z-overlay`, `--z-modal`), replaced backdrop hack with `useRef` + `mousedown` click-outside pattern (matching ExportMenu).
12-
- Files: `duh-theme.css`, `Shell.tsx`, `TopBar.tsx`, `GridOverlay.tsx`, `ParticleField.tsx`, `ExportMenu.tsx`, `ConsensusComplete.tsx`, `ThreadDetail.tsx`
9+
### Question Refinement
10+
- Pre-consensus clarification step: analyze question → ask clarifying questions → enrich with answers → proceed to consensus
11+
- `src/duh/consensus/refine.py`: `analyze_question()` + `enrich_question()`, uses the MOST EXPENSIVE model (not the cheapest)
12+
- API: `POST /api/refine` → `RefineResponse{needs_refinement, questions[]}`, `POST /api/enrich` → `EnrichResponse{enriched_question}`
13+
- CLI: `duh ask --refine "question"` — interactive `click.prompt()` loop, default `--no-refine`
14+
- Frontend: consensus store `'refining'` status, `submitQuestion` → refine → clarify → enrich → `startConsensus`
15+
- `RefinementPanel.tsx` — tabbed UI inside GlassPanel, checkmarks on answered tabs, Skip + Start Consensus buttons
16+
- Graceful fallback: any failure → proceed to consensus with original question
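
The refinement flow in the bullets above (analyze, clarify, enrich, then fall back to the original question on any failure) can be sketched as follows. This is a minimal illustration, not the code in `refine.py`: `RefineResult`, `refine_or_fallback`, and the injected `analyze`, `ask_user`, and `enrich` callables are hypothetical names chosen for the example.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class RefineResult:
    """Hypothetical stand-in for the RefineResponse shape named above."""
    needs_refinement: bool
    questions: list = field(default_factory=list)


async def refine_or_fallback(question, analyze, ask_user, enrich):
    """Run analyze -> clarify -> enrich; on any failure return the original question."""
    try:
        result = await analyze(question)
        if not result.needs_refinement:
            return question
        answers = [ask_user(q) for q in result.questions]
        return await enrich(question, list(zip(result.questions, answers)))
    except Exception:
        # Graceful fallback: consensus proceeds with the unmodified question.
        return question


async def _demo():
    async def analyze(q):
        return RefineResult(True, ["Which region?"])

    async def broken_analyze(q):
        raise RuntimeError("provider unavailable")

    async def enrich(q, qa):
        return q + " (" + "; ".join(f"{k} {v}" for k, v in qa) + ")"

    ok = await refine_or_fallback("Best ISP?", analyze, lambda q: "EU", enrich)
    fb = await refine_or_fallback("Best ISP?", broken_analyze, lambda q: "EU", enrich)
    return ok, fb


refined, fallback = asyncio.run(_demo())
```

Wrapping the whole sequence in one `try` keeps the fallback rule simple: whichever step fails, the caller still receives a usable question.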
1317

14-
### GPT-5.4 added to model catalog
15-
- `gpt-5.4`: 1M context, 128K output, $2.50/$15.00 per MTok, no temperature (uses reasoning.effort)
16-
- Added to `NO_TEMPERATURE_MODELS` set
17-
- File: `src/duh/providers/catalog.py`
18+
### Native Provider Web Search
19+
- Providers use server-side search instead of DDG proxy when `config.tools.web_search.native` is true
20+
- `web_search: bool` param added to `ModelProvider.send()` protocol
21+
- Anthropic: `web_search_20250305` server tool in tools[]
22+
- Google: `GoogleSearch()` grounding (replaces function tools — can't coexist)
23+
- Mistral: `{"type": "web_search"}` appended to tools
24+
- OpenAI: `web_search_options={}` only for `_SEARCH_MODELS` set; others fall back to DDG
25+
- Perplexity: no-op (always searches natively)
26+
- `tool_augmented_send`: filters DDG `web_search` tool when native=True, passes flag to provider
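
As a rough sketch of the per-provider branching listed above, the function below returns extra request arguments plus a flag saying whether search is handled natively. It is an assumption-laden illustration, not the repository's code: `native_search_kwargs` and `drop_ddg_tool` are hypothetical names, the contents of `_SEARCH_MODELS` are invented, the `"name"` field on the Anthropic tool is taken from Anthropic's public tool format rather than from this diff, and Google's `GoogleSearch()` SDK object is represented by a placeholder flag.

```python
# Hypothetical contents; the real `_SEARCH_MODELS` set is not shown in this diff.
_SEARCH_MODELS = {"example-search-model"}


def native_search_kwargs(provider, model, tools):
    """Return (extra request kwargs, handled_natively) for one provider."""
    if provider == "anthropic":
        server_tool = {"type": "web_search_20250305", "name": "web_search"}
        return {"tools": tools + [server_tool]}, True
    if provider == "google":
        # Grounding replaces function tools, so the generic tools are dropped.
        # The flag is a placeholder for the SDK's GoogleSearch() object.
        return {"tools": [], "google_search_grounding": True}, True
    if provider == "mistral":
        return {"tools": tools + [{"type": "web_search"}]}, True
    if provider == "openai":
        if model in _SEARCH_MODELS:
            return {"tools": tools, "web_search_options": {}}, True
        return {"tools": tools}, False  # caller keeps the DDG tool
    if provider == "perplexity":
        return {"tools": tools}, True  # no-op: always searches natively
    return {"tools": tools}, False


def drop_ddg_tool(tools, native):
    """Filter out the DDG-backed `web_search` tool when native search is on."""
    if not native:
        return tools
    return [t for t in tools if t.get("name") != "web_search"]
```

Returning the `handled_natively` flag alongside the arguments lets the caller decide whether the DDG tool still needs to be offered to the model.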
1827

19-
### .env improvements
20-
- Added provider API key placeholders to `.env.example` (ANTHROPIC, OPENAI, GOOGLE, PERPLEXITY, MISTRAL)
21-
- Updated README quick start with all provider env vars + `.env` reference
22-
- Note: Google key env var is `GOOGLE_API_KEY` (not `GEMINI_API_KEY`)
28+
### Citations — Persisted + Domain-Grouped
29+
- `Citation` dataclass (url, title, snippet) on `ModelResponse.citations`
30+
- Extraction per provider: Anthropic (`web_search_tool_result`), Google (grounding metadata), Perplexity (`response.citations`)
31+
- **Persistence**: `citations_json` TEXT column on `Contribution` model, SQLite auto-migration via `ensure_schema()`
32+
- `proposal_citations` tracked on `ConsensusContext` → archived to `RoundResult` → persisted via `_persist_consensus`
33+
- Thread detail API returns `citations` on `ContributionResponse`
34+
- **Domain-grouped Sources nav**: ConsensusNav (live) + ThreadNav (stored) group citations by hostname
35+
- Nested Disclosure: outer "Sources (17)" → inner "wikipedia.org (3)" → P/C/R role badges per citation
36+
- P (green) = propose, C (amber) = challenge, R (blue) = revise
37+
- `CitationList` shared component for inline display below content
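
The domain grouping described above runs in the frontend (ConsensusNav and ThreadNav), but the idea can be sketched in Python for consistency with the backend. `Citation` mirrors the field names given in the bullets; `group_by_domain`, `ROLE_BADGES`, and the decision to strip a leading `www.` are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Citation:
    url: str
    title: str = ""
    snippet: str = ""


# Badge letters as listed above: P = propose, C = challenge, R = revise.
ROLE_BADGES = {"propose": "P", "challenge": "C", "revise": "R"}


def group_by_domain(cited):
    """Group (role, Citation) pairs by hostname, tagging each with its role badge."""
    groups = defaultdict(list)
    for role, citation in cited:
        host = urlparse(citation.url).hostname or "unknown"
        if host.startswith("www."):
            host = host[4:]  # assumption: treat www.example.com as example.com
        groups[host].append((ROLE_BADGES.get(role, "?"), citation))
    return dict(groups)


groups = group_by_domain([
    ("propose", Citation("https://www.wikipedia.org/a")),
    ("challenge", Citation("https://wikipedia.org/b")),
    ("revise", Citation("https://example.com/c")),
])
```

The outer disclosure count would then be the total number of pairs, and each inner count the length of one group.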
38+
39+
### Anthropic Streaming + max_tokens
40+
- `AnthropicProvider.send()` now uses streaming internally via `_collect_stream()` — avoids 10-minute timeout
41+
- `max_tokens` bumped from 16384 → 32768 across all 6 handler defaults (propose, challenge, revise, commit, voting, decomposition)
42+
- Citations are part of the value — truncating them undermines trust
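
A minimal sketch of the collect-a-stream pattern follows, assuming the async Anthropic client's `messages.stream()` context manager and `get_final_message()` helper. The function name `collect_stream` and the fake client classes are hypothetical; the fakes exist only so the example runs without the SDK or network access, and they also show the shape a test mock would need.

```python
import asyncio


async def collect_stream(client, **kwargs):
    """Open a streaming request and return the fully assembled final message."""
    async with client.messages.stream(**kwargs) as stream:
        return await stream.get_final_message()


class _FakeStream:
    """Stand-in for the SDK stream object (illustration only)."""

    def __init__(self, kwargs):
        self.kwargs = kwargs

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False

    async def get_final_message(self):
        return {"max_tokens": self.kwargs["max_tokens"], "content": "done"}


class _FakeMessages:
    def stream(self, **kwargs):
        return _FakeStream(kwargs)


class _FakeClient:
    messages = _FakeMessages()


message = asyncio.run(
    collect_stream(_FakeClient(), model="example-model", max_tokens=32768)
)
```

Because the caller still receives one complete message, code that parses citations or tool calls does not need to know that streaming happened underneath.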
43+
44+
### Parallel Challenge Streaming
45+
- `_stream_challenges()` in `ws.py` uses `asyncio.as_completed()` to send each challenge result to the frontend as it finishes
46+
- Previously: all challengers ran in parallel but results were batched after all completed
47+
- Now: first challenger to respond appears immediately in the UI
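
The difference between batching and streaming results can be illustrated with `asyncio.as_completed()`. This is a sketch under assumptions, not the `_stream_challenges()` implementation: `stream_challenges`, `send`, and the demo challengers are hypothetical names.

```python
import asyncio


async def stream_challenges(challengers, send):
    """Send each challenge result as soon as it is ready, in completion order."""
    tasks = [asyncio.ensure_future(make_call()) for make_call in challengers]
    results = []
    for finished in asyncio.as_completed(tasks):
        result = await finished
        await send(result)  # e.g. push one WebSocket event per challenger
        results.append(result)
    return results


async def _demo():
    async def challenger(name, delay):
        await asyncio.sleep(delay)
        return name

    sent = []

    async def send(result):
        sent.append(result)

    await stream_challenges(
        [lambda: challenger("slow", 0.05), lambda: challenger("fast", 0.0)],
        send,
    )
    return sent


order = asyncio.run(_demo())
```

Here the "fast" challenger is delivered first even though it was listed second, which is the behavior the notes describe: display order follows completion speed, not configuration order.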
48+
49+
### Tools Enabled by Default
50+
- `web_search` tool wired through CLI, REST, and WebSocket paths by default
51+
- Provider tool format fix: `tool_augmented_send` builds generic `{name, description, parameters}` — each provider transforms to native format in `send()`
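
To make the generic-to-native transform concrete, here is a sketch for two providers. The target shapes follow the publicly documented Anthropic (`input_schema`) and OpenAI chat-completions (`type: "function"`) tool formats; the function names and the sample tool definition are assumptions and may not match what each provider's `send()` does in this repository.

```python
def to_anthropic_tool(tool):
    """Generic {name, description, parameters} -> Anthropic custom tool shape."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["parameters"],
    }


def to_openai_tool(tool):
    """Generic {name, description, parameters} -> OpenAI function tool shape."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["parameters"],
        },
    }


# Hypothetical generic definition, as a tool registry might produce it.
generic = {
    "name": "web_search",
    "description": "Search the web for current information.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

anthropic_tool = to_anthropic_tool(generic)
openai_tool = to_openai_tool(generic)
```

Keeping one generic definition and converting at the provider boundary means the tool registry never needs to know which provider will receive it.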
52+
53+
### Sidebar UX
54+
- New-question button (Heroicons pencil-square) + collapsible sidebar toggle
55+
- Shell manages `desktopSidebarOpen` (default true) + `mobileSidebarOpen` separately
56+
- TopBar shows sidebar toggle when desktop sidebar collapsed or always on mobile
2357

2458
### Test Results
25-
- 1603 Python tests + 185 Vitest tests (1788 total)
59+
- 1641 Python tests + 194 Vitest tests (1835 total)
2660
- Build clean, all tests pass
2761

2862
---
2963

3064
## Current State
3165

32-
- **Branch `ux-cleanup`**: z-index fix, GPT-5.4, .env docs
33-
- **1603 Python tests + 185 Vitest tests** (1788 total)
66+
- **Branch `question-refinement`**: in progress, not yet merged
67+
- **1641 Python tests + 194 Vitest tests** (1835 total)
3468
- All previous features intact (v0.1–v0.6)
69+
- Prior work merged: z-index fix, GPT-5.4, .env docs, password reset
3570

3671
## Open Questions (Still Unresolved)
3772

memory-bank/decisions.md

Lines changed: 85 additions & 1 deletion
@@ -1,6 +1,6 @@
11
# Architectural Decisions
22

3-
**Last Updated**: 2026-02-18
3+
**Last Updated**: 2026-03-08
44

55
---
66

@@ -354,3 +354,87 @@
354354
- Manual migration instructions in docs (user friction)
355355
**Consequences**: File-based SQLite databases auto-migrate on startup. Zero friction for local users. PostgreSQL still requires `alembic upgrade head`. Lightweight and self-contained.
356356
**References**: `src/duh/memory/migrations.py`, `src/duh/cli/app.py:107-110`
357+
358+
---
359+
360+
## 2026-03-08: Native Provider Web Search Over DDG Proxy
361+
362+
**Status**: Approved
363+
**Context**: The original web search tool used DuckDuckGo as a proxy — every provider's tool calls went through DDG, which returned index pages rather than real content. Most major providers now offer server-side web search that returns higher-quality results with citations.
364+
**Decision**: Add `web_search: bool` parameter to the `ModelProvider.send()` protocol. When `config.tools.web_search.native` is true, each provider uses its native search capability: Anthropic (`web_search_20250305` server tool), Google (`GoogleSearch()` grounding), Mistral (`{"type": "web_search"}`), OpenAI (`web_search_options`), Perplexity (always native). DDG proxy remains as fallback for providers/models that don't support native search.
365+
**Alternatives**:
366+
- DDG-only (simpler, but returns low-quality index pages instead of real content)
367+
- Single search provider for all (e.g., Bing API — adds external dependency and API key)
368+
- Remove web search entirely (loses grounding capability)
369+
**Consequences**: Higher quality search results with real content. Citations extractable from provider responses. Each provider has different native search API shape — increases per-provider complexity. Google grounding and function declarations can't coexist (grounding replaces function tools).
370+
**References**: `src/duh/providers/anthropic.py`, `src/duh/providers/google.py`, `src/duh/providers/mistral.py`, `src/duh/providers/openai.py`, `src/duh/tools/augmented_send.py`
371+
372+
---
373+
374+
## 2026-03-08: Question Refinement Uses Most Expensive Model
375+
376+
**Status**: Approved
377+
**Context**: Question refinement analyzes user questions before consensus to determine if clarification is needed. The analysis quality directly impacts downstream consensus quality — a poorly refined question wastes all subsequent model calls.
378+
**Decision**: `analyze_question()` and `enrich_question()` in `src/duh/consensus/refine.py` use the most expensive configured model (sorted by cost), not the cheapest. The refinement step is a single model call, so the cost difference is minimal compared to the full consensus round it precedes.
379+
**Alternatives**:
380+
- Cheapest model (saves tokens, but poor analysis leads to poor consensus)
381+
- User-configurable refinement model (adds UX complexity)
382+
- Multi-model refinement (overkill — single strong model is sufficient for question analysis)
383+
**Consequences**: Better question analysis quality. Marginal cost increase (one extra expensive model call). Graceful fallback on failure — original question proceeds to consensus unchanged.
384+
**References**: `src/duh/consensus/refine.py`, `src/duh/api/routes/ask.py`, `src/duh/cli/app.py`
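
One way to read "most expensive configured model (sorted by cost)" is sketched below. The decision record does not say which price is compared, so ranking by output price and then input price is an assumption, as are the `ModelInfo` type, the function name, and the second catalog entry. The `gpt-5.4` prices are the ones quoted elsewhere in this commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    model_id: str
    input_cost_per_mtok: float
    output_cost_per_mtok: float


def pick_refinement_model(models):
    """Return the priciest model, ranking by output cost, then input cost."""
    if not models:
        raise ValueError("no models configured")
    return max(
        models,
        key=lambda m: (m.output_cost_per_mtok, m.input_cost_per_mtok),
    )


configured = [
    ModelInfo("hypothetical-small", 0.10, 0.40),  # invented entry for contrast
    ModelInfo("gpt-5.4", 2.50, 15.00),
]
chosen = pick_refinement_model(configured)
```

Since refinement is a single call before a multi-model consensus round, choosing by price is a cheap proxy for choosing the strongest available model.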
385+
386+
---
387+
388+
## 2026-03-08: Tools Enabled by Default
389+
390+
**Status**: Approved
391+
**Context**: Web search was originally opt-in. Users who didn't know about the `--tools` flag got ungrounded responses. Most queries benefit from web search grounding.
392+
**Decision**: `web_search` tool is enabled by default across CLI, REST API, and WebSocket paths. The `config.tools.web_search` section controls behavior. Native provider search is preferred when available.
393+
**Alternatives**:
394+
- Opt-in only (simpler, but most users miss it)
395+
- Always-on with no config (inflexible for cost-sensitive users)
396+
- Per-question tool selection (too much UX friction)
397+
**Consequences**: Better default experience — responses are grounded in current information. Slightly higher cost per query (search tool calls). Users can disable via config if needed.
398+
**References**: `src/duh/config/schema.py`, `src/duh/cli/app.py`, `src/duh/api/routes/ws.py`
399+
400+
---
401+
402+
## 2026-03-08: Citation Persistence on Contributions
403+
404+
**Status**: Approved
405+
**Context**: Citations were emitted over WebSocket during live consensus but never persisted. Viewing a thread later from the Threads section showed no sources — undermining the trust value of native web search.
406+
**Decision**: Add `citations_json` TEXT column to the `Contribution` model (nullable, JSON-encoded list of `{url, title}`). Track `proposal_citations` on `ConsensusContext` and archive to `RoundResult`. Serialize and persist during `_persist_consensus`. Thread detail API returns parsed citations on `ContributionResponse`. ThreadNav shows domain-grouped sources matching ConsensusNav.
407+
**Alternatives**:
408+
- Separate Citation table with FK to Contribution (more normalized, but adds query complexity for marginal benefit)
409+
- Store citations only on Decision (loses per-role attribution)
410+
- Don't persist (simpler, but citations are essential to trust)
411+
**Consequences**: Citations survive beyond the WebSocket session. Thread detail view shows sources grouped by domain with role attribution (P/C/R). SQLite auto-migration handles existing databases. Slightly larger DB rows due to JSON text.
412+
**References**: `src/duh/memory/models.py:146`, `src/duh/api/routes/threads.py`, `src/duh/api/routes/ws.py`, `web/src/components/threads/ThreadNav.tsx`
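
A sketch of how a nullable, JSON-encoded `citations_json` column could be written and read back. The helper names are hypothetical, and tolerating malformed stored JSON by returning an empty list is an assumption about desired behavior rather than something this decision states.

```python
import json
from typing import Optional


def dump_citations(citations) -> Optional[str]:
    """Serialize citations to JSON text, or None so the column stays NULL."""
    if not citations:
        return None
    return json.dumps(
        [{"url": c["url"], "title": c.get("title", "")} for c in citations]
    )


def load_citations(raw: Optional[str]) -> list:
    """Parse the stored column; NULL or malformed text yields an empty list."""
    if not raw:
        return []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    return data if isinstance(data, list) else []


stored = dump_citations([{"url": "https://example.com/a", "title": "Example"}])
round_trip = load_citations(stored)
```

Storing `None` for contributions without citations keeps older rows and citation-free rows indistinguishable, which fits a nullable column added by migration.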
413+
414+
---
415+
416+
## 2026-03-08: Anthropic Streaming Internally in send()
417+
418+
**Status**: Approved
419+
**Context**: Increasing `max_tokens` to 32768 triggered Anthropic SDK's 10-minute timeout error: "Streaming is required for operations that may take longer than 10 minutes." The `send()` method used non-streaming `messages.create()`.
420+
**Decision**: `send()` now calls `_collect_stream()` which uses `messages.stream()` as a context manager and collects the final `Message` via `get_final_message()`. The returned object is identical to `messages.create()` output, so all downstream parsing (citations, tool calls, text concatenation) works unchanged.
421+
**Alternatives**:
422+
- Keep non-streaming and lower max_tokens (loses citation content to truncation)
423+
- Full streaming to frontend (larger change, separate concern)
424+
- Increase Anthropic client timeout (fragile, doesn't scale)
425+
**Consequences**: No more timeout errors at any max_tokens value. Test mocks must mock `messages.stream` context manager instead of `messages.create`. Marginal latency increase from stream overhead (negligible vs network time).
426+
**References**: `src/duh/providers/anthropic.py:222-229`
427+
428+
---
429+
430+
## 2026-03-08: Parallel Challenge Streaming via as_completed
431+
432+
**Status**: Approved
433+
**Context**: Challengers were already running in parallel via `asyncio.gather` in `handle_challenge`, but the WebSocket handler sent all results after ALL challengers finished. Users saw nothing until the slowest challenger responded.
434+
**Decision**: New `_stream_challenges()` function in `ws.py` uses `asyncio.as_completed()` to send each challenge result to the frontend immediately as each completes. Builds `ChallengeResult` objects and updates `ctx.challenges` directly, bypassing `handle_challenge`.
435+
**Alternatives**:
436+
- Keep batched approach (simpler, but poor UX — users wait for slowest model)
437+
- Token-level streaming per challenger (much more complex, requires protocol changes)
438+
- Sequential challengers (defeats the purpose of multi-model)
439+
**Consequences**: First challenger to respond appears immediately. More engaging real-time experience. WS test mocks now patch `_stream_challenges` instead of `handle_challenge`. Challenge order in UI reflects completion speed, not configuration order.
440+
**References**: `src/duh/api/routes/ws.py:253-347`, `tests/unit/test_api_ws.py`

memory-bank/progress.md

Lines changed: 36 additions & 1 deletion
@@ -4,7 +4,36 @@
44

55
---
66

7-
## Current State: v0.6.0 — "It's Honest" COMPLETE
7+
## Current State: Post v0.6.0 — `question-refinement` Branch In Progress
8+
9+
### Question Refinement + Native Web Search + Citations (2026-03-08)
10+
11+
- **Question refinement**: pre-consensus clarification step (analyze → clarify → enrich → consensus)
12+
- `src/duh/consensus/refine.py`, API routes (`/api/refine`, `/api/enrich`), CLI `--refine` flag
13+
- Frontend: `RefinementPanel.tsx` tabbed UI, consensus store `'refining'` status
14+
- Graceful fallback on failure → original question proceeds to consensus
15+
- **Native provider web search**: Anthropic/Google/Mistral/OpenAI/Perplexity use server-side search
16+
- `web_search: bool` param on `ModelProvider.send()` protocol
17+
- `config.tools.web_search.native` flag controls behavior
18+
- DDG proxy still available as fallback for non-native providers
19+
- **Citations**: `Citation` dataclass on `ModelResponse`, extracted per provider, displayed in frontend
20+
- `CitationList` shared component, `ConsensusNav` collapsible Sources sidebar section
21+
- WS events include `citations` array for PROPOSE and CHALLENGE phases
22+
- **Tools enabled by default**: `web_search` (DuckDuckGo) wired through all paths (CLI, REST, WS)
23+
- **Provider tool format fix**: each provider transforms generic tool defs to native API format
24+
- **Sidebar UX**: new-question button + collapsible sidebar toggle
25+
- **Citation persistence**: `citations_json` on Contribution model, SQLite migration, thread detail API returns citations
26+
- **Domain-grouped Sources**: ConsensusNav + ThreadNav group citations by hostname with Disclosure, P/C/R role badges
27+
- **Anthropic streaming**: `send()` uses `_collect_stream()` internally to avoid 10-min timeout on large max_tokens
28+
- **Parallel challenge streaming**: `_stream_challenges()` sends each result to frontend as it completes via `asyncio.as_completed`
29+
- **max_tokens 32768**: bumped from 16384 across all handlers — citations are essential to trust
30+
- 1641 Python tests + 194 Vitest tests (1835 total), build clean
31+
32+
### Z-index Fix + GPT-5.4 + .env Docs (2026-03-07)
33+
34+
- Z-index stacking context fix, GPT-5.4 model catalog entry, .env.example provider keys
35+
- Password reset flow, SMTP mail module, JWT-scoped tokens
36+
- 1603 Python tests + 185 Vitest tests (1788 total)
837

938
### Consensus Navigation & Collapsible Sections
1039

@@ -195,3 +224,9 @@ Phase 0 benchmark framework — fully functional, pilot-tested on 5 questions.
195224
| 2026-03-07 | GPT-5.4 added to model catalog (1M ctx, $2.50/$15.00, no-temperature) | Done |
196225
| 2026-03-07 | .env.example updated with provider API key placeholders | Done |
197226
| 2026-03-07 | README updated with all provider env vars | Done |
227+
| 2026-03-08 | Question refinement (analyze → clarify → enrich → consensus) | In Progress |
228+
| 2026-03-08 | Native provider web search (Anthropic/Google/Mistral/OpenAI/Perplexity) | In Progress |
229+
| 2026-03-08 | Citations extraction + frontend CitationList + ConsensusNav Sources | In Progress |
230+
| 2026-03-08 | Tools enabled by default (web_search wired through CLI/REST/WS) | In Progress |
231+
| 2026-03-08 | Provider tool format fix (generic → native transform per provider) | In Progress |
232+
| 2026-03-08 | Sidebar UX (new-question button, collapsible toggle) | In Progress |
