diff --git a/docs/source/index.md b/docs/source/index.md index 03fb861f..44d0e3e8 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -80,6 +80,13 @@ Overview <./evaluation/index.md> Benchmarks <./evaluation/benchmarks/index.md> ``` +```{toctree} +:hidden: +:caption: Profiling + +Profiling & Cost Analysis <./profiling/index.md> +``` + ```{toctree} :hidden: :caption: Deployment diff --git a/docs/source/profiling/index.md b/docs/source/profiling/index.md new file mode 100644 index 00000000..9e1620a9 --- /dev/null +++ b/docs/source/profiling/index.md @@ -0,0 +1,272 @@ + + +# Profiling and Cost Analysis + +The AI-Q blueprint integrates with the NeMo Agent Toolkit (NAT) profiler to capture detailed execution traces from every evaluation run. These traces record every LLM call, tool invocation, token count, and timestamp across the full multi-agent pipeline. A post-eval tokenomics report then combines that trace data with your configured pricing to produce a complete cost and performance breakdown — down to individual LLM calls and external API charges. + +```{note} +Profiling is a post-eval analysis feature. You run the agent normally via `nat eval`; the profiler is activated by adding a `profiler:` block to your eval config. No changes to agent code are required. +``` + +## What the Profiler Captures + +Each profiling run produces two output files in `eval.general.output_dir`: + +| File | Contents | +|------|----------| +| `all_requests_profiler_traces.json` | One entry per query. Each entry contains every event (LLM calls, tool calls, workflow start/end) with full token usage, timestamps, and model names. | +| `standardized_data_all.csv` | A flat CSV view of the same events, enriched with NAT-computed metrics such as predicted output sequence length (NOVA-Predicted-OSL), token uniqueness, and bottleneck flags. | + +## Enabling the Profiler + +Add a `profiler:` block under `eval.general` in your config file. 
The profiling config for the Deep Research Bench is at: + +``` +frontends/benchmarks/deepresearch_bench/configs/config_deep_research_bench_profiling.yml +``` + +The relevant `eval` section looks like this: + +```yaml +eval: + general: + workflow_alias: "aiq-deepresearcher" + output_dir: frontends/benchmarks/deepresearch_bench/results + max_concurrency: 4 + profiler: + # Compute inter-query token uniqueness (measures how much prompt content is reused) + token_uniqueness_forecast: true + # Estimate expected wall-clock runtime given the observed concurrency pattern + workflow_runtime_forecast: true + # Compute ISL/OSL/TPS and related LLM efficiency metrics + compute_llm_metrics: true + # Exclude large I/O text from the CSV to keep it structurally valid + csv_exclude_io_text: true + # Identify common prompt prefixes that are good candidates for prompt caching + prompt_caching_prefixes: + enable: true + min_frequency: 0.1 + # Identify the critical path and nested bottlenecks in the agent call graph + bottleneck_analysis: + enable_nested_stack: true + # Detect concurrency spikes that cause queuing + concurrency_spike_analysis: + enable: true + spike_threshold: 7 + # Build a prediction trie to generate Dynamo routing hints + prediction_trie: + enable: true + auto_sensitivity: true + sensitivity_scale: 5 + # Scoring weights (must sum to 1.0) + w_critical: 0.5 + w_fanout: 0.3 + w_position: 0.2 + w_parallel: 0.0 + dataset: + _type: json + file_path: frontends/benchmarks/deepresearch_bench/data/drb_full_dataset.json + structure: + question_key: question + answer_key: expected_output + generated_answer_key: generated_answer + filter: + allowlist: + field: + id: [88, 80, 84, 90, 59, 51, 94, 96, 91, 99, 93, 86, 67, 100, 72, 76] +``` + +### Profiler option reference + +| Option | Description | +|--------|-------------| +| `token_uniqueness_forecast` | Measures the fraction of prompt tokens that are unique across queries. 
High uniqueness means little opportunity for cross-query caching. | +| `workflow_runtime_forecast` | Estimates how long the full dataset would take to process at the observed concurrency level. Useful for capacity planning. | +| `compute_llm_metrics` | Emits per-call ISL, OSL, TPS, and latency into the CSV. Required for the tokenomics report's token distribution charts. | +| `csv_exclude_io_text` | Strips raw prompt/completion text from the CSV output. Keeps the file manageable when completions are long. Does not affect the JSON trace. | +| `prompt_caching_prefixes.min_frequency` | Only report a common prefix if it appears in at least this fraction of calls (0.1 = 10%). Reduces noise from incidental prefix matches. | +| `bottleneck_analysis.enable_nested_stack` | Produces a nested critical-path stack rather than a simple flat one. More accurate for deeply nested agent graphs. | +| `concurrency_spike_analysis.spike_threshold` | Number of simultaneous in-flight LLM calls that constitutes a spike. Spikes cause queuing and inflate p99 latency. | +| `prediction_trie` | Builds a routing trie for NVIDIA Dynamo. Each leaf carries a latency sensitivity score based on position on the critical path, fan-out, and call-index weighting. | + +## Running a Profiling Evaluation + +```bash +dotenv -f deploy/.env run nat eval \ + --config_file frontends/benchmarks/deepresearch_bench/configs/config_deep_research_bench_profiling.yml +``` + +The profiler runs automatically alongside `nat eval`. When the run completes, the output directory contains: + +``` +frontends/benchmarks/deepresearch_bench/results/ +├── all_requests_profiler_traces.json # raw per-event trace data +├── standardized_data_all.csv # flat CSV with NAT metrics +``` + +```{tip} +You can run a small subset of queries first using the `filter.allowlist` to validate the setup before committing to a full dataset run. The 16 question IDs in the config represent a diverse sample across domains and difficulty levels. 
+``` + +--- + +## Cost Analysis + +Running the profiler tells you *what happened*. The tokenomics report tells you *what it cost* — broken down by model, phase (Orchestrator / Planner / Researcher), and external tool API. + +### Why a dedicated cost report? + +LLM token costs alone do not capture the full picture of a research agent run: + +- **Search APIs are a significant cost driver.** In a typical Deep Research Bench run with 5 queries, Tavily advanced search accounts for roughly 95 calls at $0.016/call — around $1.52, or ~30% of the total run cost. +- **Phase attribution is invisible to standard tooling.** The Planner and Researcher subagents run as inline LangGraph graphs inside the orchestrator. Standard observability backends report all LLM calls under a single function name and cannot split cost by phase. +- **Cached tokens are billed at a discount.** Without explicit tracking, you cannot measure cache hit rates or quantify the savings from prompt caching. + +The tokenomics report addresses all three. It reconstructs phase attribution from timing windows in the NAT trace, separately tracks per-tool API charges, and reports cache savings alongside raw token costs. + +### Configuring Pricing + +Keep pricing in a **separate YAML** (for example `configs/config_tokenomics_pricing.yml`) and pass that file to the tokenomics report CLI. + +Declare prices under `tokenomics.pricing`: + +```yaml +tokenomics: + pricing: + models: + "azure/openai/gpt-5.2": + input_per_1m_tokens: 2.50 + output_per_1m_tokens: 10.00 + "nvidia/nemotron-3-nano-30b-a3b": + input_per_1m_tokens: 0.12 + output_per_1m_tokens: 0.50 + cached_input_per_1m_tokens: 0.10 # optional: omit to bill cached tokens at full input rate + tools: + # Key "web_search" matches "advanced_web_search_tool" via substring lookup + "web_search": + cost_per_call: 0.016 + "paper_search": + cost_per_call: 0.0003 + # Fallback for any model not listed above. + # Set to null to raise an error on unknown models instead. 
+ default: + input_per_1m_tokens: 1.00 + output_per_1m_tokens: 4.00 +``` + +You can optionally set `eval.general.output_dir` in that same file so the report’s default output path matches your eval artifacts directory (see `config_tokenomics_pricing.yml` in the bench configs). + +**Model name lookup** uses exact match first, then substring match, then the `default`. A key of `"gpt-5.2"` matches a trace model name of `"azure/openai/gpt-5.2"` because the key is a substring of the full name. + +**Tool name lookup** follows the same rule. A key of `"web_search"` matches `"advanced_web_search_tool"` because `"web_search"` is a substring of the tool name. Unknown tools default to $0 — no error is raised, so you only need to configure tools that have a real per-call cost. + +**`cached_input_per_1m_tokens`** is optional. When omitted, cached tokens are billed at the full input rate (no discount). Set it when your model provider charges a reduced rate for KV-cache hits. + +### Generating the Report + +After `nat eval` completes, run: + +```bash +PYTHONPATH=src python -m aiq_agent.tokenomics.report \ + --trace frontends/benchmarks/deepresearch_bench/results/all_requests_profiler_traces.json \ + --config frontends/benchmarks/deepresearch_bench/configs/config_tokenomics_pricing.yml +``` + +If the pricing YAML sets `eval.general.output_dir`, the report is written there as `tokenomics_report.html` when you omit `--output`. Otherwise it defaults to `/tokenomics_report.html`. + +If `standardized_data_all.csv` is present in the same directory as the trace, it is automatically loaded to enrich the report with NOVA-Predicted-OSL data. + +The output is a **self-contained HTML file** — no server, no dependencies. Open it directly in any browser. + +### Report Tabs + +The report is organized into seven tabs. Each chart includes a subtitle explaining what to look for.
+ +#### Overview + +Top-level stat cards: total cost (LLM + tools), LLM cost, tool API cost, cache savings, prompt/completion token totals, and LLM call count. Below the cards, a per-query summary table and cost breakdown by model and phase. + +Use this tab for a quick health check: if tool API cost is comparable to LLM cost, search frequency is a primary optimization target. + +#### Cost + +| Chart | What it shows | +|-------|---------------| +| Cost Split by Model | Donut chart of budget allocation across models. | +| Cost by Phase | Horizontal bar: Orchestrator / Planner / Researcher. High Researcher share means many parallel search-heavy sub-tasks. | +| Tool API Cost by Tool | Per-tool total cost and call count. Shown as a call-count bar when all tool costs are $0 (pricing not yet configured). | +| Per-Query Cost Distribution | Histogram of query costs. Hidden when fewer than 10 queries are available. A long right tail means a few hard queries are inflating the average. | +| Cost by Phase per Query | Stacked bar: one column per query, one color per phase. Spots outlier queries and identifies which phase drove the spike. | + +#### Latency + +LLM and tool call latency at p50/p90/p99. A large gap between p50 and p99 for LLM calls usually means a few completions with very high output sequence length. Tool p90 above 10 s is a retrieval bottleneck. + +#### Tokens + +The most detailed tab. All statistics are over individual LLM call observations (not per-request aggregates), so percentile distributions are meaningful even for small query sets. + +| Chart | What to look for | +|-------|-----------------| +| ISL p50/p90/p99 by model | Rising p99 vs p50 means some calls hit much larger contexts. | +| OSL p50/p90/p99 by model | High p99 OSL means long reasoning chains or verbose outputs driving latency and cost. | +| Context Accumulation (ISL by call index) | Upward slope = history building up; plateau = caching or fresh-start. Dashed line = estimated system-prompt floor. 
| +| Throughput (TPS by model) | Low TPS with small OSL = network overhead, not slow generation. | +| Token Budget (cache breakdown) | Green = cached (cheaper); grey = uncached; blue = completion. Maximize green. | +| ISL vs Latency scatter | Diagonal trend = prompt-bound; flat cloud = compute-bound. | +| Token Mix by Phase | Which phase consumes tokens and how much is cached per phase. | +| NOVA-Predicted vs Actual OSL | Pre-call output length estimates vs actual. Hidden when estimates are post-hoc filled (trivially perfect, not informative). | + +#### Efficiency + +Latency/cost joint analysis: latency vs cost per query scatter, TPS vs ISL scatter, effective cost per 1K output tokens by model, and a model efficiency bubble chart (x = p90 latency, y = cost/1K output tokens, bubble size = call count). Bottom-left on the bubble chart is the ideal operating point. + +#### Pricing + +Configured input and output prices as bar charts, plus a full LLM pricing table and a tool pricing table. + +#### Per-Query + +Full per-query table: cost, ISL, OSL, cached tokens, ISL:OSL ratio, LLM call count, workflow duration, and the question text. Useful for identifying which specific queries drove unusual cost or latency. + +### Subagent Phase Attribution + +The Deep Research Agent runs three logical parts: an **Orchestrator**, a **Planner**, and one or more parallel **Researcher** instances. The workflow is registered as `deep_research_agent`. NAT profiler traces still include `FUNCTION_START` / `FUNCTION_END` for **tools** (for example search), but Planner and Researcher runs are implemented **inside the `task` tool** and do not get distinct `FUNCTION_*` names. Typical traces also omit per-step metadata such as `function_ancestry` for subagent identity. + +Phase attribution is therefore inferred from **timing windows**: each `task` TOOL_START/END carries `subagent_type` and brackets one subagent invocation. 
Each `LLM_END` uses **`event_timestamp`** (completion time): if it falls inside a task window, that phase applies; otherwise orchestrator. Overlapping researcher windows (parallel invocations) are all labelled `researcher-phase` — the instance is ambiguous, but the phase is correct. + +Cost breakdowns by phase stay accurate without native subagent scopes in NAT. If NAT later exposes phase on each step (for example via `function_ancestry` or explicit `FUNCTION_*` boundaries for subagents), the logic in `src/aiq_agent/tokenomics/nat_adapter.py` can be simplified to read that field instead of joining on timestamps. + +### Python API + +The tokenomics module can also be used programmatically: + +```python +import yaml +from aiq_agent.tokenomics import parse_trace, PricingRegistry + +with open("frontends/benchmarks/deepresearch_bench/configs/config_tokenomics_pricing.yml") as f: + config = yaml.safe_load(f) + +pricing = PricingRegistry.from_dict(config["tokenomics"]["pricing"]) +profiles = parse_trace( + "frontends/benchmarks/deepresearch_bench/results/all_requests_profiler_traces.json", + pricing, +) + +for prof in profiles: + print( + f"Query {prof.request_index}: " + f"${prof.grand_total_cost_usd:.4f} total " + f"(${prof.total_cost_usd:.4f} LLM + ${prof.total_tool_cost_usd:.4f} tools), " + f"{prof.total_prompt_tokens:,} ISL, {prof.total_completion_tokens:,} OSL, " + f"{prof.cache_hit_rate:.1%} cache hit" + ) + for ps in prof.phases: + print(f" {ps.phase} / {ps.model}: {ps.llm_calls} calls, ${ps.cost_usd:.4f}") +``` + +`parse_trace` returns one `RequestProfile` per query. Each profile contains per-phase cost and token totals (`prof.phases`), per-call LLM observations (`prof.llm_call_events`), per-call tool observations (`prof.tool_call_events`), and request-level aggregates. 
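The phase labels behind `prof.phases` come from the timing-window rule described in the phase-attribution section above. The rule itself is small enough to sketch; this is an illustration of the logic, not the actual `nat_adapter.py` implementation:

```python
from dataclasses import dataclass

@dataclass
class TaskWindow:
    phase: str       # "planner-agent" or "researcher-phase"
    start_ts: float  # task TOOL_START event_timestamp
    end_ts: float    # task TOOL_END event_timestamp

def infer_phase(llm_end_ts: float, windows: list[TaskWindow]) -> str:
    """Attribute an LLM call to the first window containing its completion time."""
    for win in windows:
        if win.start_ts <= llm_end_ts <= win.end_ts:
            return win.phase
    return "orchestrator"  # outside every task window

windows = [
    TaskWindow("planner-agent", 10.0, 25.0),
    TaskWindow("researcher-phase", 30.0, 80.0),
    TaskWindow("researcher-phase", 31.5, 78.0),  # parallel researcher, overlapping window
]
infer_phase(12.0, windows)  # "planner-agent"
infer_phase(40.0, windows)  # "researcher-phase" (either overlapping window gives the same label)
infer_phase(90.0, windows)  # "orchestrator"
```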
diff --git a/frontends/benchmarks/deepresearch_bench/configs/config_deep_research_bench_profiling.yml b/frontends/benchmarks/deepresearch_bench/configs/config_deep_research_bench_profiling.yml new file mode 100644 index 00000000..8534cd4a --- /dev/null +++ b/frontends/benchmarks/deepresearch_bench/configs/config_deep_research_bench_profiling.yml @@ -0,0 +1,137 @@ +# Deep Research Bench (DRB) Evaluation Configuration +# Uses Langchain Deep Agents for deep research evaluation + +general: + telemetry: + logging: + console: + _type: console + level: INFO + # tracing: + # phoenix: + # _type: phoenix + # endpoint: http://localhost:6006/v1/traces + # project: dev + # weave: + # _type: weave + # project: "nvidia-aiq/AIQ_v2_deepresearch_bench" + use_uvloop: true + +llms: + nemotron_nano_llm: + _type: nim + model_name: nvidia/nemotron-3-nano-30b-a3b + base_url: "https://integrate.api.nvidia.com/v1" + temperature: 1.0 + top_p: 1.0 + max_tokens: 128000 + max_retries: 10 + timeout: 600 + chat_template_kwargs: + enable_thinking: true + + openai_gpt_5_2: + _type: openai + model_name: "gpt-5.2" + api_key: ${OPENAI_API_KEY} + + gpt_oss_llm: + _type: nim + model_name: openai/gpt-oss-120b + base_url: "https://integrate.api.nvidia.com/v1" + temperature: 1.0 + top_p: 1.0 + max_tokens: 256000 + api_key: ${NVIDIA_API_KEY} + max_retries: 10 + + # Nemotron Super is compatible and tested with AIQ but has limited availability + # on the Build API due to high demand. + # Uncomment nemotron_super_llm below if the endpoint is accessible. 
+ # nemotron_super_llm: + # _type: nim + # model_name: nvidia/nemotron-3-super-120b-a12b + # base_url: "https://integrate.api.nvidia.com/v1" + # temperature: 1.0 + # top_p: 1.0 + # max_tokens: 128000 + # api_key: ${NVIDIA_API_KEY} + # max_retries: 10 + # timeout: 600 + # chat_template_kwargs: + # enable_thinking: true + +functions: + paper_search_tool: + _type: paper_search + max_results: 5 + serper_api_key: ${SERPER_API_KEY} + + advanced_web_search_tool: + _type: tavily_web_search + max_results: 2 + advanced_search: true + + deep_research_agent: + _type: deep_research_agent + orchestrator_llm: openai_gpt_5_2 + # Nemotron Super can trigger ChatNVIDIA AssertionError: duplicate model id in + # available_models (langchain_nvidia_ai_endpoints). Use nano for profiling or + # point nemotron_super_llm at a dedicated NVCF base_url if you need Super. + researcher_llm: nemotron_nano_llm + planner_llm: nemotron_nano_llm + max_loops: 2 + tools: + - paper_search_tool + - advanced_web_search_tool + +workflow: + _type: deep_research_workflow + +eval: + general: + workflow_alias: "aiq-deepresearcher" + output_dir: frontends/benchmarks/deepresearch_bench/results + max_concurrency: 4 + profiler: + # Compute inter-query token uniqueness + token_uniqueness_forecast: true + # Compute expected workflow runtime + workflow_runtime_forecast: true + # Compute inference optimization metrics + compute_llm_metrics: true + # Avoid dumping large text into the output CSV (keeps the file structurally valid) + csv_exclude_io_text: true + # Identify common prompt prefixes + prompt_caching_prefixes: + enable: true + min_frequency: 0.1 + bottleneck_analysis: + # Can also be simple_stack + enable_nested_stack: true + concurrency_spike_analysis: + enable: true + spike_threshold: 7 + # Build a prediction trie for Dynamo routing hints + prediction_trie: + enable: true + # Auto-compute latency sensitivity per LLM call position + auto_sensitivity: true + sensitivity_scale: 5 + # Weights for the three scoring
signals (must sum to 1.0) + w_critical: 0.5 + w_fanout: 0.3 + w_position: 0.2 + # Penalty for LLM calls that run in parallel with longer siblings (default 0.0) + w_parallel: 0.0 + dataset: + _type: json + file_path: frontends/benchmarks/deepresearch_bench/data/drb_full_dataset.json + structure: + question_key: question + answer_key: expected_output + generated_answer_key: generated_answer + filter: + allowlist: + field: + id: [88, 80, 84, 90, 59, 51, 94, 96, 91, 99, 93, 86, 67, 100, 72, 76] diff --git a/frontends/benchmarks/deepresearch_bench/configs/config_tokenomics_pricing.yml b/frontends/benchmarks/deepresearch_bench/configs/config_tokenomics_pricing.yml new file mode 100644 index 00000000..e8b6f9d6 --- /dev/null +++ b/frontends/benchmarks/deepresearch_bench/configs/config_tokenomics_pricing.yml @@ -0,0 +1,33 @@ +# Tokenomics pricing only (not loaded by `nat eval`). +# +# NAT's top-level config schema rejects unknown keys; keep `tokenomics` here and pass +# this file to `python -m aiq_agent.tokenomics.report --config ...`. +# +# Prices are USD per 1 million tokens. Tool costs are USD per invocation. +# +# Optional: mirror `eval.general.output_dir` from your profiling eval config so the +# report defaults to the same folder as the trace when you omit `--output`. + +eval: + general: + output_dir: frontends/benchmarks/deepresearch_bench/results + +tokenomics: + pricing: + models: + "gpt-5.2": + input_per_1m_tokens: 2.50 + output_per_1m_tokens: 10.00 + "nvidia/nemotron-3-nano-30b-a3b": + input_per_1m_tokens: 0.12 + output_per_1m_tokens: 0.50 + cached_input_per_1m_tokens: 0.10 + tools: + # Key "web_search" matches "advanced_web_search_tool" via substring lookup.
+ "web_search": + cost_per_call: 0.016 + "paper_search": + cost_per_call: 0.0003 + default: + input_per_1m_tokens: 1.00 + output_per_1m_tokens: 4.00 diff --git a/src/aiq_agent/tokenomics/README.md b/src/aiq_agent/tokenomics/README.md new file mode 100644 index 00000000..b5f8a193 --- /dev/null +++ b/src/aiq_agent/tokenomics/README.md @@ -0,0 +1,174 @@ +# AIQ Tokenomics + +Post-eval analysis module for the Deep Research Agent. Parses a NAT profiler trace, attributes costs and token counts to workflow phases (Orchestrator / Planner / Researcher), and renders a self-contained interactive HTML report. + +--- + +## Background + +### The subagent attribution problem + +The workflow is registered as `deep_research_agent`, and NAT still emits `FUNCTION_START` / `FUNCTION_END` for **tools** (e.g. search). Planner and Researcher subagents are inline LangGraph graphs inside the **`task`** tool: they do not appear as their own `FUNCTION_*` scopes, and traces from this stack usually have no per-step metadata (such as `function_ancestry`) that identifies subagent phase. + +This module uses **timing-window attribution**: every `task` TOOL_START/END pair brackets one subagent run and carries `subagent_type` in the tool input. Each `LLM_END` is classified using its **`event_timestamp`** (completion time): if it falls inside a task window, that phase applies; otherwise orchestrator. Overlapping researcher windows (parallel invocations) all yield `researcher-phase` — correct phase even when the specific instance is ambiguous. + +--- + +## File structure + +``` +src/aiq_agent/tokenomics/ +├── pricing.py # PricingRegistry — maps model names to per-token prices +├── profile.py # RequestProfile, PhaseStats — structured data classes +├── nat_adapter.py # parse_trace() — NAT JSON → list[RequestProfile] +└── report.py # generate_report() — builds and renders HTML dashboard +``` + +--- + +## Pricing configuration + +Pricing lives in a YAML file under `tokenomics.pricing`. 
Prices are in **USD per 1 million tokens**. + + +```yaml +# frontends/benchmarks/deepresearch_bench/configs/config_tokenomics_pricing.yml + +tokenomics: + pricing: + models: + "openai/gpt-oss-120b": + input_per_1m_tokens: 1.75 + output_per_1m_tokens: 14.00 + "nvidia/nemotron-3-nano-30b-a3b": + input_per_1m_tokens: 0.12 + output_per_1m_tokens: 0.50 + cached_input_per_1m_tokens: 0.06 # optional — defaults to input price if omitted + tools: + # Tool name lookup is substring-based: "web_search" matches "advanced_web_search_tool" + # and "tavily_search" because the key is a substring of those names. + "web_search": + cost_per_call: 0.016 + "paper_search": + cost_per_call: 0.0003 + # Fallback for any model not explicitly listed. + # Set to null to raise an error on unknown models instead. + default: + input_per_1m_tokens: 1.00 + output_per_1m_tokens: 4.00 +``` + +Model name lookup is: exact match → substring match → default. This means a key of `"gpt-oss"` will match a trace model name of `"openai/gpt-oss-120b"`. + +Tool name lookup follows the same substring rule. Unknown tools default to $0/call — no error is raised, so you can configure only the costly tools and omit free internal ones. + +--- + +## Generating a report + +Run after `nat eval` completes. The trace file is written to the `output_dir` configured in the eval config. + +```bash +PYTHONPATH=src python -m aiq_agent.tokenomics.report \ + --trace frontends/benchmarks/deepresearch_bench/results/all_requests_profiler_traces.json \ + --config frontends/benchmarks/deepresearch_bench/configs/config_tokenomics_pricing.yml \ + [--output path/to/report.html] +``` + +If `--output` is omitted, the report is written to `/tokenomics_report.html`. + +If `standardized_data_all.csv` exists in the same directory as the trace, it is automatically loaded to enrich the report with any additional NOVA metadata fields. 
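The exact → substring → default order described in the pricing section can be sketched as a plain function (illustrative only; the real resolver is `PricingRegistry.get` in `pricing.py`):

```python
def resolve_price(name: str, table: dict, default):
    # 1. Exact match on the full name from the trace.
    if name in table:
        return table[name]
    # 2. Substring match: a short key like "gpt-oss" matches
    #    the full trace name "openai/gpt-oss-120b".
    for key, price in table.items():
        if key in name:
            return price
    # 3. Fall back to the default entry (may be None).
    return default

models = {
    "gpt-oss": {"input_per_1m_tokens": 1.75, "output_per_1m_tokens": 14.00},
    "nvidia/nemotron-3-nano-30b-a3b": {"input_per_1m_tokens": 0.12, "output_per_1m_tokens": 0.50},
}
default = {"input_per_1m_tokens": 1.00, "output_per_1m_tokens": 4.00}

price = resolve_price("openai/gpt-oss-120b", models, default)      # substring hit
fallback = resolve_price("some/unlisted-model", models, default)   # default entry
```

Tool lookup follows the same shape, except the fallback is a zero-cost entry rather than an error.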
+ +--- + +## Report tabs + +### 📊 Overview +Top-level stat cards (total cost, cache savings, token totals, LLM call count) plus a per-query summary table and cost split by model and phase. + +### 💰 Cost +- **Cost split by model** — donut chart of budget allocation +- **Cost by phase** — which of Orchestrator / Planner / Researcher drove most spend +- **Cost by phase per query (stacked bar)** — spots outlier queries and which phase drove the spike +- **Per-query cost histogram** — shape of cost distribution (shown only when ≥ 10 queries; wide right tail = high query difficulty variance) + +### ⏱ Latency +- **LLM latency p50/p90/p99 by model** — a large gap between p50 and p99 means occasional very long completions; if p50 is already slow the bottleneck is network or server load +- **Tool latency p50/p90/p99** — search/web tools typically 3–8 s; p90 > 10 s is a retrieval bottleneck + +### 🪙 Tokens +The most detailed tab. All statistics are across individual LLM call observations (not per-request aggregates), so distributions are meaningful even for small query sets. + +| Chart | What to look for | +|-------|-----------------| +| **ISL p50/p90/p99 by model** | Rising p99 vs p50 = some calls hit much larger contexts | +| **OSL p50/p90/p99 by model** | High p99 OSL = long reasoning chains or verbose outputs driving latency and cost | +| **Context accumulation (ISL by call index)** | Upward slope = history building up; plateau = caching or fresh-start; dashed line = estimated system-prompt floor | +| **Throughput (TPS by model)** | Low TPS with small OSL = network overhead, not slow generation | +| **Token budget (cache breakdown)** | Green = cached (cheaper); grey = uncached; blue = completion. Maximise green. 
| +| **ISL vs latency scatter** | Diagonal trend = prompt-bound; flat cloud = compute-bound | +| **Token mix by phase** | Which phase consumes tokens and how much is cached per phase | +| **Predicted vs Actual OSL** | Shown only when `NOVA-Predicted-OSL` contains real pre-call estimates (hidden when post-hoc filled) | + +### 📐 Efficiency +Latency/cost joint analysis: + +| Chart | What to look for | +|-------|-----------------| +| **Latency vs cost per query** | Top-right outliers are slow *and* expensive — highest-priority targets for optimization | +| **TPS vs ISL** | Downward slope = prompt-bound inference; KV-cache optimizations would help | +| **Effective cost per 1K output tokens** | True output cost after accounting for actual generation volume | +| **Model efficiency bubble** | Each bubble = one model; bottom-left = cheapest + fastest. Use for model selection trade-off analysis. | + +### 🏷 Pricing +Configured prices visualised as bar charts (input and output $/1M) plus a full pricing table. + +### 📋 Per-Query +Full per-query table: cost, ISL, OSL, cached tokens, ISL:OSL ratio, LLM call count, workflow duration, and the question text. 
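The p50/p90/p99 figures in the Latency and Tokens tabs are plain percentiles over these per-call observations. A short sketch using nearest-rank percentiles (the report's exact interpolation method is not documented here, so treat the helper as illustrative):

```python
def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile over raw per-call observations."""
    ordered = sorted(values)
    idx = round(pct / 100 * (len(ordered) - 1))
    return ordered[min(max(idx, 0), len(ordered) - 1)]

# Per-call LLM latencies (seconds) with one pathological completion:
latencies = [1.2, 1.4, 1.5, 1.6, 1.9, 2.1, 2.3, 4.8, 9.7, 31.0]
p50, p90, p99 = (percentile(latencies, p) for p in (50, 90, 99))
# A p99 far above p50, as here, points at a few very long completions
# rather than a uniformly slow model.
```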
+ +--- + +## Python API + +```python +from aiq_agent.tokenomics import parse_trace, PricingRegistry + +# Load pricing from the tokenomics YAML (not the nat eval config) +import yaml +with open("frontends/benchmarks/deepresearch_bench/configs/config_tokenomics_pricing.yml") as f: + config = yaml.safe_load(f) +pricing = PricingRegistry.from_dict(config["tokenomics"]["pricing"]) + +# Parse trace → one RequestProfile per query +profiles = parse_trace("results/all_requests_profiler_traces.json", pricing) + +for prof in profiles: + print(f"Query {prof.request_index}: ${prof.total_cost_usd:.4f}, " + f"{prof.total_prompt_tokens:,} ISL, {prof.total_completion_tokens:,} OSL, " + f"{prof.cache_hit_rate:.1%} cache hit") + + for ps in prof.phases: + print(f" {ps.phase} / {ps.model}: {ps.llm_calls} calls, ${ps.cost_usd:.4f}") +``` + +### Key data classes + +**`RequestProfile`** — one per query +- `request_index`, `question`, `duration_s` +- `total_cost_usd`, `total_cache_savings_usd` +- `total_prompt_tokens`, `total_cached_tokens`, `total_completion_tokens` +- `phases: list[PhaseStats]` — per `(phase, model)` pair +- `tool_calls: dict[str, int]` — tool name → invocation count +- `llm_call_events: list[dict]` — per-call observations with `isl`, `osl`, `cached`, `dur_s`, `tps`, `model`, `phase`, `call_idx`, `uuid` +- `tool_call_events: list[dict]` — per-call observations with `tool`, `dur_s` + +**`PhaseStats`** — one per `(phase, model)` pair within a request +- `phase` — one of `"orchestrator"`, `"planner-agent"`, `"researcher-phase"` +- `model`, `llm_calls`, `prompt_tokens`, `cached_tokens`, `completion_tokens` +- `cost_usd`, `cache_savings_usd` +- Properties: `cache_hit_rate`, `uncached_tokens`, `total_tokens` + +**`PricingRegistry`** +- `PricingRegistry.from_dict(raw_dict)` — construct from the `tokenomics.pricing` config dict +- `registry.get(model_name) -> ModelPrice` — exact → substring → default lookup +- `registry.known_models() -> list[str]` diff --git 
a/src/aiq_agent/tokenomics/__init__.py b/src/aiq_agent/tokenomics/__init__.py new file mode 100644 index 00000000..13188cdd --- /dev/null +++ b/src/aiq_agent/tokenomics/__init__.py @@ -0,0 +1,30 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +from .nat_adapter import parse_trace +from .pricing import ModelPrice +from .pricing import ModelPriceConfig +from .pricing import PricingRegistry +from .pricing import PricingRegistryConfig +from .pricing import ToolPrice +from .pricing import ToolPriceConfig +from .profile import PHASE_ORCHESTRATOR +from .profile import PHASE_PLANNER +from .profile import PHASE_RESEARCHER +from .profile import PhaseStats +from .profile import RequestProfile + +__all__ = [ + "parse_trace", + "ModelPrice", + "ModelPriceConfig", + "PricingRegistry", + "PricingRegistryConfig", + "ToolPrice", + "ToolPriceConfig", + "PHASE_ORCHESTRATOR", + "PHASE_PLANNER", + "PHASE_RESEARCHER", + "PhaseStats", + "RequestProfile", +] diff --git a/src/aiq_agent/tokenomics/nat_adapter.py b/src/aiq_agent/tokenomics/nat_adapter.py new file mode 100644 index 00000000..1cc1c043 --- /dev/null +++ b/src/aiq_agent/tokenomics/nat_adapter.py @@ -0,0 +1,306 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +""" +NAT trace → list[RequestProfile] +================================= + +Converts a NAT profiler trace JSON file (produced by ``nat eval``) into +structured :class:`~aiq_agent.tokenomics.profile.RequestProfile` objects +ready for the tokenomics HTML report. + +Architecture note +----------------- +The workflow is registered as ``deep_research_agent``. NAT 1.5.0 traces still +emit ``FUNCTION_START`` / ``FUNCTION_END`` for **tools** (e.g. 
search helpers), +but **planner-agent** and **researcher-agent** runs live inside the ``task`` +tool: they do not get distinct ``FUNCTION_*`` names. Traces from this stack +typically have no per-step ``function_ancestry`` (or equivalent) carrying +subagent identity — calling ``subagent.ainvoke()`` does not surface as separate +NAT function scopes for Planner vs Researcher. + +Subagent attribution is therefore inferred post-hoc via timing windows: every +``task`` TOOL_START/END pair brackets one subagent invocation and carries +``subagent_type`` in its input. For each ``LLM_END`` we use that step's +``event_timestamp`` (completion time, not ``span_event_timestamp``): if it +lies inside a task window, the call is attributed to that phase; otherwise +**orchestrator-phase**. + +``_build_task_windows`` appends windows in ``task`` TOOL_END order. +``_infer_phase`` returns the **first** window in that list whose bounds contain +``ts``. Overlapping researcher windows share the same phase label, so order is +unimportant in the common parallel-researcher case. + +If NAT later attaches subagent phase directly on each step (e.g. +``function_ancestry`` or explicit ``FUNCTION_*`` scopes for subagents), +``_infer_phase`` can be replaced with a field read and the rest of this module +can stay the same. 
+""" + +from __future__ import annotations + +import ast +import json +import logging +from dataclasses import dataclass +from dataclasses import field +from typing import Any + +from .pricing import PricingRegistry +from .profile import PHASE_ORCHESTRATOR +from .profile import PHASE_PLANNER +from .profile import PHASE_RESEARCHER +from .profile import PhaseStats +from .profile import RequestProfile + +logger = logging.getLogger(__name__) + + +# --------------------------------------------------------------------------- +# Internal helpers +# --------------------------------------------------------------------------- + + +@dataclass +class _TaskWindow: + """Time span of a single subagent (task tool) invocation.""" + + uuid: str + subagent_type: str # "planner-agent" | "researcher-agent" + start_ts: float + end_ts: float = field(default=0.0) + + @property + def phase(self) -> str: + if self.subagent_type == "planner-agent": + return PHASE_PLANNER + return PHASE_RESEARCHER # any other subagent_type → researcher-phase + + +def _extract_subagent_type(raw_input: Any) -> str | None: + """Pull subagent_type out of the task tool's input field.""" + if isinstance(raw_input, dict): + return raw_input.get("subagent_type") + if isinstance(raw_input, str): + # NAT stores tool inputs as Python-repr strings, not JSON + try: + parsed = ast.literal_eval(raw_input) + if isinstance(parsed, dict): + return parsed.get("subagent_type") + except Exception: + pass + # Last resort: substring scan (handles malformed reprs) + for candidate in ("planner-agent", "researcher-agent"): + if candidate in raw_input: + return candidate + return None + + +def _build_task_windows(steps: list[dict]) -> list[_TaskWindow]: + """Build a list of completed task-tool windows from a request's steps.""" + open_windows: dict[str, _TaskWindow] = {} + closed: list[_TaskWindow] = [] + + for step in steps: + payload = step["payload"] + event_type = payload["event_type"] + name = payload.get("name", "") + uuid = 
payload["UUID"] + ts = payload["event_timestamp"] + + if event_type == "TOOL_START" and name == "task": + raw_input = (payload.get("data") or {}).get("input") + subagent_type = _extract_subagent_type(raw_input) + if subagent_type: + open_windows[uuid] = _TaskWindow(uuid=uuid, subagent_type=subagent_type, start_ts=ts) + else: + logger.debug("task TOOL_START missing subagent_type, uuid=%s", uuid) + + elif event_type == "TOOL_END" and name == "task": + win = open_windows.pop(uuid, None) + if win is not None: + win.end_ts = ts + closed.append(win) + + if open_windows: + logger.warning("%d task windows never closed (truncated trace?)", len(open_windows)) + + return closed + + +def _infer_phase(ts: float, windows: list[_TaskWindow]) -> str: + """ + Return the phase label for an LLM call from its ``LLM_END`` time ``ts``. + + ``windows`` is ordered by ``task`` TOOL_END (see ``_build_task_windows``). + The first window with ``start_ts <= ts <= end_ts`` wins. Overlapping + researcher windows all map to ``researcher-phase`` anyway. 
+ """ + for win in windows: + if win.start_ts <= ts <= win.end_ts: + return win.phase + return PHASE_ORCHESTRATOR + + +def _parse_request(request_index: int, steps: list[dict], pricing: PricingRegistry) -> RequestProfile: + """Convert one request's step list into a RequestProfile.""" + + # --- Workflow timing and question --- + wf_start_ts = wf_end_ts = 0.0 + question = "" + for step in steps: + payload = step["payload"] + et = payload["event_type"] + if et == "WORKFLOW_START": + wf_start_ts = payload["event_timestamp"] + question = (payload.get("data") or {}).get("input") or "" + elif et == "WORKFLOW_END": + wf_end_ts = payload["event_timestamp"] + + duration_s = max(0.0, wf_end_ts - wf_start_ts) + + # --- Subagent phase windows --- + task_windows = _build_task_windows(steps) + + # --- Single forward pass: accumulate all events --- + phase_model_stats: dict[tuple[str, str], PhaseStats] = {} + model_call_counters: dict[str, int] = {} + llm_call_events: list[dict] = [] + tool_call_events: list[dict] = [] + tool_calls: dict[str, int] = {} + tool_start_times: dict[str, tuple[str, float]] = {} # uuid -> (name, start_ts) + + for step in steps: + payload = step["payload"] + et = payload["event_type"] + uuid = payload["UUID"] + ts = payload["event_timestamp"] + + if et == "TOOL_START": + name = payload.get("name") or "unknown" + tool_start_times[uuid] = (name, ts) + + elif et == "TOOL_END": + name = payload.get("name") or "unknown" + tool_calls[name] = tool_calls.get(name, 0) + 1 + dur_s = 0.0 + if uuid in tool_start_times: + _, start_ts = tool_start_times.pop(uuid) + dur_s = max(0.0, ts - start_ts) + tool_price = pricing.get_tool(name) + tool_call_events.append( + { + "tool": name, + "dur_s": round(dur_s, 3), + "cost_usd": tool_price.cost_per_call, + } + ) + + elif et == "LLM_END": + # span_event_timestamp is set by LangchainProfilerHandler at LLM_START + span_ts = payload.get("span_event_timestamp", ts) + model = payload.get("name") or "unknown" + usage = 
(payload.get("usage_info") or {}).get("token_usage") or {} + + prompt_tokens = usage.get("prompt_tokens", 0) + cached_tokens = usage.get("cached_tokens", 0) + completion_tokens = usage.get("completion_tokens", 0) + reasoning_tokens = usage.get("reasoning_tokens", 0) + + dur_s = max(0.0, ts - span_ts) + tps = completion_tokens / dur_s if dur_s > 0 else 0.0 + + # Window match uses LLM_END event_timestamp (completion), not span_event_timestamp. + phase = _infer_phase(ts, task_windows) + key = (phase, model) + + if key not in phase_model_stats: + phase_model_stats[key] = PhaseStats(phase=phase, model=model) + + try: + price = pricing.get(model) + cost = price.cost(prompt_tokens, cached_tokens, completion_tokens) + savings = price.cache_savings(cached_tokens) + except KeyError: + logger.warning("No price for model %r — cost will be 0", model) + cost = savings = 0.0 + + ps = phase_model_stats[key] + ps.llm_calls += 1 + ps.prompt_tokens += prompt_tokens + ps.cached_tokens += cached_tokens + ps.completion_tokens += completion_tokens + ps.cost_usd += cost + ps.cache_savings_usd += savings + + # Per-call observation (for distribution charts) + call_idx = model_call_counters.get(model, 0) + model_call_counters[model] = call_idx + 1 + + llm_call_events.append( + { + "uuid": uuid, + "isl": prompt_tokens, + "osl": completion_tokens, + "cached": cached_tokens, + "reasoning": reasoning_tokens, + "dur_s": round(dur_s, 3), + "tps": round(tps, 2), + "model": model, + "phase": phase, + "call_idx": call_idx, + } + ) + + # --- Roll up to request-level totals --- + phases = list(phase_model_stats.values()) + total_tool_cost_usd = sum(ev["cost_usd"] for ev in tool_call_events) + return RequestProfile( + request_index=request_index, + question=question, + duration_s=duration_s, + phases=phases, + tool_calls=tool_calls, + llm_call_events=llm_call_events, + tool_call_events=tool_call_events, + total_llm_calls=sum(p.llm_calls for p in phases), + total_prompt_tokens=sum(p.prompt_tokens for p 
in phases), + total_cached_tokens=sum(p.cached_tokens for p in phases), + total_completion_tokens=sum(p.completion_tokens for p in phases), + total_cost_usd=sum(p.cost_usd for p in phases), + total_tool_cost_usd=total_tool_cost_usd, + total_cache_savings_usd=sum(p.cache_savings_usd for p in phases), + ) + + +# --------------------------------------------------------------------------- +# Public API +# --------------------------------------------------------------------------- + + +def parse_trace(path: str, pricing: PricingRegistry) -> list[RequestProfile]: + """ + Parse a NAT profiler trace JSON file and return one + :class:`~aiq_agent.tokenomics.profile.RequestProfile` per request. + + Parameters + ---------- + path: + Path to the ``all_requests_profiler_traces.json`` file produced by + ``nat eval``. + pricing: + A :class:`~aiq_agent.tokenomics.pricing.PricingRegistry` built from + the ``tokenomics.pricing`` section of the eval config YAML. + """ + with open(path) as f: + data = json.load(f) + + profiles = [] + for item in data: + idx = item.get("request_number", len(profiles)) + steps = item.get("intermediate_steps", []) + try: + profiles.append(_parse_request(idx, steps, pricing)) + except Exception: + logger.exception("Failed to parse request %d — skipping", idx) + + return profiles diff --git a/src/aiq_agent/tokenomics/pricing.py b/src/aiq_agent/tokenomics/pricing.py new file mode 100644 index 00000000..0f143dc5 --- /dev/null +++ b/src/aiq_agent/tokenomics/pricing.py @@ -0,0 +1,181 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +from __future__ import annotations + +from dataclasses import dataclass +from typing import Any + +from pydantic import BaseModel +from pydantic import Field + +# --------------------------------------------------------------------------- +# Config models (Pydantic) — deserialised from YAML +# --------------------------------------------------------------------------- + + +class ModelPriceConfig(BaseModel): + """Per-token prices for one model, in USD per 1 M tokens.""" + + input_per_1m_tokens: float + output_per_1m_tokens: float + # Optional — if omitted, cached tokens are billed at the full input rate + # (i.e. no caching discount). + cached_input_per_1m_tokens: float | None = None + + +class ToolPriceConfig(BaseModel): + """Per-call price for one tool (e.g. a search API).""" + + cost_per_call: float = 0.0 + + +class PricingRegistryConfig(BaseModel): + """ + Pricing table read from the ``tokenomics.pricing`` section of the eval + config YAML. ``models`` is keyed by the exact model name that appears in + NAT traces (e.g. ``"azure/openai/gpt-5.2"``). ``default`` is used as a + fallback when no model key matches. ``tools`` is keyed by tool name as it + appears in the trace. 
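    For reference, a pricing section of the shape this model deserialises
    might look like the following (the model name echoes the example above;
    all prices and the ``web_search`` tool name are illustrative, not real
    rates):

    ```yaml
    tokenomics:
      pricing:
        models:
          "azure/openai/gpt-5.2":        # must match the name seen in traces
            input_per_1m_tokens: 1.25
            output_per_1m_tokens: 10.0
            cached_input_per_1m_tokens: 0.125
        tools:
          web_search:                    # hypothetical tool name
            cost_per_call: 0.005
        default:                         # fallback when no model key matches
          input_per_1m_tokens: 1.0
          output_per_1m_tokens: 4.0
    ```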
+ """ + + models: dict[str, ModelPriceConfig] = Field(default_factory=dict) + tools: dict[str, ToolPriceConfig] = Field(default_factory=dict) + default: ModelPriceConfig | None = None + + +# --------------------------------------------------------------------------- +# Runtime objects +# --------------------------------------------------------------------------- + + +@dataclass(frozen=True) +class ModelPrice: + """Resolved per-token prices for a single model.""" + + input_per_1m_tokens: float + cached_input_per_1m_tokens: float + output_per_1m_tokens: float + + def cost(self, prompt_tokens: int, cached_tokens: int, completion_tokens: int) -> float: + """Return USD cost for one LLM call.""" + uncached = max(0, prompt_tokens - cached_tokens) + return ( + uncached * self.input_per_1m_tokens + + cached_tokens * self.cached_input_per_1m_tokens + + completion_tokens * self.output_per_1m_tokens + ) / 1_000_000 + + def cache_savings(self, cached_tokens: int) -> float: + """USD saved vs. paying full input price for cached tokens.""" + return cached_tokens * (self.input_per_1m_tokens - self.cached_input_per_1m_tokens) / 1_000_000 + + +@dataclass(frozen=True) +class ToolPrice: + """Resolved per-call price for a single tool.""" + + cost_per_call: float = 0.0 + + +class PricingRegistry: + """ + Maps model names to :class:`ModelPrice` objects and tool names to + :class:`ToolPrice` objects. + + Model lookup order: + 1. Exact match on ``model_name``. + 2. Substring match — useful for versioned or provider-prefixed names + (e.g. ``"azure/openai/gpt-5.2"`` matches key ``"gpt-5.2"``). + 3. ``default`` price, if configured. + 4. :class:`KeyError`. + + Tool lookup order: + 1. Exact match on ``tool_name``. + 2. Substring match (key in name, or name in key). + 3. Zero-cost default (tool costs are optional — no KeyError raised). 
+ """ + + def __init__( + self, + prices: dict[str, ModelPrice], + default: ModelPrice | None = None, + tools: dict[str, ToolPrice] | None = None, + ): + self._prices = prices + self._default = default + self._tools: dict[str, ToolPrice] = tools or {} + + # ------------------------------------------------------------------ + # Construction + # ------------------------------------------------------------------ + + @classmethod + def from_config(cls, config: PricingRegistryConfig) -> PricingRegistry: + prices = {} + for name, cfg in config.models.items(): + cached = cfg.cached_input_per_1m_tokens + if cached is None: + cached = cfg.input_per_1m_tokens + prices[name] = ModelPrice( + input_per_1m_tokens=cfg.input_per_1m_tokens, + cached_input_per_1m_tokens=cached, + output_per_1m_tokens=cfg.output_per_1m_tokens, + ) + + default = None + if config.default is not None: + cached = config.default.cached_input_per_1m_tokens + if cached is None: + cached = config.default.input_per_1m_tokens + default = ModelPrice( + input_per_1m_tokens=config.default.input_per_1m_tokens, + cached_input_per_1m_tokens=cached, + output_per_1m_tokens=config.default.output_per_1m_tokens, + ) + + tools = {name: ToolPrice(cost_per_call=cfg.cost_per_call) for name, cfg in config.tools.items()} + + return cls(prices, default, tools) + + @classmethod + def from_dict(cls, raw: dict[str, Any]) -> PricingRegistry: + """Convenience constructor: pass the raw ``tokenomics.pricing`` dict.""" + return cls.from_config(PricingRegistryConfig(**raw)) + + # ------------------------------------------------------------------ + # Lookup + # ------------------------------------------------------------------ + + def get(self, model_name: str) -> ModelPrice: + if model_name in self._prices: + return self._prices[model_name] + for key, price in self._prices.items(): + if key in model_name or model_name in key: + return price + if self._default is not None: + return self._default + raise KeyError( + f"No price configured 
for model {model_name!r}. " + "Add it to tokenomics.pricing.models in the config file, " + "or set tokenomics.pricing.default." + ) + + def get_tool(self, tool_name: str) -> ToolPrice: + """Return the :class:`ToolPrice` for ``tool_name``. + + Never raises — returns a zero-cost :class:`ToolPrice` if no match is + found, so unconfigured tools simply contribute $0. + """ + if tool_name in self._tools: + return self._tools[tool_name] + for key, price in self._tools.items(): + if key in tool_name or tool_name in key: + return price + return ToolPrice(cost_per_call=0.0) + + def known_models(self) -> list[str]: + return list(self._prices) + + def known_tools(self) -> list[str]: + return list(self._tools) diff --git a/src/aiq_agent/tokenomics/profile.py b/src/aiq_agent/tokenomics/profile.py new file mode 100644 index 00000000..89c2cb66 --- /dev/null +++ b/src/aiq_agent/tokenomics/profile.py @@ -0,0 +1,111 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +from __future__ import annotations + +from dataclasses import dataclass +from dataclasses import field + +# Canonical phase names produced by nat_adapter. +PHASE_ORCHESTRATOR = "orchestrator" +PHASE_PLANNER = "planner-agent" +PHASE_RESEARCHER = "researcher-phase" + +PHASE_ORDER = (PHASE_ORCHESTRATOR, PHASE_PLANNER, PHASE_RESEARCHER) + + +@dataclass +class PhaseStats: + """ + Token and cost totals for one (phase, model) combination within a single + workflow run. Multiple models can contribute to the same phase (e.g. + if the orchestrator LLM is swapped mid-run), so the primary grouping key + is ``(phase, model)``. 
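    A minimal sketch of the ``(phase, model)`` grouping (synthetic event
    dicts; real entries come from the NAT adapter):

    ```python
    from collections import defaultdict

    # Two models contribute to the same phase, so aggregation keys on
    # (phase, model) rather than on phase alone.
    events = [
        {"phase": "researcher-phase", "model": "model-a", "completion_tokens": 100},
        {"phase": "researcher-phase", "model": "model-b", "completion_tokens": 50},
        {"phase": "researcher-phase", "model": "model-a", "completion_tokens": 30},
    ]
    stats: dict = defaultdict(lambda: {"llm_calls": 0, "completion_tokens": 0})
    for ev in events:
        s = stats[(ev["phase"], ev["model"])]
        s["llm_calls"] += 1
        s["completion_tokens"] += ev["completion_tokens"]

    print(len(stats))  # one entry per (phase, model) pair
    print(stats[("researcher-phase", "model-a")]["completion_tokens"])
    ```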
+ """ + + phase: str + model: str + llm_calls: int = 0 + prompt_tokens: int = 0 + cached_tokens: int = 0 + completion_tokens: int = 0 + cost_usd: float = 0.0 + cache_savings_usd: float = 0.0 + + @property + def uncached_tokens(self) -> int: + return max(0, self.prompt_tokens - self.cached_tokens) + + @property + def cache_hit_rate(self) -> float: + return self.cached_tokens / self.prompt_tokens if self.prompt_tokens else 0.0 + + @property + def total_tokens(self) -> int: + return self.prompt_tokens + self.completion_tokens + + +@dataclass +class RequestProfile: + """ + All tokenomics data for a single workflow run, pre-aggregated along the + dimensions needed by the tokenomics HTML report. + """ + + request_index: int + question: str + duration_s: float + + # Aggregates across all phases + total_cost_usd: float = 0.0 + total_tool_cost_usd: float = 0.0 + total_prompt_tokens: int = 0 + total_cached_tokens: int = 0 + total_completion_tokens: int = 0 + total_cache_savings_usd: float = 0.0 + total_llm_calls: int = 0 + + # One entry per (phase, model) pair — populated by nat_adapter + phases: list[PhaseStats] = field(default_factory=list) + + # tool_name → invocation count + tool_calls: dict[str, int] = field(default_factory=dict) + + # Individual LLM call observations (one dict per LLM_END event): + # keys: isl, osl, cached, reasoning, dur_s, tps, model, phase, call_idx + llm_call_events: list[dict] = field(default_factory=list) + + # Individual tool call observations (one dict per TOOL_END event): + # keys: tool, dur_s + tool_call_events: list[dict] = field(default_factory=list) + + # ------------------------------------------------------------------ + # Derived properties + # ------------------------------------------------------------------ + + @property + def grand_total_cost_usd(self) -> float: + """LLM token costs + tool API call costs combined.""" + return self.total_cost_usd + self.total_tool_cost_usd + + @property + def cache_hit_rate(self) -> float: + return 
self.total_cached_tokens / self.total_prompt_tokens if self.total_prompt_tokens else 0.0 + + @property + def total_tool_calls(self) -> int: + return sum(self.tool_calls.values()) + + def phases_for(self, phase: str) -> list[PhaseStats]: + """Return all PhaseStats entries matching ``phase`` (may span models).""" + return [p for p in self.phases if p.phase == phase] + + def cost_for_phase(self, phase: str) -> float: + return sum(p.cost_usd for p in self.phases_for(phase)) + + def tokens_for_phase(self, phase: str) -> tuple[int, int, int]: + """Return (prompt, cached, completion) totals for a phase.""" + prompt = sum(p.prompt_tokens for p in self.phases_for(phase)) + cached = sum(p.cached_tokens for p in self.phases_for(phase)) + completion = sum(p.completion_tokens for p in self.phases_for(phase)) + return prompt, cached, completion diff --git a/src/aiq_agent/tokenomics/report/__init__.py b/src/aiq_agent/tokenomics/report/__init__.py new file mode 100644 index 00000000..ad286357 --- /dev/null +++ b/src/aiq_agent/tokenomics/report/__init__.py @@ -0,0 +1,120 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +""" +Generate a self-contained tokenomics HTML report from a NAT profiler trace. 
+
+Single-run
+----------
+python -m aiq_agent.tokenomics.report \\
+    --trace results/all_requests_profiler_traces.json \\
+    --config configs/config_tokenomics_pricing.yml \\
+    [--output results/tokenomics_report.html]
+
+Comparison (two or more runs)
+------------------------------
+python -m aiq_agent.tokenomics.report \\
+    --trace results/run_a/all_requests_profiler_traces.json \\
+    --trace results/run_b/all_requests_profiler_traces.json \\
+    --config configs/config_tokenomics_pricing.yml
+
+Passing ``--trace`` more than once activates comparison mode: every tab
+(Overview, Cost, Latency, Tokens, Efficiency, Per-Query) shows A-vs-B
+comparison charts instead of single-run visualisations.
+"""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+import yaml
+
+from ..nat_adapter import parse_trace
+from ..pricing import PricingRegistry
+from ._report_builders import _build_comparison_data
+from ._report_builders import _build_report_data
+from ._report_stats import _load_csv_predictions
+from ._report_template_comparison import render_html as _render_comparison
+from ._report_template_single import render_html as _render_single
+
+
+def generate_report(
+    trace_path: str | list[str],
+    config_path: str,
+    output_path: str | None = None,
+) -> str:
+    """Generate a tokenomics HTML report.
+
+    Parameters
+    ----------
+    trace_path:
+        Path to a single ``all_requests_profiler_traces.json``, or a list of
+        paths for comparison mode. When more than one path is provided every
+        tab (Overview, Cost, Latency, Tokens, Efficiency, Per-Query) shows
+        A-vs-B comparison charts instead of single-run visualisations.
+    config_path:
+        Path to the eval config YAML (provides pricing).
+    output_path:
+        Destination HTML file. Defaults to ``<output_dir>/tokenomics_report.html``.
+ """ + if isinstance(trace_path, str): + trace_paths: list[str] = [trace_path] + else: + trace_paths = list(trace_path) + + with open(config_path) as f: + config = yaml.safe_load(f) + + pricing_raw = (config.get("tokenomics") or {}).get("pricing") or {} + pricing = PricingRegistry.from_dict(pricing_raw) + + run_datas: list[dict] = [] + for tp in trace_paths: + profiles = parse_trace(tp, pricing) + if not profiles: + print(f"WARNING: no request profiles parsed — check {tp}", file=sys.stderr) + + predicted_osl_map = _load_csv_predictions(tp) + if predicted_osl_map: + print(f"Loaded {len(predicted_osl_map)} NOVA-Predicted-OSL values from {Path(tp).name}.") + + rd = _build_report_data(profiles, pricing, config_path, predicted_osl_map) + # In multi-run mode use the trace's parent directory name as the run label + # so the comparison tab can distinguish "run_a" from "run_b". + if len(trace_paths) > 1: + rd["label"] = Path(tp).parent.name or Path(tp).stem + run_datas.append(rd) + + primary = run_datas[0] + if len(run_datas) >= 2: + cmp = _build_comparison_data(run_datas) + primary["comparison"] = cmp + print( + f"Comparison mode: " + f"Run A ({cmp['label_a']}) = {cmp['num_queries_a']} queries, " + f"Run B ({cmp['label_b']}) = {cmp['num_queries_b']} queries, " + f"{cmp['num_common_queries']} aligned by query ID." 
+ ) + if cmp["num_common_queries"] == 0: + print( + " WARNING: no overlapping query IDs — per-query deltas will be empty.\n" + " Make sure both runs use the same filter.allowlist.", + file=sys.stderr, + ) + else: + primary["comparison"] = None + + html = _render_comparison(primary) if len(run_datas) >= 2 else _render_single(primary) + + if output_path is None: + output_dir = (config.get("eval") or {}).get("general", {}).get("output_dir") + if output_dir: + output_path = str(Path(output_dir) / "tokenomics_report.html") + else: + output_path = str(Path(trace_paths[0]).parent / "tokenomics_report.html") + + with open(output_path, "w", encoding="utf-8") as fh: + fh.write(html) + + print(f"Report written → {output_path}") + return output_path diff --git a/src/aiq_agent/tokenomics/report/__main__.py b/src/aiq_agent/tokenomics/report/__main__.py new file mode 100644 index 00000000..a8510759 --- /dev/null +++ b/src/aiq_agent/tokenomics/report/__main__.py @@ -0,0 +1,30 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +"""CLI entry point: python -m aiq_agent.tokenomics.report""" + +from __future__ import annotations + +import argparse + +from . import generate_report + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Generate a tokenomics HTML report from a NAT profiler trace.") + parser.add_argument( + "--trace", + required=True, + action="append", + metavar="TRACE", + help=( + "Path to all_requests_profiler_traces.json. " + "Repeat the flag to compare multiple runs (e.g. --trace run_a/traces.json --trace run_b/traces.json)." 
+ ), + ) + parser.add_argument("--config", required=True, help="Path to the eval config YAML") + parser.add_argument( + "--output", + default=None, + help="Output HTML path (default: /tokenomics_report.html)", + ) + args = parser.parse_args() + generate_report(args.trace, args.config, args.output) diff --git a/src/aiq_agent/tokenomics/report/_report_base.py b/src/aiq_agent/tokenomics/report/_report_base.py new file mode 100644 index 00000000..03cc6d7e --- /dev/null +++ b/src/aiq_agent/tokenomics/report/_report_base.py @@ -0,0 +1,202 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +"""Shared HTML/CSS/JS constants reused by both single-run and comparison report templates.""" + +from __future__ import annotations + +_CSS = """""" + +_JS_LAYOUT_GLOBALS = """ +// ── layout defaults ─────────────────────────────────────────────────────────── +const LAYOUT_BASE = { + paper_bgcolor: '#161b22', + plot_bgcolor: '#161b22', + font: { color: '#e6edf3', size: 12, family: '-apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif' }, + margin: { t: 30, r: 20, b: 50, l: 60 }, + colorway: ['#58a6ff','#3fb950','#d29922','#bc8cff','#f85149','#39d353','#76b900','#ff7b72','#ffa657'], + xaxis: { gridcolor: '#30363d', zerolinecolor: '#30363d' }, + yaxis: { gridcolor: '#30363d', zerolinecolor: '#30363d' }, + legend: { bgcolor: 'rgba(0,0,0,0)', bordercolor: '#30363d' }, +}; +const CFG = { responsive: true, displayModeBar: false }; +function L(extra) { return Object.assign({}, LAYOUT_BASE, extra); } + +// ── helpers ─────────────────────────────────────────────────────────────────── +function fmtK(v) { + v = +v; + return v >= 1e6 ? (v/1e6).toFixed(2)+'M' : v >= 1e3 ? (v/1e3).toFixed(1)+'k' : String(Math.round(v)); +} +function fmt$(v, d=4) { return v == null ? 
'N/A' : '$' + (+v).toFixed(d); } + +const PALETTE = ['#58a6ff','#3fb950','#d29922','#bc8cff','#f85149','#39d353','#76b900','#ff7b72','#ffa657']; +const PHASE_COLORS = { Orchestrator: '#58a6ff', Planner: '#bc8cff', Researcher: '#3fb950' }; +""" + +_JS_TAB_SWITCHER = """ +// ── tab switching ───────────────────────────────────────────────────────────── +let _rendered = {}; +function showTab(id, btn) { + document.querySelectorAll('.tab-content').forEach(el => el.classList.remove('active')); + document.querySelectorAll('nav button').forEach(el => el.classList.remove('active')); + document.getElementById('tab-' + id).classList.add('active'); + btn.classList.add('active'); + renderTab(id); +} +function renderTab(id) { + if (_rendered[id]) return; + _rendered[id] = true; + if (id === 'overview') renderOverview(); + if (id === 'cost') renderCost(); + if (id === 'latency') renderLatency(); + if (id === 'tokens') renderTokens(); + if (id === 'efficiency') renderEfficiency(); + if (id === 'detail') renderDetail(); +} +""" + +_HTML_TOP = r""" + + + + +""" + +_HTML_MID = r""" + +""" + +_HTML_AFTER_CSS = r""" + + + +
+

⚡ AIQ Tokenomics Report

+ +
+ + + +
+""" + +_JS_HEADER = r""" +
+ + + + +""" + + +def build_html( + title: str, + tab_html: str, + js_data_extras: str, + js_extra_globals: str, + js_init: str, + js_renders: str, +) -> str: + """Assemble a complete self-contained HTML report page.""" + return ( + _HTML_TOP + + title + + _HTML_MID + + _CSS + + _HTML_AFTER_CSS + + tab_html + + _JS_HEADER + + js_data_extras + + _JS_LAYOUT_GLOBALS + + js_extra_globals + + _JS_TAB_SWITCHER + + js_init + + js_renders + + _JS_FOOTER + ) diff --git a/src/aiq_agent/tokenomics/report/_report_builders.py b/src/aiq_agent/tokenomics/report/_report_builders.py new file mode 100644 index 00000000..a7e51397 --- /dev/null +++ b/src/aiq_agent/tokenomics/report/_report_builders.py @@ -0,0 +1,384 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +"""Data aggregation helpers for the tokenomics report.""" + +from __future__ import annotations + +from collections import defaultdict +from datetime import datetime +from pathlib import Path + +from ..pricing import PricingRegistry +from ..profile import PHASE_ORCHESTRATOR +from ..profile import PHASE_ORDER +from ..profile import PHASE_PLANNER +from ..profile import PHASE_RESEARCHER +from ..profile import RequestProfile +from ._report_stats import _latency_stats +from ._report_stats import _pct + +PHASE_LABELS = { + PHASE_ORCHESTRATOR: "Orchestrator", + PHASE_PLANNER: "Planner", + PHASE_RESEARCHER: "Researcher", +} + + +def _build_report_data( + profiles: list[RequestProfile], + pricing: PricingRegistry, + config_path: str, + predicted_osl_map: dict[str, float] | None = None, +) -> dict: + # Flatten all per-call observations + all_llm: list[dict] = [] + all_tool: list[dict] = [] + for prof in profiles: + all_llm.extend(prof.llm_call_events) + all_tool.extend(prof.tool_call_events) + + # ── Token stats by model ────────────────────────────────────────────── + m_isls: dict[str, list] = defaultdict(list) + m_osls: dict[str, 
list] = defaultdict(list) + m_tps: dict[str, list] = defaultdict(list) + m_tot: dict[str, dict] = defaultdict( + lambda: { + "calls": 0, + "total_isl": 0, + "total_osl": 0, + "total_cached": 0, + "total_reasoning": 0, + } + ) + for ev in all_llm: + m = ev["model"] + m_isls[m].append(ev["isl"]) + m_osls[m].append(ev["osl"]) + if ev["tps"] > 0: + m_tps[m].append(ev["tps"]) + t = m_tot[m] + t["calls"] += 1 + t["total_isl"] += ev["isl"] + t["total_osl"] += ev["osl"] + t["total_cached"] += ev["cached"] + t["total_reasoning"] += ev["reasoning"] + + by_model_tokens: dict[str, dict] = {} + for m, t in m_tot.items(): + isls = m_isls[m] + osls = m_osls[m] + tps_vals = m_tps[m] + by_model_tokens[m] = { + "calls": t["calls"], + "total_isl": t["total_isl"], + "total_osl": t["total_osl"], + "total_cached": t["total_cached"], + "total_reasoning": t["total_reasoning"], + "isl_mean": round(sum(isls) / len(isls), 1) if isls else 0.0, + "isl_p50": round(_pct(isls, 50), 1), + "isl_p90": round(_pct(isls, 90), 1), + "isl_p99": round(_pct(isls, 99), 1), + "isl_max": max(isls) if isls else 0, + "isl_min": min(isls) if isls else 0, + "osl_mean": round(sum(osls) / len(osls), 1) if osls else 0.0, + "osl_p50": round(_pct(osls, 50), 1), + "osl_p90": round(_pct(osls, 90), 1), + "osl_p99": round(_pct(osls, 99), 1), + "osl_max": max(osls) if osls else 0, + "cache_rate": t["total_cached"] / t["total_isl"] if t["total_isl"] > 0 else 0.0, + "tps_mean": round(sum(tps_vals) / len(tps_vals), 2) if tps_vals else 0.0, + "tps_p50": round(_pct(tps_vals, 50), 2) if tps_vals else 0.0, + "tps_p90": round(_pct(tps_vals, 90), 2) if tps_vals else 0.0, + } + + # ── Token stats by component (phase) ───────────────────────────────── + ph_isls: dict[str, list] = defaultdict(list) + ph_osls: dict[str, list] = defaultdict(list) + ph_tot: dict[str, dict] = defaultdict( + lambda: { + "calls": 0, + "total_isl": 0, + "total_osl": 0, + "total_cached": 0, + "total_reasoning": 0, + } + ) + for ev in all_llm: + ph = 
ev["phase"]
+        ph_isls[ph].append(ev["isl"])
+        ph_osls[ph].append(ev["osl"])
+        t = ph_tot[ph]
+        t["calls"] += 1
+        t["total_isl"] += ev["isl"]
+        t["total_osl"] += ev["osl"]
+        t["total_cached"] += ev["cached"]
+        t["total_reasoning"] += ev["reasoning"]
+
+    by_component_tokens: dict[str, dict] = {}
+    for ph in PHASE_ORDER:
+        if ph not in ph_tot:
+            continue
+        t = ph_tot[ph]
+        isls = ph_isls[ph]
+        osls = ph_osls[ph]
+        label = PHASE_LABELS.get(ph, ph)
+        by_component_tokens[label] = {
+            "calls": t["calls"],
+            "total_isl": t["total_isl"],
+            "total_osl": t["total_osl"],
+            "total_cached": t["total_cached"],
+            "total_reasoning": t["total_reasoning"],
+            "isl_mean": round(sum(isls) / len(isls), 1) if isls else 0.0,
+            "isl_p50": round(_pct(isls, 50), 1),
+            "isl_p90": round(_pct(isls, 90), 1),
+            "isl_p99": round(_pct(isls, 99), 1),
+            "isl_max": max(isls) if isls else 0,
+            "osl_mean": round(sum(osls) / len(osls), 1) if osls else 0.0,
+            "osl_p50": round(_pct(osls, 50), 1),
+            "osl_p90": round(_pct(osls, 90), 1),
+            "osl_p99": round(_pct(osls, 99), 1),
+            "osl_max": max(osls) if osls else 0,
+            "cache_rate": t["total_cached"] / t["total_isl"] if t["total_isl"] > 0 else 0.0,
+        }
+
+    # ── ISL growth: avg ISL by sequential call index, per model ───────────
+    growth_data: dict[str, dict[int, list]] = defaultdict(lambda: defaultdict(list))
+    for ev in all_llm:
+        growth_data[ev["model"]][ev["call_idx"]].append(ev["isl"])
+
+    isl_growth: dict[str, list[dict]] = {}
+    for model in sorted(growth_data):
+        idx_map = growth_data[model]
+        isl_growth[model] = [
+            {"idx": idx, "avg_isl": round(sum(v) / len(v), 1), "n": len(v)}
+            for idx in sorted(idx_map)
+            for v in [idx_map[idx]]
+        ]
+
+    # ── ISL vs latency sample ─────────────────────────────────────────────
+    isl_latency_sample = [
+        {"isl": ev["isl"], "dur_s": ev["dur_s"], "model": ev["model"], "osl": ev["osl"]}
+        for ev in all_llm
+        if ev["dur_s"] > 0
+    ]
+
+    # ── Sys-prompt estimate (min ISL per model) ───────────────────────────
+    sys_prompt_est = {m: min(m_isls[m]) for m in m_isls if m_isls[m]}
+
+    # ── LLM latency per model ─────────────────────────────────────────────
+    m_durs: dict[str, list] = defaultdict(list)
+    for ev in all_llm:
+        if ev["dur_s"] > 0:
+            m_durs[ev["model"]].append(ev["dur_s"])
+
+    llm_latency = {m: _latency_stats(durs) for m, durs in m_durs.items()}
+
+    # ── Tool latency per tool ─────────────────────────────────────────────
+    t_durs: dict[str, list] = defaultdict(list)
+    for ev in all_tool:
+        if ev["dur_s"] > 0:
+            t_durs[ev["tool"]].append(ev["dur_s"])
+
+    tool_latency = {tool: _latency_stats(durs) for tool, durs in t_durs.items()}
+
+    # ── Cost by model ─────────────────────────────────────────────────────
+    by_model_cost: dict[str, float] = defaultdict(float)
+    for prof in profiles:
+        for ps in prof.phases:
+            by_model_cost[ps.model] += ps.cost_usd
+
+    # ── Cost by phase ─────────────────────────────────────────────────────
+    by_phase_cost: dict[str, float] = {}
+    for ph in PHASE_ORDER:
+        total = sum(prof.cost_for_phase(ph) for prof in profiles)
+        if total > 0:
+            by_phase_cost[PHASE_LABELS.get(ph, ph)] = round(total, 6)
+
+    # ── Per-query list ────────────────────────────────────────────────────
+    per_query = []
+    for prof in profiles:
+        pq_by_phase = {}
+        for ph in PHASE_ORDER:
+            label = PHASE_LABELS.get(ph, ph)
+            cost = prof.cost_for_phase(ph)
+            if cost > 0:
+                pq_by_phase[label] = round(cost, 6)
+        per_query.append(
+            {
+                "id": prof.request_index,
+                "question": prof.question,
+                "cost_usd": round(prof.grand_total_cost_usd, 6),
+                "llm_cost_usd": round(prof.total_cost_usd, 6),
+                "tool_cost_usd": round(prof.total_tool_cost_usd, 6),
+                "input_tokens": prof.total_prompt_tokens,
+                "output_tokens": prof.total_completion_tokens,
+                "cached_tokens": prof.total_cached_tokens,
+                "entry_count": prof.total_llm_calls,
+                "duration_s": round(prof.duration_s, 2),
+                "by_phase": pq_by_phase,
+            }
+        )
+
+    # ── Pricing snapshot ──────────────────────────────────────────────────
+    pricing_snapshot: dict[str, dict] = {}
+    for model in pricing.known_models():
+        p = pricing.get(model)
+        pricing_snapshot[model] = {
+            "input_per_1m_tokens": p.input_per_1m_tokens,
+            "cached_input_per_1m_tokens": p.cached_input_per_1m_tokens,
+            "output_per_1m_tokens": p.output_per_1m_tokens,
+        }
+    if pricing._default is not None:
+        pricing_snapshot["default"] = {
+            "input_per_1m_tokens": pricing._default.input_per_1m_tokens,
+            "cached_input_per_1m_tokens": pricing._default.cached_input_per_1m_tokens,
+            "output_per_1m_tokens": pricing._default.output_per_1m_tokens,
+        }
+
+    # ── Tool cost aggregation ─────────────────────────────────────────────
+    by_tool_cost: dict[str, dict] = defaultdict(lambda: {"calls": 0, "total_cost_usd": 0.0})
+    for ev in all_tool:
+        entry = by_tool_cost[ev["tool"]]
+        entry["calls"] += 1
+        entry["total_cost_usd"] += ev.get("cost_usd", 0.0)
+    by_tool_cost = {k: dict(v) for k, v in by_tool_cost.items()}
+
+    # Tool pricing snapshot (only configured tools)
+    tool_pricing_snapshot = {name: pricing.get_tool(name).cost_per_call for name in pricing.known_tools()}
+
+    # ── Predicted vs actual OSL (from NOVA-Predicted-OSL in CSV) ─────────
+    # NOTE: in current NAT traces, NOVA-Predicted-OSL is filled post-hoc with
+    # the actual completion tokens, so predicted == actual on every call.
+    # The list is populated here for forward-compatibility; the chart is hidden
+    # when all errors are zero (trivially perfect, not informative).
+    predicted_vs_actual: list[dict] = []
+    if predicted_osl_map:
+        for ev in all_llm:
+            pred = predicted_osl_map.get(ev.get("uuid", ""))
+            if pred is not None:
+                predicted_vs_actual.append(
+                    {
+                        "model": ev["model"],
+                        "predicted": pred,
+                        "actual": ev["osl"],
+                        "phase": ev["phase"],
+                    }
+                )
+
+    total_llm_cost = sum(p.total_cost_usd for p in profiles)
+    total_tool_cost = sum(p.total_tool_cost_usd for p in profiles)
+    grand_total = total_llm_cost + total_tool_cost
+    return {
+        "label": Path(config_path).name,
+        "generated_at": datetime.now().strftime("%Y-%m-%d %H:%M"),
+        "num_queries": len(profiles),
+        "total_cost_usd": round(grand_total, 6),
+        "llm_cost_usd": round(total_llm_cost, 6),
+        "tool_cost_usd": round(total_tool_cost, 6),
+        "avg_cost_usd": round(grand_total / len(profiles), 6) if profiles else 0.0,
+        "cache_savings_usd": round(sum(p.total_cache_savings_usd for p in profiles), 6),
+        "total_prompt_tokens": sum(p.total_prompt_tokens for p in profiles),
+        "total_cached_tokens": sum(p.total_cached_tokens for p in profiles),
+        "total_completion_tokens": sum(p.total_completion_tokens for p in profiles),
+        "total_llm_calls": sum(p.total_llm_calls for p in profiles),
+        "per_query": per_query,
+        "by_model": dict(by_model_cost),
+        "by_phase": by_phase_cost,
+        "by_tool": by_tool_cost,
+        "phase_order": [PHASE_LABELS.get(ph, ph) for ph in PHASE_ORDER],
+        "llm_latency": llm_latency,
+        "tool_latency": tool_latency,
+        "pricing_snapshot": pricing_snapshot,
+        "tool_pricing_snapshot": tool_pricing_snapshot,
+        "token_stats": {
+            "by_model": by_model_tokens,
+            "by_component": by_component_tokens,
+            "isl_growth": isl_growth,
+            "isl_latency_sample": isl_latency_sample,
+            "sys_prompt_est": sys_prompt_est,
+            "predicted_vs_actual": predicted_vs_actual,
+        },
+    }
+
+
+def _build_comparison_data(run_datas: list[dict]) -> dict:
+    """Return an A-vs-B comparison block to embed in the primary run's report_data.
+
+    Only the first two runs are compared. The per-query list is the UNION of
+    both runs' query IDs so that queries unique to one run are still visible in
+    the table (with null for the missing side). The delta bar chart in the
+    report only renders bars for queries present in both runs.
+    """
+    a = run_datas[0]
+    b = run_datas[1]
+
+    a_by_id = {q["id"]: q for q in a.get("per_query", [])}
+    b_by_id = {q["id"]: q for q in b.get("per_query", [])}
+    all_ids = sorted(set(a_by_id) | set(b_by_id))
+    common_ids = set(a_by_id) & set(b_by_id)
+
+    per_query_cmp = []
+    for qid in all_ids:
+        qa = a_by_id.get(qid)
+        qb = b_by_id.get(qid)
+        in_both = qa is not None and qb is not None
+
+        cost_a = qa["cost_usd"] if qa else None
+        cost_b = qb["cost_usd"] if qb else None
+        if in_both:
+            cost_delta: float | None = round(cost_b - cost_a, 6)  # type: ignore[operator]
+            cost_pct: float | None = round((cost_delta / cost_a * 100) if cost_a else 0.0, 1)
+        else:
+            cost_delta = cost_pct = None
+
+        per_query_cmp.append(
+            {
+                "id": qid,
+                "question": (qa or qb).get("question", ""),  # type: ignore[union-attr]
+                "cost_a": cost_a,
+                "cost_b": cost_b,
+                "cost_delta": cost_delta,
+                "cost_pct": cost_pct,
+                "isl_a": qa.get("input_tokens") if qa else None,
+                "isl_b": qb.get("input_tokens") if qb else None,
+                "osl_a": qa.get("output_tokens") if qa else None,
+                "osl_b": qb.get("output_tokens") if qb else None,
+                "duration_a": qa.get("duration_s") if qa else None,
+                "duration_b": qb.get("duration_s") if qb else None,
+                "llm_calls_a": qa.get("entry_count") if qa else None,
+                "llm_calls_b": qb.get("entry_count") if qb else None,
+                "in_both": in_both,
+            }
+        )
+
+    cost_delta_total = b["total_cost_usd"] - a["total_cost_usd"]
+    cost_pct_total = (cost_delta_total / a["total_cost_usd"] * 100) if a["total_cost_usd"] else 0.0
+    prompt_a = a.get("total_prompt_tokens", 0)
+    prompt_b = b.get("total_prompt_tokens", 0)
+
+    return {
+        "label_a": a["label"],
+        "label_b": b["label"],
+        "num_queries_a": a["num_queries"],
+        "num_queries_b": b["num_queries"],
+        "num_common_queries": len(common_ids),
+        "total_cost_a": a["total_cost_usd"],
+        "total_cost_b": b["total_cost_usd"],
+        "llm_cost_a": a["llm_cost_usd"],
+        "llm_cost_b": b["llm_cost_usd"],
+        "total_llm_calls_a": a["total_llm_calls"],
+        "total_llm_calls_b": b["total_llm_calls"],
+        "cache_rate_a": round(a.get("total_cached_tokens", 0) / prompt_a, 4) if prompt_a else 0.0,
+        "cache_rate_b": round(b.get("total_cached_tokens", 0) / prompt_b, 4) if prompt_b else 0.0,
+        "cost_delta": round(cost_delta_total, 6),
+        "cost_pct_change": round(cost_pct_total, 1),
+        "by_model_a": a.get("by_model", {}),
+        "by_model_b": b.get("by_model", {}),
+        "by_phase_a": a.get("by_phase", {}),
+        "by_phase_b": b.get("by_phase", {}),
+        "by_tool_b": b.get("by_tool", {}),
+        "llm_latency_b": b.get("llm_latency", {}),
+        "tool_latency_b": b.get("tool_latency", {}),
+        "token_stats_b": b.get("token_stats", {}),
+        "per_query": per_query_cmp,
+    }
diff --git a/src/aiq_agent/tokenomics/report/_report_stats.py b/src/aiq_agent/tokenomics/report/_report_stats.py
new file mode 100644
index 00000000..db30f591
--- /dev/null
+++ b/src/aiq_agent/tokenomics/report/_report_stats.py
@@ -0,0 +1,51 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""Pure statistical helpers for the tokenomics report. No project imports."""
+
+from __future__ import annotations
+
+import csv
+from pathlib import Path
+
+
+def _pct(data: list, p: float) -> float:
+    """Return the p-th percentile of ``data`` (linear interpolation)."""
+    if not data:
+        return 0.0
+    s = sorted(data)
+    k = (len(s) - 1) * p / 100.0
+    lo, hi = int(k), min(int(k) + 1, len(s) - 1)
+    return s[lo] + (s[hi] - s[lo]) * (k - lo)
+
+
+def _latency_stats(durations_s: list[float]) -> dict:
+    if not durations_s:
+        return {"count": 0, "p50_ms": 0.0, "p90_ms": 0.0, "p99_ms": 0.0, "max_ms": 0.0, "mean_ms": 0.0}
+    ms = [d * 1000.0 for d in durations_s]
+    return {
+        "count": len(ms),
+        "p50_ms": round(_pct(ms, 50), 2),
+        "p90_ms": round(_pct(ms, 90), 2),
+        "p99_ms": round(_pct(ms, 99), 2),
+        "max_ms": round(max(ms), 2),
+        "mean_ms": round(sum(ms) / len(ms), 2),
+    }
+
+
+def _load_csv_predictions(trace_path: str) -> dict[str, float]:
+    """
+    Load NOVA-Predicted-OSL values from standardized_data_all.csv if it lives
+    alongside the trace file. Returns UUID → predicted_osl mapping.
+    """
+    csv_path = Path(trace_path).parent / "standardized_data_all.csv"
+    if not csv_path.exists():
+        return {}
+    predictions: dict[str, float] = {}
+    with open(csv_path, newline="") as f:
+        for row in csv.DictReader(f):
+            if row.get("event_type") == "LLM_START" and row.get("NOVA-Predicted-OSL") and row.get("UUID"):
+                try:
+                    predictions[row["UUID"]] = float(row["NOVA-Predicted-OSL"])
+                except ValueError:
+                    pass
+    return predictions
diff --git a/src/aiq_agent/tokenomics/report/_report_template_comparison.py b/src/aiq_agent/tokenomics/report/_report_template_comparison.py
new file mode 100644
index 00000000..7920a257
--- /dev/null
+++ b/src/aiq_agent/tokenomics/report/_report_template_comparison.py
@@ -0,0 +1,539 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""HTML template and render helper for comparison-mode tokenomics reports. No project imports."""
+
+from __future__ import annotations
+
+import json
+
+from ._report_base import build_html
+
+_TAB_HTML = r"""
+
+
+
+
+
🤖 Cost by Model — Run A vs Run B +
Same model in both bars = same routing; model only in one run = model swap.
+
+
+
+
+
🏗 Cost by Phase — Run A vs Run B +
A phase cost shift often indicates fewer parallel calls or a changed routing policy.
+
+
+
+
+
+
📋 Per-Query Summary — Run A vs Run B
+
+ + + + + + + +
Query #Cost ACost BΔ CostΔ %ISL AISL BOSL AOSL BDur A (s)Dur B (s)Calls ACalls B
+
+
+
+ + +
+
+
+
🤖 Cost by Model — Run A vs Run B +
Same model in both bars = same routing, different volumes or prompts. + A model in only one run signals a model swap.
+
+
+
+
+
🏗 Cost by Phase — Run A vs Run B +
A phase cost shift (e.g. lower Researcher spend) often indicates fewer + parallel search calls or a changed routing policy.
+
+
+
+
+
+
🔍 Tool Cost — Run A vs Run B +
Compare per-tool API spend across the two runs.
+
+
+
+
+
📈 Per-Query Cost Delta (B − A) +
Green bars = Run B is cheaper; red bars = Run B costs more. + Only queries present in both runs are shown.
+
+
+
+
+ + +
+
+
+
📊 LLM p50 Latency — Run A vs Run B +
Median LLM response time per model across both runs.
+
+
+
+
+
📊 LLM p90 Latency — Run A vs Run B +
90th-percentile LLM response time per model across both runs.
+
+
+
+
+
+
🔍 Tool p90 Latency — Run A vs Run B +
90th-percentile tool latency across both runs.
+
+
+
+
+ + +
+
+
+
+
📥 ISL p50 — Run A vs Run B +
Median prompt token count per model across both runs.
+
+
+
+
+
📥 ISL p90 — Run A vs Run B +
90th-percentile prompt tokens per model across both runs.
+
+
+
+
+
+
+
📤 OSL p50 — Run A vs Run B +
Median completion token count per model across both runs.
+
+
+
+
+
📤 OSL p90 — Run A vs Run B +
90th-percentile completion tokens per model across both runs.
+
+
+
+
+
+
+
⚡ TPS — Run A vs Run B +
Completion tokens per second per model across both runs.
+
+
+
+
+
🧩 Cache Rate — Run A vs Run B +
Fraction of prompt tokens served from cache per model.
+
+
+
+
+
+ + +
+
+
⏱💰 Latency vs Cost — Run A vs Run B +
Circles = Run A queries; diamonds = Run B queries. Top-right = slow and + expensive in both runs.
+
+
+
+
+
💵 Cost per 1K Output Tokens — Run A vs Run B +
Effective output cost per model. Lower = more efficient generation.
+
+
+
+
+ + +
+
+
📋 Per-Query Comparison — Run A vs Run B +
All queries from both runs. + Dimmed rows = query only in one run (— in missing columns). + Coloured rows = queries in both runs, with cost delta.
+
+
+ + + + + + + + + +
Query #QuestionCost ACost BΔ CostΔ %ISL AISL BOSL AOSL BDur A (s)Dur B (s)Calls ACalls B
+
+
+
+""" + +_JS_DATA_EXTRAS = r""" +const cmp = DATA.comparison; +""" + +_JS_EXTRA_GLOBALS = r""" +const fmt$N = v => v == null ? '\u2014' : fmt$(v); +const fmtKN = v => v == null ? '\u2014' : fmtK(v); +const fmtTN = v => v == null ? '\u2014' : (+v).toFixed(1); +const fmtCN = v => v == null ? '\u2014' : String(v); + +// ── grouped A-vs-B bar helper ───────────────────────────────────────────────── +function _cmpBar(divId, keysA, keysB, valA, valB, la, lb, ytitle, h) { + const allKeys = [...new Set([...keysA, ...keysB])]; + const mapA = Object.fromEntries(keysA.map((k,i) => [k, valA[i]])); + const mapB = Object.fromEntries(keysB.map((k,i) => [k, valB[i]])); + Plotly.newPlot(divId, [ + { type:'bar', name:la, x:allKeys, y:allKeys.map(k=>mapA[k]||0), marker:{color:'#58a6ff'} }, + { type:'bar', name:lb, x:allKeys, y:allKeys.map(k=>mapB[k]||0), marker:{color:'#3fb950'} }, + ], L({ height:h||300, barmode:'group', yaxis:{title:ytitle}, + xaxis:{automargin:true,tickangle:-25}, margin:{t:20,r:20,b:90,l:70} }), CFG); +} +""" + +_JS_INIT = r""" +// ── INIT ───────────────────────────────────────────────────────────────────── +const la = cmp.label_a; +const lb = cmp.label_b; +document.getElementById('headerMeta').textContent = + la + ' \u2194 ' + lb + ' \u2022 ' + DATA.generated_at + + ' \u2022 ' + cmp.num_common_queries + ' aligned queries'; +""" + +_JS_RENDERS = r""" +// ── OVERVIEW ───────────────────────────────────────────────────────────────── +function renderOverview() { + const costSaved = -cmp.cost_delta; + const savingsPct = -cmp.cost_pct_change; + const callsDelta = cmp.total_llm_calls_b - cmp.total_llm_calls_a; + const ds = v => (v >= 0 ? '+' : '') + v; + document.getElementById('overviewStats').innerHTML = ` +
Run A Total Cost
+
${fmt$(cmp.total_cost_a,2)}
+
${cmp.label_a}
+
Run B Total Cost
+
${fmt$(cmp.total_cost_b,2)}
+
${cmp.label_b}
+
Cost Delta (B \u2212 A)
+
${costSaved>=0?'\u2212':'+'}${fmt$(Math.abs(cmp.cost_delta),2)}
+
${ + savingsPct>=0?savingsPct.toFixed(1)+'% cheaper':(-savingsPct).toFixed(1)+'% costlier' + }
+
Aligned / A / B
+
${cmp.num_common_queries}
+
of ${cmp.num_queries_a} A & ${cmp.num_queries_b} B
+
LLM Calls A \u2192 B
+
${cmp.total_llm_calls_a} \u2192 ${cmp.total_llm_calls_b}
+
${ds(callsDelta)} calls
+
Cache Rate A / B
+
${(cmp.cache_rate_a*100).toFixed(1)}% / ${(cmp.cache_rate_b*100).toFixed(1)}%
+ `; + + // Cost by model A-vs-B + const allMods = [...new Set([...Object.keys(cmp.by_model_a||{}), ...Object.keys(cmp.by_model_b||{})])]; + const filtMods = allMods.filter(m => (cmp.by_model_a[m]||0) > 0.0001 || (cmp.by_model_b[m]||0) > 0.0001); + _cmpBar('overviewModelBar', + filtMods, filtMods, + filtMods.map(m=>cmp.by_model_a[m]||0), filtMods.map(m=>cmp.by_model_b[m]||0), + la, lb, 'Cost (USD)', 280); + + // Cost by phase A-vs-B + const allPhs = [...new Set([...Object.keys(cmp.by_phase_a||{}), ...Object.keys(cmp.by_phase_b||{})])]; + _cmpBar('overviewPhaseBar', + allPhs, allPhs, + allPhs.map(p=>cmp.by_phase_a[p]||0), allPhs.map(p=>cmp.by_phase_b[p]||0), + la, lb, 'Cost (USD)', 280); + + // Per-query comparison table + const pq = cmp.per_query || []; + document.querySelector('#overviewTable tbody').innerHTML = pq.map(q => { + const hasBoth = q.in_both; + const dc = !hasBoth ? '#8b949e' : (q.cost_delta <= 0 ? '#3fb950' : '#f85149'); + const dTxt = !hasBoth ? '\u2014' + : (q.cost_delta<0?'\u2212':q.cost_delta>0?'+':'')+'$'+Math.abs(q.cost_delta).toFixed(4); + const pStr = !hasBoth ? 
'\u2014' : (q.cost_pct<=0?'':'+') + q.cost_pct.toFixed(1) + '%'; + return ` + ${q.id} + ${fmt$N(q.cost_a)} + ${fmt$N(q.cost_b)} + ${dTxt} + ${pStr} + ${fmtKN(q.isl_a)}${fmtKN(q.isl_b)} + ${fmtKN(q.osl_a)}${fmtKN(q.osl_b)} + ${fmtTN(q.duration_a)}${fmtTN(q.duration_b)} + ${fmtCN(q.llm_calls_a)}${fmtCN(q.llm_calls_b)} + `; + }).join(''); +} + +// ── COST ───────────────────────────────────────────────────────────────────── +function renderCost() { + // Cost by model A-vs-B + const allMods = [...new Set([...Object.keys(cmp.by_model_a||{}), ...Object.keys(cmp.by_model_b||{})])]; + const filtMods = allMods.filter(m => (cmp.by_model_a[m]||0) > 0.0001 || (cmp.by_model_b[m]||0) > 0.0001); + _cmpBar('costCmpModelBar', + filtMods, filtMods, + filtMods.map(m=>cmp.by_model_a[m]||0), filtMods.map(m=>cmp.by_model_b[m]||0), + la, lb, 'Cost (USD)', 300); + + // Cost by phase A-vs-B + const allPhs = [...new Set([...Object.keys(cmp.by_phase_a||{}), ...Object.keys(cmp.by_phase_b||{})])]; + _cmpBar('costCmpPhaseBar', + allPhs, allPhs, + allPhs.map(p=>cmp.by_phase_a[p]||0), allPhs.map(p=>cmp.by_phase_b[p]||0), + la, lb, 'Cost (USD)', 300); + + // Tool cost A-vs-B (if available) + const toolA = Object.keys((DATA.by_tool)||{}).filter(t => (DATA.by_tool[t].total_cost_usd||0) > 0); + const toolBData = cmp.by_tool_b || {}; + const toolB = Object.keys(toolBData).filter(t => (toolBData[t].total_cost_usd||0) > 0); + const allTools = [...new Set([...toolA, ...toolB])]; + if (allTools.length > 0) { + _cmpBar('costCmpToolBar', + allTools, allTools, + allTools.map(t => (DATA.by_tool[t]||{}).total_cost_usd||0), + allTools.map(t => (toolBData[t]||{}).total_cost_usd||0), + la, lb, 'Cost (USD)', 280); + } else { + const tc = document.getElementById('costCmpToolCard'); + if (tc) tc.style.display = 'none'; + } + + // Per-query cost delta bar + const pqBoth = (cmp.per_query||[]).filter(q => q.in_both); + if (pqBoth.length) { + Plotly.newPlot('costCmpDeltaBar', [{ + type:'bar', x:pqBoth.map(q=>'Q'+q.id), 
y:pqBoth.map(q=>q.cost_delta), + text:pqBoth.map(q=>(q.cost_delta<=0?'':'+')+'$'+(+q.cost_delta).toFixed(4)), + textposition:'outside', + marker:{ color:pqBoth.map(q=>q.cost_delta<=0?'#3fb950':'#f85149') }, + hovertemplate:'Q%{x}: $%{y:.4f}', + }], L({ height:320, + yaxis:{title:'Cost delta: B \u2212 A (USD)', zerolinecolor:'#8b949e', zeroline:true}, + xaxis:{automargin:true, tickangle:-45}, + margin:{t:20,r:20,b:110,l:70}, showlegend:false }), CFG); + } else { + document.getElementById('costCmpDeltaBar').innerHTML = + '

No queries aligned between the two runs.

'; + } +} + +// ── LATENCY ────────────────────────────────────────────────────────────────── +function renderLatency() { + // LLM p50 A-vs-B + const llmA = DATA.llm_latency || {}; + const llmB = cmp.llm_latency_b || {}; + const allLlmModels = [...new Set([...Object.keys(llmA), ...Object.keys(llmB)])]; + _cmpBar('latCmpLlmP50Bar', + allLlmModels, allLlmModels, + allLlmModels.map(m => (llmA[m]||{}).p50_ms||0).map(v=>v/1000), + allLlmModels.map(m => (llmB[m]||{}).p50_ms||0).map(v=>v/1000), + la, lb, 'Seconds (p50)', 300); + + // LLM p90 A-vs-B + _cmpBar('latCmpLlmP90Bar', + allLlmModels, allLlmModels, + allLlmModels.map(m => (llmA[m]||{}).p90_ms||0).map(v=>v/1000), + allLlmModels.map(m => (llmB[m]||{}).p90_ms||0).map(v=>v/1000), + la, lb, 'Seconds (p90)', 300); + + // Tool p90 A-vs-B + const toolA = DATA.tool_latency || {}; + const toolB = cmp.tool_latency_b || {}; + const allTools = [...new Set([...Object.keys(toolA), ...Object.keys(toolB)])] + .filter(t => ((toolA[t]||{}).p90_ms||0) > 10 || ((toolB[t]||{}).p90_ms||0) > 10); + if (allTools.length) { + _cmpBar('latCmpToolP90Bar', + allTools, allTools, + allTools.map(t => (toolA[t]||{}).p90_ms||0).map(v=>v/1000), + allTools.map(t => (toolB[t]||{}).p90_ms||0).map(v=>v/1000), + la, lb, 'Seconds (p90)', 300); + } else { + document.getElementById('latCmpToolP90Bar').innerHTML = + '

No significant tool latency data

'; + } +} + +// ── TOKENS ─────────────────────────────────────────────────────────────────── +function renderTokens() { + const ts = DATA.token_stats || {}; + const bm = ts.by_model || {}; + const tsB = cmp.token_stats_b || {}; + const bmB = tsB.by_model || {}; + + const models = Object.keys(bm); + const modelsB = Object.keys(bmB); + const allModels = [...new Set([...models, ...modelsB])]; + + // Token totals for Run A + const totalPrompt = models.reduce((s,m) => s + (bm[m].total_isl||0), 0); + const totalComp = models.reduce((s,m) => s + (bm[m].total_osl||0), 0); + const totalCached = models.reduce((s,m) => s + (bm[m].total_cached||0), 0); + const totalCalls = models.reduce((s,m) => s + (bm[m].calls||0), 0); + const cacheRate = totalPrompt > 0 ? (totalCached/totalPrompt*100).toFixed(1) : '0'; + + // Token totals for Run B + const totalPromptB = modelsB.reduce((s,m) => s + (bmB[m].total_isl||0), 0); + const totalCompB = modelsB.reduce((s,m) => s + (bmB[m].total_osl||0), 0); + const totalCachedB = modelsB.reduce((s,m) => s + (bmB[m].total_cached||0), 0); + const totalCallsB = modelsB.reduce((s,m) => s + (bmB[m].calls||0), 0); + const cacheRateB = totalPromptB > 0 ? (totalCachedB/totalPromptB*100).toFixed(1) : '0'; + + document.getElementById('tokenStats').innerHTML = ` +
LLM Calls A / B
+
${fmtK(totalCalls)} / ${fmtK(totalCallsB)}
+
Total Prompt A / B
+
${fmtK(totalPrompt)} / ${fmtK(totalPromptB)}
+
ISL tokens
+
Total Completion A / B
+
${fmtK(totalComp)} / ${fmtK(totalCompB)}
+
OSL tokens
+
Cache Rate A / B
+
${cacheRate}% / ${cacheRateB}%
+ `; + + // ISL p50 A-vs-B + _cmpBar('tokenCmpIslP50Bar', allModels, allModels, + allModels.map(m=>(bm[m]||{}).isl_p50||0), allModels.map(m=>(bmB[m]||{}).isl_p50||0), + la, lb, 'ISL p50 (tokens)', 280); + + // ISL p90 A-vs-B + _cmpBar('tokenCmpIslP90Bar', allModels, allModels, + allModels.map(m=>(bm[m]||{}).isl_p90||0), allModels.map(m=>(bmB[m]||{}).isl_p90||0), + la, lb, 'ISL p90 (tokens)', 280); + + // OSL p50 A-vs-B + _cmpBar('tokenCmpOslP50Bar', allModels, allModels, + allModels.map(m=>(bm[m]||{}).osl_p50||0), allModels.map(m=>(bmB[m]||{}).osl_p50||0), + la, lb, 'OSL p50 (tokens)', 280); + + // OSL p90 A-vs-B + _cmpBar('tokenCmpOslP90Bar', allModels, allModels, + allModels.map(m=>(bm[m]||{}).osl_p90||0), allModels.map(m=>(bmB[m]||{}).osl_p90||0), + la, lb, 'OSL p90 (tokens)', 280); + + // TPS A-vs-B + _cmpBar('tokenCmpTpsBar', allModels, allModels, + allModels.map(m=>(bm[m]||{}).tps_mean||0), allModels.map(m=>(bmB[m]||{}).tps_mean||0), + la, lb, 'Completion tokens / second', 280); + + // Cache rate A-vs-B + _cmpBar('tokenCmpCacheBar', allModels, allModels, + allModels.map(m=>((bm[m]||{}).cache_rate||0)*100), + allModels.map(m=>((bmB[m]||{}).cache_rate||0)*100), + la, lb, 'Cache rate (%)', 280); +} + +// ── EFFICIENCY ──────────────────────────────────────────────────────────────── +function renderEfficiency() { + const bm = (DATA.token_stats || {}).by_model || {}; + const models = Object.keys(bm); + + // Scatter: run A circles, run B diamonds + const pqA = DATA.per_query || []; + const pqBData = cmp.per_query || []; + const scatterTraces = []; + if (pqA.length) { + scatterTraces.push({ + type: 'scatter', mode: 'markers+text', + name: la, + x: pqA.map(q => q.duration_s||0), + y: pqA.map(q => q.cost_usd||0), + text: pqA.map(q => 'Q'+q.id), + textposition: 'top center', + textfont: {size:9, color:'#8b949e'}, + marker: { symbol:'circle', color:'#58a6ff', size:9, opacity:.8 }, + hovertemplate: 'Run A Q%{text}
%{x:.1f}s $%{y:.4f}', + }); + } + const pqBBoth = pqBData.filter(q => q.in_both && q.duration_b != null && q.cost_b != null); + if (pqBBoth.length) { + scatterTraces.push({ + type: 'scatter', mode: 'markers+text', + name: lb, + x: pqBBoth.map(q => q.duration_b||0), + y: pqBBoth.map(q => q.cost_b||0), + text: pqBBoth.map(q => 'Q'+q.id), + textposition: 'top center', + textfont: {size:9, color:'#8b949e'}, + marker: { symbol:'diamond', color:'#3fb950', size:9, opacity:.8 }, + hovertemplate: 'Run B Q%{text}
%{x:.1f}s $%{y:.4f}', + }); + } + Plotly.newPlot('effCmpScatter', scatterTraces, + L({ height:380, xaxis:{title:'Workflow duration (s)'}, + yaxis:{title:'Total cost (USD)'}, + margin:{t:20,r:20,b:60,l:70} }), CFG); + + // Cost per 1K OSL A-vs-B + const bmB = (cmp.token_stats_b||{}).by_model || {}; + const allMods = [...new Set([...models, ...Object.keys(bmB)])]; + const cpkA = allMods.map(m => bm[m] && bm[m].total_osl > 0 ? (DATA.by_model[m]||0)/(bm[m].total_osl/1000) : 0); + const cpkB = allMods.map( + m => bmB[m] && bmB[m].total_osl > 0 ? ((cmp.by_model_b||{})[m]||0)/(bmB[m].total_osl/1000) : 0); + _cmpBar('effCmpCostPerKOslBar', + allMods, allMods, cpkA, cpkB, + la, lb, '$ per 1K completion tokens', 300); +} + +// ── PER-QUERY DETAIL ────────────────────────────────────────────────────────── +function renderDetail() { + const pq = cmp.per_query || []; + const tbody = document.querySelector('#detailCmpTable tbody'); + tbody.innerHTML = pq.map(q => { + const hasBoth = q.in_both; + const dc = !hasBoth ? '#8b949e' : (q.cost_delta <= 0 ? '#3fb950' : '#f85149'); + const dTxt = !hasBoth ? '\u2014' + : (q.cost_delta<0?'\u2212':q.cost_delta>0?'+':'')+'$'+Math.abs(q.cost_delta).toFixed(4); + const pStr = !hasBoth ? 
'\u2014' : (q.cost_pct<=0?'':'+') + q.cost_pct.toFixed(1) + '%'; + const qtxt = (q.question||'').substring(0,80) + (q.question&&q.question.length>80?'\u2026':''); + return ` + ${q.id} + ${qtxt||'\u2014'} + ${fmt$N(q.cost_a)} + ${fmt$N(q.cost_b)} + ${dTxt} + ${pStr} + ${fmtKN(q.isl_a)} + ${fmtKN(q.isl_b)} + ${fmtKN(q.osl_a)} + ${fmtKN(q.osl_b)} + ${fmtTN(q.duration_a)} + ${fmtTN(q.duration_b)} + ${fmtCN(q.llm_calls_a)} + ${fmtCN(q.llm_calls_b)} + `; + }).join(''); +} +""" + +_HTML = build_html( + title="AIQ Tokenomics Report \u2014 Comparison", + tab_html=_TAB_HTML, + js_data_extras=_JS_DATA_EXTRAS, + js_extra_globals=_JS_EXTRA_GLOBALS, + js_init=_JS_INIT, + js_renders=_JS_RENDERS, +) + + +def render_html(report_data: dict) -> str: + return _HTML.replace("__REPORT_DATA_JSON__", json.dumps(report_data, ensure_ascii=False)) diff --git a/src/aiq_agent/tokenomics/report/_report_template_single.py b/src/aiq_agent/tokenomics/report/_report_template_single.py new file mode 100644 index 00000000..efa6acb1 --- /dev/null +++ b/src/aiq_agent/tokenomics/report/_report_template_single.py @@ -0,0 +1,765 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +"""HTML template and render helper for single-run tokenomics reports. No project imports.""" + +from __future__ import annotations + +import json + +from ._report_base import build_html + +_TAB_HTML = r""" + +
+
+
+
+
🤖 Cost by Model +
Which model is consuming most of the budget?
+
+
+
+
+
🏗 Cost by Phase +
Orchestrator = reasoning overhead; + Researcher = parallel search calls. High Researcher share means many tool-heavy sub-tasks.
+
+
+
+
+
+
📋 Per-Query Summary
+
+ + + + + + +
Query #Cost ($)Prompt (ISL)Completion (OSL)CachedCache %LLM CallsDuration (s)
+
+
+
+ + +
+
+
+
🥧 Cost Split by Model +
Hover for exact values. A single dominant slice means one model drives + nearly all spend.
+
+
+
+
+
🏗 Cost by Phase +
Total spend per phase summed across all queries. Orchestrator dominance + is normal; unexpectedly high Researcher cost suggests overly broad search loops.
+
+
+
+
+
+
🔍 Tool API Cost by Tool +
Per-call cost × invocation count for each tool. These charges are separate + from LLM token costs. High search costs relative to LLM costs suggest reducing max_results or + switching to a cheaper search provider.
+
+
+
+
+
📦 Cost by Phase per Query +
Spot outlier queries and identify which phase drove the extra cost. Uniform + bars = consistent workload; spikes = difficult queries.
+
+
+
+
+ + +
+
+
+
📊 LLM Latency Percentiles by Model +
A large gap between p50 and p99 means occasional very long completions — + usually caused by high OSL. If p50 is already slow, the bottleneck is network or server load. +
+
+
+
+
+
🔍 Tool Latency Percentiles +
Search/web tools typically run 3–8 s. p90 above 10 s signals a retrieval + bottleneck that adds directly to total query time.
+
+
+
+
+
+ + +
+
+
+
+
📥 ISL (Input Sequence Length) — p50 / p90 / p99 by Model +
Prompt token counts sent to each model. A rising p99 vs p50 means some calls + hit much larger contexts — check ISL Growth below to see when.
+
+
+
+
+
📤 OSL (Output Sequence Length) — p50 / p90 / p99 by Model +
Completion token counts. High p99 OSL means some calls produce very long + reasoning chains or verbose outputs, which directly drives both cost and latency.
+
+
+
+
+
+
+
📈 Context Accumulation — Avg ISL by Call Index +
How prompt size grows over sequential LLM calls within a query. An upward + slope means the model is accumulating conversation history. A plateau suggests caching or a + fresh-start pattern. The dashed line is the estimated system-prompt floor (minimum ISL + observed).
+
+
+
+
+
⚡ Throughput — Completion Tokens / Second +
Inference speed per model. Low TPS with small OSL often indicates network + round-trip overhead rather than slow generation. Compare models to spot which is the throughput + bottleneck.
+
+
+
+
+
+
🔮 NOVA-Predicted vs Actual OSL +
Each dot is one LLM call. Points on the diagonal line = perfect prediction. + Points above = model generated more than predicted (underestimate). Points below = model generated + less (overestimate). Tight clustering around the diagonal means NAT's routing hints are accurate. +
+
+
+
+
+
+
🧩 Token Budget — Cached vs Uncached vs Completion +
Green = tokens served from cache (billed at the cheaper cached rate). Grey + = uncached prompt tokens (full price). Blue = completion tokens (most expensive per token). + Maximise green to reduce cost.
+
+
+
+
+
🔗 ISL vs Latency — Is Prompt Size the Bottleneck? +
Each dot is one LLM call. A diagonal trend means longer prompts take longer + (prompt-bound). A flat cloud means latency is driven by output length or server capacity, not + context size.
+
+
+
+
+
+
🏗 Token Mix by Phase +
Total tokens consumed across Orchestrator / Planner / Researcher phases. + Cached (green) vs uncached (grey) prompt tokens show how well each phase leverages the prompt + cache. Reasoning tokens (purple) are non-billed thinking tokens where applicable.
+
+
+
+
+
📋 Token Summary Table (by model)
+
+ + + + + + + + + + +
ModelCallsAvg ISLp90 ISLMax ISLAvg OSLp90 OSLMax OSLTotal PromptTotal CompletionTotal CachedCache RateAvg TPSSys Prompt Est.
+
+
+
+ + +
+
+
⏱💰 Latency vs Cost per Query +
Each dot is one query. Queries in the top-right are both slow and expensive — + highest priority for optimization. A diagonal cluster means slow queries are inherently costlier + (more LLM calls). Outliers far from the cluster are worth investigating individually.
+
+
+
+
+
+
⚡📉 TPS vs ISL — Does Throughput Drop as Context Grows? +
Each dot is one LLM call. A downward slope means longer prompts hurt + inference speed (prompt-bound). A flat cloud means generation speed is independent of context + size (compute-bound). Use this to decide whether KV-cache optimizations would help.
+
+
+
+
+
💵 Effective Cost per 1K Output Tokens by Model +
Total spend divided by total completion tokens generated — the true output + cost. A model with cheaper listed pricing may still be more expensive here if it generates more + tokens to answer the same question.
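The bar height is per model: that model's total spend divided by the thousands of completion tokens it generated, as the render code computes from `by_model` cost and `total_osl`. A minimal sketch:

```python
def cost_per_1k_output(model_cost_usd: float, total_completion_tokens: int) -> float:
    """Effective $ per 1K completion tokens; 0.0 when the model generated nothing."""
    if total_completion_tokens <= 0:
        return 0.0
    return model_cost_usd / (total_completion_tokens / 1000)
```

A verbose model can lose here despite cheaper listed pricing: spending $2.00 to generate 500k tokens yields $0.004 per 1K, regardless of the nominal rate card.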
+
+
+
+
+
+
🎯 Model Efficiency — Output Cost vs p90 Latency +
Each point is a model. Bottom-left is ideal: cheap output AND fast. Use this + to compare model trade-offs when evaluating alternatives. Bubble size = total LLM call count. +
+
+
+
+
+ + +
+
+
📋 Per-Query Token & Cost Detail
+
+ + + + + + + + +
+<th>Query #</th><th>Cost ($)</th><th>Prompt (ISL)</th><th>Completion (OSL)</th><th>Cached</th><th>ISL:OSL</th><th>LLM Calls</th><th>Duration (s)</th><th>Question</th>
+
+
+
+""" + +_JS_INIT = r""" +// ── INIT ───────────────────────────────────────────────────────────────────── +document.getElementById('headerMeta').textContent = + DATA.label + ' \u2022 ' + DATA.generated_at + ' \u2022 ' + DATA.num_queries + ' queries'; +""" + +_JS_RENDERS = r""" +// ── OVERVIEW ───────────────────────────────────────────────────────────────── +function renderOverview() { + const d = DATA; + const cr = d.total_prompt_tokens > 0 + ? (d.total_cached_tokens / d.total_prompt_tokens * 100).toFixed(1) : '0'; + document.getElementById('overviewStats').innerHTML = ` +
Queries
+
${d.num_queries}
+
Total Cost
+
${fmt$(d.total_cost_usd,2)}
+
${fmt$(d.avg_cost_usd,4)}/query avg
+
LLM Cost
+
${fmt$(d.llm_cost_usd,2)}
+
token charges
+
Tool API Cost
+
${fmt$(d.tool_cost_usd,2)}
+
search / external APIs
+
Cache Savings
+
${fmt$(d.cache_savings_usd,2)}
+
${cr}% cache rate
+
Total Prompt
+
${fmtK(d.total_prompt_tokens)}
+
ISL tokens
+
Total Completion
+
${fmtK(d.total_completion_tokens)}
+
OSL tokens
+
Total LLM Calls
+
${d.total_llm_calls}
+ `; + + // Cost by model bar + const mods = Object.keys(d.by_model).filter(m => d.by_model[m] > 0.0001); + Plotly.newPlot('overviewModelBar', [{ + type: 'bar', x: mods, y: mods.map(m => d.by_model[m]), + text: mods.map(m => fmt$(d.by_model[m],3)), textposition: 'outside', + marker: { color: PALETTE }, + }], L({ height: 280, yaxis: {title:'Cost (USD)'}, xaxis: {automargin:true,tickangle:-25}, + margin: {t:20,r:20,b:90,l:70}, showlegend: false }), CFG); + + // Cost by phase horizontal bar + const phases = Object.keys(d.by_phase); + Plotly.newPlot('overviewPhaseBar', [{ + type: 'bar', orientation: 'h', + y: phases, x: phases.map(p => d.by_phase[p]), + text: phases.map(p => fmt$(d.by_phase[p],3)), textposition: 'outside', + marker: { color: '#bc8cff' }, + }], L({ height: 280, xaxis: {title:'Cost (USD)'}, yaxis: {automargin:true}, + margin: {t:20,r:80,b:50,l:140} }), CFG); + + // Per-query table + const tbody = document.querySelector('#overviewTable tbody'); + tbody.innerHTML = d.per_query.map(q => { + const cr2 = q.input_tokens > 0 ? 
(q.cached_tokens/q.input_tokens*100).toFixed(1)+'%' : '\u2014'; + return ` + ${q.id} + ${fmt$(q.cost_usd)} + ${(q.input_tokens||0).toLocaleString()} + ${(q.output_tokens||0).toLocaleString()} + ${(q.cached_tokens||0).toLocaleString()} + ${cr2} + ${q.entry_count||0} + ${(q.duration_s||0).toFixed(1)} + `; + }).join(''); +} + +// ── COST ───────────────────────────────────────────────────────────────────── +function renderCost() { + const d = DATA; + const mods = Object.keys(d.by_model).filter(m => d.by_model[m] > 0.0001); + + // Donut by model + Plotly.newPlot('costPie', [{ + type: 'pie', labels: mods, values: mods.map(m => d.by_model[m]), + hole: .45, textfont: { color: '#e6edf3' }, + marker: { colors: PALETTE }, + }], L({ height: 320, showlegend: true, margin: {t:20,r:120,b:20,l:20} }), CFG); + + // Horizontal bar by phase + const phases = Object.keys(d.by_phase); + Plotly.newPlot('costPhaseBar', [{ + type: 'bar', orientation: 'h', + y: phases, x: phases.map(p => d.by_phase[p]), + text: phases.map(p => fmt$(d.by_phase[p],3)), textposition: 'outside', + marker: { color: '#bc8cff' }, + }], L({ height: 320, xaxis:{title:'Cost (USD)'}, yaxis:{automargin:true}, + margin:{t:20,r:80,b:50,l:140} }), CFG); + + // Tool cost bar + const toolData = d.by_tool || {}; + const toolCard = document.getElementById('toolCostCard'); + const toolNames = Object.keys(toolData).filter(t => toolData[t].total_cost_usd > 0 || toolData[t].calls > 0); + if (toolNames.length > 0) { + const toolCosts = toolNames.map(t => toolData[t].total_cost_usd); + const toolCalls = toolNames.map(t => toolData[t].calls); + const hasCost = toolCosts.some(c => c > 0); + if (hasCost) { + Plotly.newPlot('toolCostBar', [{ + type: 'bar', x: toolNames, y: toolCosts, + text: toolNames.map((t,i) => fmt$(toolCosts[i],3) + ' (' + toolCalls[i] + ' calls)'), + textposition: 'outside', + marker: { color: '#39d353' }, + }], L({ height: 280, yaxis:{title:'Cost (USD)'}, xaxis:{automargin:true,tickangle:-25}, + 
margin:{t:20,r:20,b:90,l:70}, showlegend:false }), CFG); + } else { + // Show call counts even when all costs are $0 (tools not priced) + Plotly.newPlot('toolCostBar', [{ + type: 'bar', x: toolNames, y: toolCalls, + text: toolCalls.map(c => c + ' calls'), textposition: 'outside', + marker: { color: '#58a6ff' }, + }], L({ height: 280, yaxis:{title:'Call Count'}, xaxis:{automargin:true,tickangle:-25}, + margin:{t:20,r:20,b:90,l:70}, showlegend:false }), CFG); + if (toolCard) { + const sub = toolCard.querySelector('.card-sub'); + if (sub) { + sub.textContent = + 'Tool call counts shown (no cost data \u2014 add tool prices to ' + + 'tokenomics.pricing.tools in the config to see cost breakdown).'; + } + } + } + } else { + if (toolCard) toolCard.style.display = 'none'; + } + + // Stacked bar: cost by phase per query + const stackTraces = (d.phase_order||[]).map(ph => ({ + type: 'bar', name: ph, + x: d.per_query.map(q => 'Q' + q.id), + y: d.per_query.map(q => (q.by_phase||{})[ph]||0), + marker: { color: PHASE_COLORS[ph]||'#8b949e' }, + })); + Plotly.newPlot('costPerQueryStack', stackTraces, + L({ height: 300, barmode: 'stack', yaxis:{title:'Cost (USD)'}, + xaxis:{automargin:true,tickangle:-25}, margin:{t:20,r:20,b:90,l:70} }), CFG); +} + +// ── LATENCY ────────────────────────────────────────────────────────────────── +function renderLatency() { + const d = DATA; + + // LLM percentile bars + const llmE = Object.entries(d.llm_latency||{}).sort((a,b) => b[1].p90_ms - a[1].p90_ms); + if (llmE.length) { + const names = llmE.map(e => e[0]); + Plotly.newPlot('llmLatencyBar', [ + { type:'bar', name:'p50', x:names, y:llmE.map(e=>e[1].p50_ms/1000), marker:{color:'#3fb950'} }, + { type:'bar', name:'p90', x:names, y:llmE.map(e=>e[1].p90_ms/1000), marker:{color:'#58a6ff'} }, + { type:'bar', name:'p99', x:names, y:llmE.map(e=>e[1].p99_ms/1000), marker:{color:'#f85149'} }, + ], L({ height:320, barmode:'group', yaxis:{title:'Seconds'}, xaxis:{automargin:true,tickangle:-30}, + 
margin:{t:20,r:20,b:100,l:60} }), CFG); + } else { + document.getElementById('llmLatencyBar').innerHTML = + '

' + + 'No LLM latency data (missing span_event_timestamp?)

'; + } + + // Tool percentile bars (skip near-zero tools) + const toolE = Object.entries(d.tool_latency||{}) + .filter(([k,v]) => v.p90_ms > 10) + .sort((a,b) => b[1].p90_ms - a[1].p90_ms) + .slice(0, 12); + if (toolE.length) { + const tnames = toolE.map(e => e[0]); + Plotly.newPlot('toolLatencyBar', [ + { type:'bar', name:'p50', x:tnames, y:toolE.map(e=>e[1].p50_ms/1000), marker:{color:'#3fb950'} }, + { type:'bar', name:'p90', x:tnames, y:toolE.map(e=>e[1].p90_ms/1000), marker:{color:'#58a6ff'} }, + { type:'bar', name:'p99', x:tnames, y:toolE.map(e=>e[1].p99_ms/1000), marker:{color:'#f85149'} }, + ], L({ height:320, barmode:'group', yaxis:{title:'Seconds'}, xaxis:{automargin:true,tickangle:-30}, + margin:{t:20,r:20,b:100,l:60} }), CFG); + } else { + document.getElementById('toolLatencyBar').innerHTML = + '

No significant tool latency data

'; + } +} + +// ── TOKENS ─────────────────────────────────────────────────────────────────── +function renderTokens() { + const ts = DATA.token_stats || {}; + const bm = ts.by_model || {}; + const bc = ts.by_component || {}; + const spl = ts.isl_latency_sample || []; + const grw = ts.isl_growth || {}; + const sys = ts.sys_prompt_est || {}; + + const models = Object.keys(bm); + const colorOf = m => PALETTE[models.indexOf(m) % PALETTE.length]; + + // Stat grid + const totalPrompt = models.reduce((s,m) => s + (bm[m].total_isl||0), 0); + const totalComp = models.reduce((s,m) => s + (bm[m].total_osl||0), 0); + const totalCached = models.reduce((s,m) => s + (bm[m].total_cached||0), 0); + const totalCalls = models.reduce((s,m) => s + (bm[m].calls||0), 0); + const cacheRate = totalPrompt > 0 ? (totalCached/totalPrompt*100).toFixed(1) : '0'; + + document.getElementById('tokenStats').innerHTML = ` +
Total LLM Calls
+
${fmtK(totalCalls)}
+
Total Prompt
+
${fmtK(totalPrompt)}
+
ISL tokens
+
Total Completion
+
${fmtK(totalComp)}
+
OSL tokens
+
Total Cached
+
${fmtK(totalCached)}
+
${cacheRate}% cache rate
+
ISL:OSL Ratio
+
${totalComp > 0 ? (totalPrompt/totalComp).toFixed(1) : '\u2014'}:1
+ `; + + // ISL p50/p90/p99 by model + Plotly.newPlot('islBar', [ + { type:'bar', name:'p50', x:models, y:models.map(m=>bm[m].isl_p50||0), marker:{color:'#3fb950'} }, + { type:'bar', name:'p90', x:models, y:models.map(m=>bm[m].isl_p90||0), marker:{color:'#58a6ff'} }, + { type:'bar', name:'p99', x:models, y:models.map(m=>bm[m].isl_p99||0), marker:{color:'#f85149'} }, + ], L({ height:300, barmode:'group', yaxis:{title:'Tokens'}, xaxis:{automargin:true,tickangle:-25}, + margin:{t:20,r:20,b:90,l:70}, + annotations: models.map(m => ({ + x:m, y:bm[m].isl_max||0, text:'max '+fmtK(bm[m].isl_max||0), + showarrow:false, font:{size:9,color:'#8b949e'}, yshift:4, + })) + }), CFG); + + // OSL p50/p90/p99 by model + Plotly.newPlot('oslBar', [ + { type:'bar', name:'p50', x:models, y:models.map(m=>bm[m].osl_p50||0), marker:{color:'#3fb950'} }, + { type:'bar', name:'p90', x:models, y:models.map(m=>bm[m].osl_p90||0), marker:{color:'#58a6ff'} }, + { type:'bar', name:'p99', x:models, y:models.map(m=>bm[m].osl_p99||0), marker:{color:'#f85149'} }, + ], L({ height:300, barmode:'group', yaxis:{title:'Tokens'}, xaxis:{automargin:true,tickangle:-25}, + margin:{t:20,r:20,b:90,l:70} }), CFG); + + // ISL growth (context accumulation) + const growthTraces = Object.entries(grw).map(([model, pts], i) => ({ + type:'scatter', mode:'lines+markers', name: model, + x: pts.map(p=>p.idx), y: pts.map(p=>p.avg_isl), + line: { color: colorOf(model), width: 2 }, + marker: { size: 5, color: colorOf(model) }, + hovertemplate: model + '
Call #%{x}
Avg ISL: %{y:,.0f} tokens', + })); + // Dashed sys-prompt estimate lines + Object.entries(sys).forEach(([model, minIsl], i) => { + const maxIdx = Math.max(...((grw[model]||[{idx:0}]).map(p=>p.idx)), 10); + growthTraces.push({ + type:'scatter', mode:'lines', name: model+' sys-prompt est.', + x: [0, maxIdx], y: [minIsl, minIsl], + line: { color: colorOf(model), width: 1, dash: 'dot' }, + hovertemplate: 'Sys-prompt lower bound: ' + fmtK(minIsl) + ' tokens', + }); + }); + Plotly.newPlot('islGrowth', growthTraces, + L({ height:320, xaxis:{title:'Call index within query', dtick:5}, yaxis:{title:'Avg ISL (tokens)'}, + margin:{t:20,r:20,b:60,l:80}, + annotations:[{text:'Dashed = system-prompt lower bound (min ISL observed)', + x:0.01, y:0.97, xref:'paper', yref:'paper', showarrow:false, + font:{color:'#8b949e',size:10}}] + }), CFG); + + // TPS bar + const tpsSorted = models.map(m=>({m, tps:bm[m].tps_mean||0})).sort((a,b)=>b.tps-a.tps); + Plotly.newPlot('tpsBar', [{ + type:'bar', x:tpsSorted.map(d=>d.m), y:tpsSorted.map(d=>d.tps), + text:tpsSorted.map(d=>d.tps.toFixed(1)+' tok/s'), textposition:'outside', + marker:{color: tpsSorted.map((_,i) => PALETTE[i%PALETTE.length])}, + }], L({ height:300, yaxis:{title:'Completion tokens / second'}, + xaxis:{automargin:true,tickangle:-25}, margin:{t:20,r:20,b:90,l:70}, showlegend:false }), CFG); + + // Cache breakdown stacked bar + Plotly.newPlot('cacheBreakdown', [ + { + type:'bar', name:'Cached prompt', x:models, + y:models.map(m=>bm[m].total_cached||0), marker:{color:'#39d353'}, + }, + { + type:'bar', name:'Uncached prompt', x:models, + y:models.map(m=>(bm[m].total_isl||0)-(bm[m].total_cached||0)), + marker:{color:'#30363d'}, + }, + { + type:'bar', name:'Completion', x:models, + y:models.map(m=>bm[m].total_osl||0), marker:{color:'#58a6ff'}, + }, + ], L({ height:320, barmode:'stack', yaxis:{title:'Tokens'}, + xaxis:{automargin:true,tickangle:-25}, margin:{t:20,r:20,b:90,l:80} }), CFG); + + // ISL vs Latency scatter + const modelsSeen 
= [...new Set(spl.map(p=>p.model))]; + const scatterTraces = modelsSeen.map(m => { + const pts = spl.filter(p=>p.model===m); + return { + type:'scatter', mode:'markers', name:m, + x: pts.map(p=>p.isl), y: pts.map(p=>p.dur_s), + marker: { color: colorOf(m), size: 5, opacity: .55 }, + hovertemplate: 'ISL: %{x:,}
Latency: %{y:.2f}s'+m+'', + }; + }); + Plotly.newPlot('islLatencyScatter', scatterTraces, + L({ height:320, xaxis:{title:'Prompt tokens (ISL)'}, yaxis:{title:'Latency (s)'}, + margin:{t:20,r:20,b:60,l:70} }), CFG); + + // Component token stacked bar (by phase) + const comps = Object.keys(bc); + Plotly.newPlot('componentTokenStack', [ + { + type:'bar', name:'Prompt (uncached)', x:comps, + y:comps.map(c=>(bc[c].total_isl||0)-(bc[c].total_cached||0)), + marker:{color:'#30363d'}, + }, + { + type:'bar', name:'Prompt (cached)', x:comps, + y:comps.map(c=>bc[c].total_cached||0), marker:{color:'#39d353'}, + }, + { + type:'bar', name:'Completion', x:comps, + y:comps.map(c=>bc[c].total_osl||0), marker:{color:'#58a6ff'}, + }, + { + type:'bar', name:'Reasoning', x:comps, + y:comps.map(c=>bc[c].total_reasoning||0), marker:{color:'#bc8cff'}, + }, + ], L({ height:320, barmode:'stack', yaxis:{title:'Tokens'}, + xaxis:{automargin:true,tickangle:-25}, margin:{t:20,r:20,b:90,l:80} }), CFG); + + // Predicted vs Actual OSL scatter + const pva = ts.predicted_vs_actual || []; + const predCard = document.getElementById('predVsActualCard'); + // Hide when all predicted == actual (post-hoc filled, no predictive signal) + const hasRealPredictions = pva.some(p => p.predicted !== p.actual); + if (pva.length === 0 || !hasRealPredictions) { + if (predCard) predCard.style.display = 'none'; + } else { + const pvaModels = [...new Set(pva.map(p => p.model))]; + const pvaTraces = pvaModels.map(m => { + const pts = pva.filter(p => p.model === m); + const errs = pts.map(p => p.actual - p.predicted); + const pct = pts.map(p => p.predicted > 0 ? ((p.actual - p.predicted) / p.predicted * 100) : 0); + return { + type: 'scatter', mode: 'markers', name: m, + x: pts.map(p => p.predicted), y: pts.map(p => p.actual), + customdata: pts.map((p, i) => [errs[i], pct[i].toFixed(1)]), + marker: { color: colorOf(m), size: 5, opacity: .65 }, + hovertemplate: + 'Predicted: %{x:,}
Actual: %{y:,}
Error: %{customdata[0]:+,} ' + + '(%{customdata[1]}%)' + m + '', + }; + }); + // Perfect-prediction diagonal + const allVals = pva.flatMap(p => [p.predicted, p.actual]); + const axMax = Math.max(...allVals) * 1.05; + pvaTraces.push({ + type: 'scatter', mode: 'lines', name: 'Perfect prediction', + x: [0, axMax], y: [0, axMax], + line: { color: '#8b949e', width: 1, dash: 'dot' }, + hoverinfo: 'skip', + }); + Plotly.newPlot('predVsActualScatter', pvaTraces, + L({ height: 360, xaxis: {title: 'NOVA Predicted OSL (tokens)', range: [0, axMax]}, + yaxis: {title: 'Actual OSL (tokens)', range: [0, axMax]}, + margin: {t:20,r:20,b:60,l:80} }), CFG); + } + + // Token summary table + const tbody = document.querySelector('#tokenTable tbody'); + tbody.innerHTML = models.map(m => { + const s = bm[m]; + const est = sys[m]; + return ` + ${m} + ${(s.calls||0).toLocaleString()} + ${fmtK(s.isl_mean||0)} + ${fmtK(s.isl_p90||0)} + ${fmtK(s.isl_max||0)} + ${fmtK(s.osl_mean||0)} + ${fmtK(s.osl_p90||0)} + ${fmtK(s.osl_max||0)} + ${fmtK(s.total_isl||0)} + ${fmtK(s.total_osl||0)} + ${fmtK(s.total_cached||0)} + ${((s.cache_rate||0)*100).toFixed(1)}% + ${(s.tps_mean||0).toFixed(1)} tok/s + ~${ + est != null ? 
fmtK(est) : 'N/A'} + `; + }).join(''); +} + +// ── EFFICIENCY ──────────────────────────────────────────────────────────────── +function renderEfficiency() { + const d = DATA; + const bm = (d.token_stats || {}).by_model || {}; + const ll = d.llm_latency || {}; + const spl = (d.token_stats || {}).isl_latency_sample || []; + const models = Object.keys(bm); + + // Per-query latency vs cost scatter + const pq = d.per_query || []; + const costs = pq.map(q => q.cost_usd || 0); + const durs = pq.map(q => q.duration_s || 0); + Plotly.newPlot('latCostScatter', [{ + type: 'scatter', mode: 'markers+text', + x: durs, y: costs, + text: pq.map(q => 'Q' + q.id), + textposition: 'top center', + textfont: { size: 10, color: '#8b949e' }, + marker: { + color: costs, + colorscale: 'Viridis', + size: 10, opacity: .8, + colorbar: { title: 'Cost ($)', thickness: 12, len: .7 }, + }, + hovertemplate: 'Query %{text}
Duration: %{x:.1f}s
Cost: $%{y:.4f}', + }], L({ height: 380, xaxis: {title: 'Workflow duration (s)'}, + yaxis: {title: 'Total cost (USD)'}, + margin: {t:20,r:80,b:60,l:70}, showlegend: false }), CFG); + + // TPS vs ISL scatter (from isl_latency_sample) + const modelsSeen = [...new Set(spl.map(p => p.model))]; + const tpsIslTraces = modelsSeen.map(m => { + const pts = spl.filter(p => p.model === m && p.dur_s > 0 && p.osl > 0); + return { + type: 'scatter', mode: 'markers', name: m, + x: pts.map(p => p.isl), + y: pts.map(p => p.osl / p.dur_s), + marker: { color: PALETTE[models.indexOf(m) % PALETTE.length], size: 5, opacity: .55 }, + hovertemplate: 'ISL: %{x:,}
TPS: %{y:.1f}' + m + '', + }; + }); + Plotly.newPlot('tpsIslScatter', tpsIslTraces, + L({ height: 340, xaxis: {title: 'Prompt tokens (ISL)'}, + yaxis: {title: 'Completion tokens / second (TPS)'}, + margin: {t:20,r:20,b:60,l:70} }), CFG); + + // Effective cost per 1K output tokens + const cpk = models.map(m => ({ + m, + val: bm[m].total_osl > 0 ? (d.by_model[m] || 0) / (bm[m].total_osl / 1000) : 0, + })).sort((a, b) => b.val - a.val); + Plotly.newPlot('costPerKOslBar', [{ + type: 'bar', x: cpk.map(d => d.m), y: cpk.map(d => d.val), + text: cpk.map(d => '$' + d.val.toFixed(4)), textposition: 'outside', + marker: { color: cpk.map((_, i) => PALETTE[i % PALETTE.length]) }, + }], L({ height: 300, yaxis: {title: '$ per 1K completion tokens'}, + xaxis: {automargin: true, tickangle: -25}, + margin: {t:20,r:20,b:90,l:80}, showlegend: false }), CFG); + + // Model efficiency: output cost vs p90 latency bubble + const effModels = models.filter(m => bm[m].total_osl > 0 && ll[m]); + if (effModels.length > 0) { + const cpkMap = Object.fromEntries(cpk.map(d => [d.m, d.val])); + Plotly.newPlot('modelEfficiencyScatter', [{ + type: 'scatter', mode: 'markers+text', + x: effModels.map(m => (ll[m].p90_ms || 0) / 1000), + y: effModels.map(m => cpkMap[m] || 0), + text: effModels.map(m => m.split('/').pop()), + textposition: 'top center', + textfont: { size: 11 }, + marker: { + size: effModels.map(m => Math.max(14, Math.min(50, (bm[m].calls || 0) / 5))), + color: effModels.map((_, i) => PALETTE[i % PALETTE.length]), + opacity: .8, line: {width: 1, color: '#30363d'}, + }, + hovertemplate: effModels.map(m => + '' + m + '
p90 latency: ' + ((ll[m].p90_ms||0)/1000).toFixed(1) + 's
' + + 'Cost/1K out: $' + (cpkMap[m]||0).toFixed(4) + '
Calls: ' + (bm[m].calls||0) + ''), + }], L({ height: 380, xaxis: {title: 'p90 LLM latency (s)'}, + yaxis: {title: '$ per 1K completion tokens'}, + margin: {t:20,r:20,b:60,l:80}, showlegend: false, + annotations: [{text: 'Bubble size = call count. Bottom-left = cheapest + fastest.', + x: .01, y: .99, xref: 'paper', yref: 'paper', showarrow: false, + font: {color: '#8b949e', size: 10}}] }), CFG); + } else { + document.getElementById('modelEfficiencyScatter').innerHTML = + '

Not enough model diversity for comparison

'; + } +} + +// ── PER-QUERY DETAIL ────────────────────────────────────────────────────────── +function renderDetail() { + const tbody = document.querySelector('#detailTable tbody'); + tbody.innerHTML = DATA.per_query.map(q => { + const isl = q.input_tokens||0, osl = q.output_tokens||0; + const ratio = osl > 0 ? (isl/osl).toFixed(1)+':1' : '\u2014'; + const qtxt = q.question ? q.question.substring(0,120)+(q.question.length>120?'\u2026':'') : '\u2014'; + return ` + ${q.id} + ${fmt$(q.cost_usd)} + ${isl.toLocaleString()} + ${osl.toLocaleString()} + ${(q.cached_tokens||0).toLocaleString()} + ${ratio} + ${q.entry_count||0} + ${(q.duration_s||0).toFixed(1)} + ${qtxt} + `; + }).join(''); +} +""" + +_HTML = build_html( + title="AIQ Tokenomics Report", + tab_html=_TAB_HTML, + js_data_extras="", + js_extra_globals="", + js_init=_JS_INIT, + js_renders=_JS_RENDERS, +) + + +def render_html(report_data: dict) -> str: + return _HTML.replace("__REPORT_DATA_JSON__", json.dumps(report_data, ensure_ascii=False)) diff --git a/tests/tokenomics/test_nat_adapter.py b/tests/tokenomics/test_nat_adapter.py new file mode 100644 index 00000000..e7d38df6 --- /dev/null +++ b/tests/tokenomics/test_nat_adapter.py @@ -0,0 +1,193 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""Tests for NAT trace parsing and phase inference. 
+ +Module under test: src/aiq_agent/tokenomics/nat_adapter.py +""" + +import json +from pathlib import Path + +import pytest + +from aiq_agent.tokenomics.nat_adapter import _build_task_windows +from aiq_agent.tokenomics.nat_adapter import _extract_subagent_type +from aiq_agent.tokenomics.nat_adapter import _infer_phase +from aiq_agent.tokenomics.nat_adapter import _TaskWindow +from aiq_agent.tokenomics.nat_adapter import parse_trace +from aiq_agent.tokenomics.pricing import PricingRegistry +from aiq_agent.tokenomics.profile import PHASE_ORCHESTRATOR +from aiq_agent.tokenomics.profile import PHASE_PLANNER +from aiq_agent.tokenomics.profile import PHASE_RESEARCHER + + +def _payload( + event_type: str, + ts: float, + uuid: str = "step-uuid", + **kwargs: object, +) -> dict: + p: dict = {"event_type": event_type, "event_timestamp": ts, "UUID": uuid} + p.update(kwargs) + return {"payload": p} + + +@pytest.mark.parametrize( + "raw,expected", + [ + ({"subagent_type": "planner-agent"}, "planner-agent"), + ({"subagent_type": "researcher-agent"}, "researcher-agent"), + ("{'subagent_type': 'planner-agent', 'description': 'x'}", "planner-agent"), + ("malformed but researcher-agent string", "researcher-agent"), + ], +) +def test_extract_subagent_type(raw, expected): + assert _extract_subagent_type(raw) == expected + + +def test_extract_subagent_type_none(): + assert _extract_subagent_type(None) is None + assert _extract_subagent_type({}) is None + assert _extract_subagent_type("no marker here") is None + + +def test_build_task_windows_closes_pairs(): + steps = [ + _payload( + "TOOL_START", + 10.0, + uuid="t1", + name="task", + data={"input": {"subagent_type": "planner-agent"}}, + ), + _payload("TOOL_END", 20.0, uuid="t1", name="task"), + ] + wins = _build_task_windows(steps) + assert len(wins) == 1 + assert wins[0].subagent_type == "planner-agent" + assert wins[0].start_ts == 10.0 + assert wins[0].end_ts == 20.0 + + +def test_build_task_windows_string_input(): + steps = [ + 
_payload( + "TOOL_START", + 1.0, + uuid="u", + name="task", + data={"input": "{'subagent_type': 'researcher-agent'}"}, + ), + _payload("TOOL_END", 2.0, uuid="u", name="task"), + ] + wins = _build_task_windows(steps) + assert len(wins) == 1 + assert wins[0].phase == PHASE_RESEARCHER + + +def test_infer_phase_orchestrator_outside_windows(): + wins = [_TaskWindow(uuid="a", subagent_type="planner-agent", start_ts=10.0, end_ts=20.0)] + assert _infer_phase(5.0, wins) == PHASE_ORCHESTRATOR + assert _infer_phase(25.0, wins) == PHASE_ORCHESTRATOR + + +def test_infer_phase_inside_window(): + wins = [_TaskWindow(uuid="a", subagent_type="planner-agent", start_ts=10.0, end_ts=20.0)] + assert _infer_phase(15.0, wins) == PHASE_PLANNER + + +def test_infer_phase_first_match_wins_on_overlap(): + planner = _TaskWindow(uuid="p", subagent_type="planner-agent", start_ts=10.0, end_ts=25.0) + researcher = _TaskWindow(uuid="r", subagent_type="researcher-agent", start_ts=15.0, end_ts=30.0) + ts = 18.0 + assert _infer_phase(ts, [planner, researcher]) == PHASE_PLANNER + assert _infer_phase(ts, [researcher, planner]) == PHASE_RESEARCHER + + +def _minimal_pricing() -> PricingRegistry: + return PricingRegistry.from_dict( + { + "models": { + "test-model": { + "input_per_1m_tokens": 1.0, + "output_per_1m_tokens": 2.0, + }, + }, + "default": {"input_per_1m_tokens": 1.0, "output_per_1m_tokens": 2.0}, + "tools": {}, + } + ) + + +def _llm_end(ts: float, uuid: str, span_ts: float | None = None) -> dict: + body = { + "event_type": "LLM_END", + "event_timestamp": ts, + "UUID": uuid, + "name": "test-model", + "usage_info": { + "token_usage": { + "prompt_tokens": 1000, + "cached_tokens": 0, + "completion_tokens": 500, + }, + }, + } + if span_ts is not None: + body["span_event_timestamp"] = span_ts + return {"payload": body} + + +def test_parse_trace_end_to_end(tmp_path: Path): + """Orchestrator LLM outside task; planner LLM inside task window; one tool call.""" + steps = [ + _payload("WORKFLOW_START", 
100.0, uuid="w0", data={"input": "my question?"}), + _llm_end(101.0, "l0", span_ts=100.5), + _payload( + "TOOL_START", + 102.0, + uuid="task1", + name="task", + data={"input": {"subagent_type": "planner-agent"}}, + ), + _llm_end(103.0, "l1", span_ts=102.5), + _payload("TOOL_START", 103.5, uuid="tool-a", name="search_tool"), + _payload("TOOL_END", 104.0, uuid="tool-a", name="search_tool"), + _payload("TOOL_END", 105.0, uuid="task1", name="task"), + _payload("WORKFLOW_END", 106.0, uuid="w1"), + ] + trace_path = tmp_path / "trace.json" + trace_path.write_text(json.dumps([{"request_number": 0, "intermediate_steps": steps}]), encoding="utf-8") + + profiles = parse_trace(str(trace_path), _minimal_pricing()) + assert len(profiles) == 1 + prof = profiles[0] + assert prof.request_index == 0 + assert prof.question == "my question?" + assert prof.duration_s == pytest.approx(6.0) + assert prof.total_llm_calls == 2 + assert prof.tool_calls.get("search_tool") == 1 + + orch = [p for p in prof.phases if p.phase == PHASE_ORCHESTRATOR] + plan = [p for p in prof.phases if p.phase == PHASE_PLANNER] + assert len(orch) == 1 and orch[0].llm_calls == 1 + assert len(plan) == 1 and plan[0].llm_calls == 1 + + assert prof.llm_call_events[0]["phase"] == PHASE_ORCHESTRATOR + assert prof.llm_call_events[1]["phase"] == PHASE_PLANNER + + +def test_parse_trace_skips_broken_request(tmp_path: Path): + bad = [{"request_number": 0, "intermediate_steps": "not-a-list"}] + good_steps = [ + _payload("WORKFLOW_START", 1.0, data={"input": ""}), + _payload("WORKFLOW_END", 2.0), + ] + good = [{"request_number": 1, "intermediate_steps": good_steps}] + trace_path = tmp_path / "trace.json" + trace_path.write_text(json.dumps(bad + good), encoding="utf-8") + + profiles = parse_trace(str(trace_path), _minimal_pricing()) + assert len(profiles) == 1 + assert profiles[0].request_index == 1 diff --git a/tests/tokenomics/test_pricing.py b/tests/tokenomics/test_pricing.py new file mode 100644 index 00000000..1dc05af5 --- 
/dev/null +++ b/tests/tokenomics/test_pricing.py @@ -0,0 +1,136 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""Tests for tokenomics pricing (ModelPrice, PricingRegistry). + +Module under test: src/aiq_agent/tokenomics/pricing.py +""" + +import pytest + +from aiq_agent.tokenomics.pricing import ModelPrice +from aiq_agent.tokenomics.pricing import ModelPriceConfig +from aiq_agent.tokenomics.pricing import PricingRegistry +from aiq_agent.tokenomics.pricing import PricingRegistryConfig +from aiq_agent.tokenomics.pricing import ToolPriceConfig + + +def test_model_price_cost_uncached_and_completion(): + mp = ModelPrice( + input_per_1m_tokens=1.0, + cached_input_per_1m_tokens=0.5, + output_per_1m_tokens=4.0, + ) + # 1M prompt (all uncached) + 500k completion + assert mp.cost(1_000_000, 0, 500_000) == pytest.approx(1.0 + 2.0) + + +def test_model_price_cost_with_cache_split(): + mp = ModelPrice( + input_per_1m_tokens=2.0, + cached_input_per_1m_tokens=0.5, + output_per_1m_tokens=1.0, + ) + # 800k prompt, 300k cached -> 500k uncached + assert mp.cost(800_000, 300_000, 100_000) == pytest.approx( + (500_000 * 2.0 + 300_000 * 0.5 + 100_000 * 1.0) / 1_000_000 + ) + + +def test_model_price_cache_savings(): + mp = ModelPrice( + input_per_1m_tokens=2.0, + cached_input_per_1m_tokens=0.5, + output_per_1m_tokens=1.0, + ) + assert mp.cache_savings(400_000) == pytest.approx(400_000 * (2.0 - 0.5) / 1_000_000) + + +def test_pricing_registry_from_config_cached_defaults_to_input(): + reg = PricingRegistry.from_config( + PricingRegistryConfig( + models={ + "m": ModelPriceConfig( + input_per_1m_tokens=1.0, + output_per_1m_tokens=2.0, + ), + }, + ) + ) + mp = reg.get("m") + assert mp.cached_input_per_1m_tokens == 1.0 + + +def test_pricing_registry_get_exact_match(): + reg = PricingRegistry.from_dict( + { + "models": { + "azure/openai/gpt-5.2": { + "input_per_1m_tokens": 1.0, + 
"output_per_1m_tokens": 2.0, + }, + }, + "default": {"input_per_1m_tokens": 9.0, "output_per_1m_tokens": 9.0}, + } + ) + assert reg.get("azure/openai/gpt-5.2").input_per_1m_tokens == 1.0 + + +def test_pricing_registry_get_substring_match(): + reg = PricingRegistry.from_dict( + { + "models": { + "gpt-5.2": {"input_per_1m_tokens": 1.5, "output_per_1m_tokens": 3.0}, + }, + "default": {"input_per_1m_tokens": 9.0, "output_per_1m_tokens": 9.0}, + } + ) + p = reg.get("azure/openai/gpt-5.2") + assert p.input_per_1m_tokens == 1.5 + + +def test_pricing_registry_get_fallback_default(): + reg = PricingRegistry.from_dict( + { + "models": {}, + "default": {"input_per_1m_tokens": 0.5, "output_per_1m_tokens": 1.5}, + } + ) + assert reg.get("any/model").output_per_1m_tokens == 1.5 + + +def test_pricing_registry_get_missing_raises(): + reg = PricingRegistry.from_dict({"models": {}}) + with pytest.raises(KeyError, match="No price configured"): + reg.get("unknown/model") + + +def test_pricing_registry_get_tool_exact_and_substring(): + reg = PricingRegistry.from_dict( + { + "models": {"m": {"input_per_1m_tokens": 1.0, "output_per_1m_tokens": 1.0}}, + "tools": { + "paper_search": ToolPriceConfig(cost_per_call=0.0003), + }, + } + ) + assert reg.get_tool("paper_search").cost_per_call == pytest.approx(0.0003) + assert reg.get_tool("my_paper_search_tool").cost_per_call == pytest.approx(0.0003) + + +def test_pricing_registry_get_tool_unknown_is_zero(): + reg = PricingRegistry.from_dict( + {"models": {"m": {"input_per_1m_tokens": 1.0, "output_per_1m_tokens": 1.0}}, "tools": {}} + ) + assert reg.get_tool("no_such_tool").cost_per_call == 0.0 + + +def test_pricing_registry_known_models_and_tools(): + reg = PricingRegistry.from_dict( + { + "models": {"a": {"input_per_1m_tokens": 1.0, "output_per_1m_tokens": 1.0}}, + "tools": {"t": {"cost_per_call": 0.01}}, + } + ) + assert reg.known_models() == ["a"] + assert reg.known_tools() == ["t"] diff --git a/tests/tokenomics/test_profile.py 
b/tests/tokenomics/test_profile.py new file mode 100644 index 00000000..aac889b5 --- /dev/null +++ b/tests/tokenomics/test_profile.py @@ -0,0 +1,99 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""Tests for tokenomics profile dataclasses. + +Module under test: src/aiq_agent/tokenomics/profile.py +""" + +from aiq_agent.tokenomics.profile import PHASE_ORCHESTRATOR +from aiq_agent.tokenomics.profile import PHASE_PLANNER +from aiq_agent.tokenomics.profile import PHASE_RESEARCHER +from aiq_agent.tokenomics.profile import PhaseStats +from aiq_agent.tokenomics.profile import RequestProfile + + +def test_phase_stats_derived_fields(): + ps = PhaseStats( + phase=PHASE_ORCHESTRATOR, + model="m", + prompt_tokens=100, + cached_tokens=25, + completion_tokens=50, + ) + assert ps.uncached_tokens == 75 + assert ps.cache_hit_rate == 0.25 + assert ps.total_tokens == 150 + + +def test_phase_stats_cache_hit_rate_zero_prompt(): + ps = PhaseStats(phase=PHASE_ORCHESTRATOR, model="m", prompt_tokens=0, cached_tokens=0) + assert ps.cache_hit_rate == 0.0 + + +def test_request_profile_grand_total_and_cache_rate(): + prof = RequestProfile( + request_index=0, + question="q", + duration_s=1.0, + total_cost_usd=10.0, + total_tool_cost_usd=2.5, + total_prompt_tokens=200, + total_cached_tokens=50, + total_completion_tokens=100, + phases=[], + ) + assert prof.grand_total_cost_usd == 12.5 + assert prof.cache_hit_rate == 0.25 + + +def test_request_profile_phases_for_and_cost(): + prof = RequestProfile( + request_index=0, + question="q", + duration_s=1.0, + phases=[ + PhaseStats(phase=PHASE_PLANNER, model="a", cost_usd=1.0, prompt_tokens=10), + PhaseStats(phase=PHASE_RESEARCHER, model="b", cost_usd=3.0, prompt_tokens=20), + PhaseStats(phase=PHASE_RESEARCHER, model="c", cost_usd=2.0, prompt_tokens=30), + ], + ) + assert len(prof.phases_for(PHASE_RESEARCHER)) == 2 + assert 
prof.cost_for_phase(PHASE_PLANNER) == 1.0 + assert prof.cost_for_phase(PHASE_RESEARCHER) == 5.0 + + +def test_request_profile_tokens_for_phase(): + prof = RequestProfile( + request_index=0, + question="q", + duration_s=1.0, + phases=[ + PhaseStats( + phase=PHASE_RESEARCHER, + model="b", + prompt_tokens=100, + cached_tokens=40, + completion_tokens=60, + ), + PhaseStats( + phase=PHASE_RESEARCHER, + model="c", + prompt_tokens=50, + cached_tokens=10, + completion_tokens=20, + ), + ], + ) + p, c, o = prof.tokens_for_phase(PHASE_RESEARCHER) + assert (p, c, o) == (150, 50, 80) + + +def test_request_profile_total_tool_calls(): + prof = RequestProfile( + request_index=0, + question="q", + duration_s=1.0, + tool_calls={"a": 2, "b": 5}, + ) + assert prof.total_tool_calls == 7 diff --git a/tests/tokenomics/test_report_builders.py b/tests/tokenomics/test_report_builders.py new file mode 100644 index 00000000..b40f945b --- /dev/null +++ b/tests/tokenomics/test_report_builders.py @@ -0,0 +1,143 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""Tests for tokenomics report data builders. 
+ +Module under test: src/aiq_agent/tokenomics/report/_report_builders.py +""" + +from aiq_agent.tokenomics.pricing import PricingRegistry +from aiq_agent.tokenomics.profile import PHASE_ORCHESTRATOR +from aiq_agent.tokenomics.profile import PhaseStats +from aiq_agent.tokenomics.profile import RequestProfile +from aiq_agent.tokenomics.report._report_builders import _build_comparison_data +from aiq_agent.tokenomics.report._report_builders import _build_report_data + + +def _minimal_pricing() -> PricingRegistry: + return PricingRegistry.from_dict( + { + "models": { + "m1": { + "input_per_1m_tokens": 1.0, + "output_per_1m_tokens": 2.0, + }, + }, + "tools": {"search": {"cost_per_call": 0.001}}, + "default": None, + } + ) + + +def _minimal_profile(**overrides) -> RequestProfile: + base = dict( + request_index=0, + question="hello", + duration_s=3.0, + phases=[ + PhaseStats( + phase=PHASE_ORCHESTRATOR, + model="m1", + llm_calls=1, + prompt_tokens=1000, + cached_tokens=0, + completion_tokens=100, + cost_usd=0.002, + cache_savings_usd=0.0, + ), + ], + tool_calls={"search": 1}, + llm_call_events=[ + { + "uuid": "llm-1", + "isl": 1000, + "osl": 100, + "cached": 0, + "reasoning": 0, + "dur_s": 2.0, + "tps": 50.0, + "model": "m1", + "phase": PHASE_ORCHESTRATOR, + "call_idx": 0, + }, + ], + tool_call_events=[ + {"tool": "search", "dur_s": 0.5, "cost_usd": 0.001}, + ], + total_llm_calls=1, + total_prompt_tokens=1000, + total_cached_tokens=0, + total_completion_tokens=100, + total_cost_usd=0.002, + total_tool_cost_usd=0.001, + total_cache_savings_usd=0.0, + ) + base.update(overrides) + return RequestProfile(**base) + + +def test_build_report_data_totals_and_per_query(): + pricing = _minimal_pricing() + prof = _minimal_profile() + rd = _build_report_data([prof], pricing, "/tmp/pricing.yml") + + assert rd["num_queries"] == 1 + assert rd["total_llm_calls"] == 1 + assert rd["total_prompt_tokens"] == 1000 + assert rd["total_completion_tokens"] == 100 + assert rd["llm_cost_usd"] == 
0.002 + assert rd["tool_cost_usd"] == 0.001 + assert rd["total_cost_usd"] == 0.003 + assert rd["by_model"]["m1"] == 0.002 + assert "Orchestrator" in rd["by_phase"] + assert rd["per_query"][0]["id"] == 0 + assert rd["per_query"][0]["question"] == "hello" + assert rd["token_stats"]["by_model"]["m1"]["calls"] == 1 + assert rd["llm_latency"]["m1"]["count"] == 1 + assert rd["tool_latency"]["search"]["count"] == 1 + + +def test_build_report_data_predicted_vs_actual(): + pricing = _minimal_pricing() + prof = _minimal_profile() + pred = {"llm-1": 99.0} + rd = _build_report_data([prof], pricing, "/tmp/x.yml", predicted_osl_map=pred) + pva = rd["token_stats"]["predicted_vs_actual"] + assert len(pva) == 1 + assert pva[0]["predicted"] == 99.0 + assert pva[0]["actual"] == 100 + + +def test_build_comparison_data_aligned_queries(): + pricing = _minimal_pricing() + prof = _minimal_profile(request_index=1) + a = _build_report_data([prof], pricing, "/tmp/a.yml") + b = _build_report_data([prof], pricing, "/tmp/b.yml") + a["label"] = "run_a" + b["label"] = "run_b" + b["total_cost_usd"] = a["total_cost_usd"] + 0.01 + + cmp = _build_comparison_data([a, b]) + assert cmp["label_a"] == "run_a" + assert cmp["label_b"] == "run_b" + assert cmp["num_common_queries"] == 1 + assert cmp["num_queries_a"] == 1 + assert cmp["num_queries_b"] == 1 + assert cmp["cost_delta"] == 0.01 + assert len(cmp["per_query"]) == 1 + assert cmp["per_query"][0]["in_both"] is True + assert cmp["per_query"][0]["cost_delta"] is not None + + +def test_build_comparison_data_union_when_ids_differ(): + pricing = _minimal_pricing() + a = _build_report_data([_minimal_profile(request_index=1)], pricing, "/a.yml") + b = _build_report_data([_minimal_profile(request_index=2)], pricing, "/b.yml") + a["label"] = "a" + b["label"] = "b" + + cmp = _build_comparison_data([a, b]) + assert cmp["num_common_queries"] == 0 + assert len(cmp["per_query"]) == 2 + both_flags = {row["in_both"] for row in cmp["per_query"]} + assert both_flags == 
{False} diff --git a/tests/tokenomics/test_report_stats.py b/tests/tokenomics/test_report_stats.py new file mode 100644 index 00000000..b48beeed --- /dev/null +++ b/tests/tokenomics/test_report_stats.py @@ -0,0 +1,67 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""Tests for tokenomics report statistics helpers. + +Module under test: src/aiq_agent/tokenomics/report/_report_stats.py +""" + +from aiq_agent.tokenomics.report._report_stats import _latency_stats +from aiq_agent.tokenomics.report._report_stats import _load_csv_predictions +from aiq_agent.tokenomics.report._report_stats import _pct + + +def test_pct_empty(): + assert _pct([], 50) == 0.0 + + +def test_pct_single_value(): + assert _pct([7.0], 50) == 7.0 + assert _pct([7.0], 99) == 7.0 + + +def test_pct_two_values_median(): + # k = 0.5 → linear interpolation between sorted[0] and sorted[1] + assert _pct([10.0, 20.0], 50) == 15.0 + + +def test_pct_sorted_order_irrelevant(): + assert _pct([30.0, 10.0, 20.0], 50) == 20.0 + + +def test_latency_stats_empty(): + assert _latency_stats([]) == { + "count": 0, + "p50_ms": 0.0, + "p90_ms": 0.0, + "p99_ms": 0.0, + "max_ms": 0.0, + "mean_ms": 0.0, + } + + +def test_latency_stats_non_empty(): + out = _latency_stats([0.1, 0.2]) + assert out["count"] == 2 + assert out["mean_ms"] == 150.0 + assert out["max_ms"] == 200.0 + assert out["p50_ms"] == 150.0 + + +def test_load_csv_predictions_missing_file(tmp_path): + trace = tmp_path / "all_requests_profiler_traces.json" + trace.write_text("[]") + assert _load_csv_predictions(str(trace)) == {} + + +def test_load_csv_predictions_parses_llm_start_rows(tmp_path): + sub = tmp_path / "run" + sub.mkdir() + trace = sub / "all_requests_profiler_traces.json" + trace.write_text("[]") + csv_file = sub / "standardized_data_all.csv" + csv_file.write_text( + 
"event_type,UUID,NOVA-Predicted-OSL\nLLM_START,u-1,12.5\nTOOL_START,u-2,99\nLLM_START,u-3,not_a_float\n" + ) + got = _load_csv_predictions(str(trace)) + assert got == {"u-1": 12.5}