fix: critical issues - resource leak, security vulnerability, and tok… #6
Conversation
…en counting accuracy

- Fix HTTP client resource leak by implementing a FastAPI lifespan manager
- Fix security vulnerability in authorization header parsing (case-insensitive)
- Fix incorrect timestamps: use the current time instead of file modification time
- Fix silent JSON parsing failures with proper logging
- Fix token counting to include tool_calls and system prompts
- Add a comprehensive test script for token counting validation

Resolves:

- Memory leaks in long-running deployments
- Potential authorization bypass vulnerability
- Inaccurate token usage reporting (20–50% underestimation)
- Debugging difficulties due to silent failures
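As a rough illustration of the timestamp and JSON-coercion fixes listed above, here is a minimal sketch; the helper names and logger setup are illustrative, not the actual main.py code.

```python
import json
import logging
import time

logger = logging.getLogger(__name__)

def response_created_timestamp() -> int:
    # Use the current time for response timestamps rather than a file's mtime.
    return int(time.time())

def coerce_json(raw: str, default=None):
    # Log parse failures instead of failing silently, then fall back to a default.
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        logger.warning("Failed to parse JSON payload: %s", exc)
        return default
```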
Pull Request Overview
Fixes multiple production issues: HTTP client lifecycle leaks, case-insensitive Authorization handling, timestamp correctness, more robust JSON coercion, and significantly improved token counting (including tool_calls). Also adds a script to validate token counting.
- Implement FastAPI lifespan to initialize/close a shared httpx.AsyncClient (see the sketch after this list)
- Harden Authorization header parsing and improve token counting for tool_calls and system prompts
- Replace file mtime timestamps with current time; improve JSON coercion logs; add token counting test script
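A minimal sketch of that lifespan pattern, assuming the shared client is stored on `app.state`; this illustrates the approach rather than reproducing the actual main.py code, and the timeout value is arbitrary.

```python
from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Create one shared client at startup and close it at shutdown,
    # so connections are pooled instead of leaking per request.
    app.state.http_client = httpx.AsyncClient(timeout=60.0)
    try:
        yield
    finally:
        await app.state.http_client.aclose()

app = FastAPI(lifespan=lifespan)
```

Request handlers can then reuse `request.app.state.http_client` instead of constructing a new client per call.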
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| test_token_counting.py | Adds a script to exercise token counting across scenarios and models. |
| main.py | Introduces lifespan-managed http client, strengthens Authorization parsing, improves token counting (tool_calls and timing), fixes timestamps, and enhances JSON coercion logging. |
```python
# Count content tokens
content_tokens = token_counter.count_text_tokens(completion_text, body.model) if completion_text else 0

# Count tool_calls tokens if present
tool_calls_tokens = 0
if completion_tool_calls:
    # Create a temporary message to count tool_calls tokens
    temp_msg = {"role": "assistant", "tool_calls": completion_tool_calls}
    # Count just the tool_calls part
    tool_calls_tokens = token_counter.count_tokens([temp_msg], body.model)
    # Subtract the message overhead to get just tool_calls tokens
    tool_calls_tokens -= 3  # Remove assistant priming overhead
    logger.debug(f"🔧 Counted {tool_calls_tokens} tokens for {len(completion_tool_calls)} tool calls")

estimated_completion_tokens = content_tokens + tool_calls_tokens
```
tool_calls token isolation subtracts only the reply priming (+3) but not the per-message overhead (3 for most models, 4 for gpt-3.5-turbo/gpt-35-turbo). This overestimates tool_calls tokens by 3–4 tokens. Subtract both the per-message overhead and the reply priming, e.g. compute `msg_overhead = 4 if model.startswith(('gpt-3.5-turbo', 'gpt-35-turbo')) else 3`, then do `tool_calls_tokens -= (msg_overhead + 3)`. Even better, add a TokenCounter helper that returns these overheads to avoid duplicating model-specific logic.
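A possible shape for that helper, as a sketch only — the method name is hypothetical and would live on the existing TokenCounter class:

```python
def per_message_overhead(self, model: str) -> int:
    """Per-message token overhead of the chat completion format."""
    # gpt-3.5-turbo variants pay 4 tokens per message; most other chat models pay 3.
    return 4 if model.startswith(("gpt-3.5-turbo", "gpt-35-turbo")) else 3
```

The isolation step would then become `tool_calls_tokens -= token_counter.per_message_overhead(body.model) + 3`.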
```python
tool_calls_tokens -= 3  # Remove assistant priming overhead
logger.debug(f"🔧 Counted {tool_calls_tokens} tokens for {len(completion_tool_calls)} tool calls in stream")
```
Same issue as non-stream path: only the reply priming (+3) is removed, but the per-message overhead isn't. This inflates tool_calls token counts by 3–4 tokens. Subtract both the per-message overhead (3 for most models, 4 for gpt-3.5-turbo/gpt-35-turbo) and the +3 reply priming.
Suggested change (replacing the two lines above):

```python
# Subtract both the per-message overhead and the reply priming (+3).
# Most models: per-message overhead is 3; gpt-3.5-turbo/gpt-35-turbo: 4.
per_message_overhead = 4 if body.model.startswith(("gpt-3.5-turbo", "gpt-35-turbo")) else 3
tool_calls_tokens -= (per_message_overhead + 3)
logger.debug(f"🔧 Counted {tool_calls_tokens} tokens for {len(completion_tool_calls)} tool calls in stream (subtracted {per_message_overhead}+3 overhead)")
```
```python
import os

# Add parent directory to path
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
```
The comment says 'Add parent directory', but the code adds the script's own directory. Either update the comment or actually add the parent with `os.path.dirname(os.path.dirname(os.path.abspath(__file__)))`.
Suggested change:

```python
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
```
```python
# Count content tokens
content_tokens = token_counter.count_text_tokens(completion_text, body.model) if completion_text else 0

# Count tool_calls tokens if present
tool_calls_tokens = 0
if completion_tool_calls:
    # Create a temporary message to count tool_calls tokens
    temp_msg = {"role": "assistant", "tool_calls": completion_tool_calls}
    # Count just the tool_calls part
    tool_calls_tokens = token_counter.count_tokens([temp_msg], body.model)
    # Subtract the message overhead to get just tool_calls tokens
    tool_calls_tokens -= 3  # Remove assistant priming overhead
    logger.debug(f"🔧 Counted {tool_calls_tokens} tokens for {len(completion_tool_calls)} tool calls")

estimated_completion_tokens = content_tokens + tool_calls_tokens
```
The logic to estimate completion tokens (content + tool_calls with model-specific overhead corrections) is duplicated here and in the streaming path. Consider extracting a single helper (e.g., `TokenCounter.count_tool_calls_tokens(tool_calls, model)` and a shared `estimate_completion_tokens(content, tool_calls, model)`) to centralize the model-specific overhead handling and avoid divergence.
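One way such helpers might look, building on the hypothetical `per_message_overhead()` sketched earlier; these signatures follow the suggestion and are a sketch, not existing code:

```python
def count_tool_calls_tokens(self, tool_calls: list, model: str) -> int:
    """Tokens attributable to tool_calls alone, with message overheads removed."""
    if not tool_calls:
        return 0
    temp_msg = {"role": "assistant", "tool_calls": tool_calls}
    total = self.count_tokens([temp_msg], model)
    # Strip the per-message overhead and the +3 reply priming that count_tokens adds.
    return max(total - self.per_message_overhead(model) - 3, 0)

def estimate_completion_tokens(self, content: str, tool_calls: list, model: str) -> int:
    """Shared estimate for both the streaming and non-streaming paths."""
    content_tokens = self.count_text_tokens(content, model) if content else 0
    return content_tokens + self.count_tool_calls_tokens(tool_calls, model)
```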
```python
# Check authorization header format (case-insensitive)
if not authorization.lower().startswith("bearer "):
    raise HTTPException(status_code=401, detail="Invalid authorization header format")

# Extract the token (everything after "Bearer ")
client_key = authorization[7:].strip()

# Ensure the key is not empty
if not client_key:
    raise HTTPException(status_code=401, detail="Missing API key")
```
Parsing Authorization by slicing at index 7 assumes a single space and no extra whitespace. Use more robust parsing to handle extra whitespace around the token: `scheme, _, token = authorization.partition(' ')`, then `if scheme.lower() != 'bearer' or not token.strip(): raise HTTPException(...)`. This avoids edge cases and keeps the check case-insensitive.
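A short sketch of that parsing, reusing the 401 details from the current code; the function wrapper is illustrative:

```python
from fastapi import HTTPException

def extract_bearer_token(authorization: str) -> str:
    # Split on the first space; strip() tolerates extra whitespace around the token.
    scheme, _, token = authorization.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer":
        raise HTTPException(status_code=401, detail="Invalid authorization header format")
    if not token:
        raise HTTPException(status_code=401, detail="Missing API key")
    return token
```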
Thank you for your contribution! After a careful review, I would suggest the following changes:
Thanks again for your great work on Toolify!