TTFT : Staged Encoder (E) & PD Timing with Socket Reporting and Proxy Aggregation #12
base: v0.9.1
Conversation
Code Review
This pull request introduces a Time-to-First-Token (TTFT) reporting pipeline, which is a valuable addition for performance monitoring. The implementation correctly adds timing for key stages and reports them. However, I've identified a critical issue in how the feature is enabled via an environment variable, which causes it to be active by default, contrary to the documentation. Additionally, there's a high-severity issue regarding the accuracy of the prefill_compute_time_ms metric in chunked prefill scenarios. Addressing these points will significantly improve the robustness and correctness of this new feature.
self.in_profile_run = False

# TTFT timecount
self.TTFT_ENABLED = os.environ.get("TTFT_ENABLED", "0")
The environment variable TTFT_ENABLED is read as a string. When not set, it defaults to "0". In Python, any non-empty string is truthy, so if "0": evaluates to True. This causes the TTFT feature to be enabled by default, which contradicts the PR description ("default off") and can lead to unexpected overhead and network traffic. The check should be more explicit to only enable the feature for values like "1" or "true".
Suggested change:
- self.TTFT_ENABLED = os.environ.get("TTFT_ENABLED", "0")
+ self.TTFT_ENABLED = os.environ.get("TTFT_ENABLED", "0").lower() in ("1", "true")
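For reference, the pitfall is easy to reproduce in a plain Python session (nothing project-specific assumed here):

>>> import os
>>> os.environ.pop("TTFT_ENABLED", None)        # make sure the variable is unset
>>> flag = os.environ.get("TTFT_ENABLED", "0")  # falls back to the string "0"
>>> bool(flag)                                  # any non-empty string is truthy
True
>>> flag.lower() in ("1", "true")               # explicit parse gives the intended default
False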
for req_id in scheduler_output.num_scheduled_tokens.keys():
    if req_id not in self._ttft_prefill_compute_reported:
        rid = req_id
        if rid.startswith("chatcmpl-"):
            rid = rid[len("chatcmpl-"):]
        payload = {
            "role": "pd",
            "request_id": rid,
            "emb_cache_transfer_time_ms": xfer_ms if mm_embeds else 0.0,
            "prefill_compute_time_ms": prefill_ms,
        }
        self._ttft_prefill_compute_reported.add(req_id)
        send_ttft_report(payload)
The prefill_compute_time_ms metric currently reports the compute time of the batch that contains the first prefill chunk for a request. If chunked prefill is used, this value does not represent the total prefill compute time for the request, which can be misleading. Since the proxy uses a "first-write wins" merge strategy, subsequent reports for other chunks of the same request would be ignored.
To report the correct total prefill time, you could consider one of the following approaches:
- Accumulate the prefill time for each request across all its chunks within the worker and send a single report only when the prefill phase is complete for that request (sketched below).
- Modify the proxy to sum values for the same metric instead of only keeping the first value. This would allow the worker to report the time for each chunk as a partial value.
As it stands, the reported metric is inaccurate for chunked prefill scenarios.
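A minimal sketch of the first (worker-side accumulation) option. It reuses rid, send_ttft_report, and self._ttft_prefill_compute_reported from the surrounding patch; the per-request accumulator self._ttft_prefill_ms_acc and the request_done_prefilling() check are hypothetical names, not code from the PR:

# prefill_ms is the compute time of the current chunk only.
# Accumulate it per request instead of reporting the first chunk's time.
acc = self._ttft_prefill_ms_acc.get(req_id, 0.0)
self._ttft_prefill_ms_acc[req_id] = acc + prefill_ms

# Report once, only after the request's prefill phase has fully completed,
# so the proxy's first-write-wins merge stores the total prefill time.
if request_done_prefilling(req_id) and req_id not in self._ttft_prefill_compute_reported:
    self._ttft_prefill_compute_reported.add(req_id)
    send_ttft_report({
        "role": "pd",
        "request_id": rid,
        "prefill_compute_time_ms": self._ttft_prefill_ms_acc.pop(req_id),
    })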
What this PR does / why we need it?
This PR adds an optional Time‑to‑First‑Token (TTFT) pipeline. When enabled, the Encoder (E) and Prefill/Decode (PD) components timestamp key latency stages and asynchronously report them to the proxy, which merges per‑request partials.
Environment Variables
TTFT_ENABLED=1 (default off). When unset or set to 0, the feature adds no overhead and no network traffic.
VLLM_TTFT_REPORT_URL=http://<host>:<port>/ttft_report. Target HTTP endpoint (the proxy). If unset while TTFT is disabled, it has no effect; if unset while TTFT is enabled, the reporter falls back to http://127.0.0.1:28886/ttft_report (the default can be documented / adjusted).
Collected Stage Metrics (all ms):
enc_queue_time_ms
enc_compute_time_ms
emb_cache_transfer_time_ms (optional)
prefill_queue_time_ms
prefill_compute_time_ms
Derived when all present: ttft_ms = sum(stage_times).
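For illustration, the derivation amounts to the following (a sketch; only the stage names listed above are assumed):

from typing import Optional

STAGE_KEYS = (
    "enc_queue_time_ms",
    "enc_compute_time_ms",
    "emb_cache_transfer_time_ms",
    "prefill_queue_time_ms",
    "prefill_compute_time_ms",
)

def derive_ttft_ms(combined: dict) -> Optional[float]:
    # Only derive ttft_ms once every stage metric has been reported.
    values = [combined.get(key) for key in STAGE_KEYS]
    if any(value is None for value in values):
        return None
    return sum(values)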
Transport
Each stage is emitted exactly once as a tiny JSON payload:
{"role":"meta|encoder|pd","request_id":"","<metric_name>":}
Sent via fire‑and‑forget TCP socket (raw HTTP/1.1 POST, no response wait) to minimize tail latency impact; failures are silently dropped.
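A minimal sketch of such a fire-and-forget sender, using only the standard library (the helper name send_ttft_report matches the patch, but this body is illustrative, not the PR's actual implementation):

import json
import os
import socket
from urllib.parse import urlparse

def send_ttft_report(payload: dict) -> None:
    # Best-effort report: build a raw HTTP/1.1 POST, send it, never read the response.
    url = os.environ.get("VLLM_TTFT_REPORT_URL", "http://127.0.0.1:28886/ttft_report")
    try:
        target = urlparse(url)
        body = json.dumps(payload).encode()
        request = (
            f"POST {target.path or '/'} HTTP/1.1\r\n"
            f"Host: {target.hostname}\r\n"
            "Content-Type: application/json\r\n"
            f"Content-Length: {len(body)}\r\n"
            "Connection: close\r\n\r\n"
        ).encode() + body
        with socket.create_connection((target.hostname, target.port or 80), timeout=0.2) as sock:
            sock.sendall(request)  # no recv(): we do not wait for the proxy's reply
    except Exception:
        pass  # failures are silently dropped to avoid adding tail latency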
Proxy Aggregation
Proxy keeps an in‑memory map keyed by request_id, merges first‑write values, and (optionally) logs only on new stage arrival or completion. Query: GET /ttft_report/<request_id> returns merged by_role + combined.
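A sketch of a first-write-wins merge on the proxy side (illustrative; the store name and handler shape are assumptions, not the PR's code):

from collections import defaultdict

# request_id -> {role -> {metric_name -> value_ms}}
ttft_store = defaultdict(dict)

def merge_report(report: dict) -> None:
    role = report["role"]
    request_id = report["request_id"]
    by_role = ttft_store[request_id].setdefault(role, {})
    for key, value in report.items():
        if key in ("role", "request_id"):
            continue
        # First write wins: repeats of an already-seen metric are ignored.
        by_role.setdefault(key, value)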
Usage
Set TTFT_ENABLED=1 and VLLM_TTFT_REPORT_URL.
Start services (E, PD, proxy).
Inspect /ttft_report/<request_id> for a structured TTFT breakdown (see the query sketch below).
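For example, the report for a finished request can be fetched like this (a sketch; it assumes the proxy runs at 127.0.0.1:28886 as in the test script below and that the response carries the 'combined' view shown in the test results):

import json
import urllib.request

request_id = "3a8713d1-494c-4ec2-8848-672b4571b72d"  # any request id the proxy has seen
url = f"http://127.0.0.1:28886/ttft_report/{request_id}"

with urllib.request.urlopen(url, timeout=5) as resp:
    report = json.load(resp)

print(report["combined"]["ttft_ms"])  # merged end-to-end TTFT in milliseconds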
Does this PR introduce any user-facing change?
Set TTFT_ENABLED=1 and VLLM_TTFT_REPORT_URL.
How was this patch tested?
Details
#!/usr/bin/env bash
set -euo pipefail

wait_for_server() {
local port=$1
timeout 12000 bash -c '
until curl -s "http://localhost:'"$port"'/v1/chat/completions" > /dev/null; do
sleep 1
done
' && return 0 || return 1
}
export VLLM_USE_V1=1
LOG_PATH=${LOG_PATH:-./logs}
mkdir -p "$LOG_PATH"
ENCODE_PORT=19534
PREFILL_DECODE_PORT=19535
PROXY_PORT=28886
GPU_E=4
GPU_PD=5
START_TIME=$(date +"%Y%m%d_%H%M%S")
ENC_LOG="$LOG_PATH/encoder.log"
PD_LOG="$LOG_PATH/pd.log"
PROXY_LOG="$LOG_PATH/proxy.log"
PID_FILE="./pid.txt"
SHARED_STORAGE_PATH="/workspace/l00807937/EPD_count/vllm-LJH-for-jiaofu/examples/online_serving/epd/storage"
export VLLM_TTFT_REPORT_URL=http://127.0.0.1:${PROXY_PORT}/ttft_report
export TTFT_ENABLED=True
export VLLM_HTTP_TIMEOUT_KEEP_ALIVE="${VLLM_HTTP_TIMEOUT_KEEP_ALIVE:-70}"
export CLIENT_HTTP_TIMEOUT_KEEP_ALIVE="${CLIENT_HTTP_TIMEOUT_KEEP_ALIVE:-60}"
if (( $CLIENT_HTTP_TIMEOUT_KEEP_ALIVE >= $VLLM_HTTP_TIMEOUT_KEEP_ALIVE )); then
echo "Error: Client keep alive timeout ($CLIENT_HTTP_TIMEOUT_KEEP_ALIVE) should be"
" < server keep alive timeout ($VLLM_HTTP_TIMEOUT_KEEP_ALIVE)."
exit 1
fi
ASCEND_RT_VISIBLE_DEVICES="$GPU_E" vllm serve "/workspace/models/Qwen2.5-VL-7B-Instruct"
--gpu-memory-utilization 0.0
--port "$ENCODE_PORT"
--enable-request-id-headers
--no-enable-prefix-caching
--max-num-seqs 4
--max-model-len 4096
--max-num-batched-tokens 4096
--enforce-eager
--allowed-local-media-path /workspace/l00807937/
--ec-transfer-config '{
"ec_connector": "ECSharedStorageConnector",
"ec_role": "ec_producer",
"ec_connector_extra_config": {
"shared_storage_path": "'"$SHARED_STORAGE_PATH"'",
"ec_max_num_scheduled_tokens": "4096"
}
}'
>"$ENC_LOG" 2>&1 &
echo $! >> "$PID_FILE"
ASCEND_RT_VISIBLE_DEVICES="$GPU_PD" vllm serve "/workspace/models/Qwen2.5-VL-7B-Instruct"
--gpu-memory-utilization 0.9
--port "$PREFILL_DECODE_PORT"
--enable-request-id-headers
--max-num-seqs 128
--enforce-eager
--max-model-len 4096
--max-num-batched-tokens 4096
--allowed-local-media-path /workspace/l00807937/
--ec-transfer-config '{
"ec_connector": "ECSharedStorageConnector",
"ec_role": "ec_consumer",
"ec_connector_extra_config": {
"shared_storage_path": "'"$SHARED_STORAGE_PATH"'"
}
}'
>"$PD_LOG" 2>&1 &
echo $! >> "$PID_FILE"
wait_for_server "$ENCODE_PORT"
wait_for_server "$PREFILL_DECODE_PORT"
python /workspace/l00807937/EPD_count/vllm-LJH-for-jiaofu/examples/online_serving/epd/proxy.py \
  --host "127.0.0.1" \
  --port "$PROXY_PORT" \
  --encode-servers-urls "http://localhost:$ENCODE_PORT" \
  --prefill-decode-servers-urls "http://localhost:$PREFILL_DECODE_PORT" \
  >"$PROXY_LOG" 2>&1 &
echo $! >> "$PID_FILE"
echo "All services are up!"
Test Result
INFO:__main__:[TTFT] report update for 3a8713d1-494c-4ec2-8848-672b4571b72d: {'request_id': '3a8713d1-494c-4ec2-8848-672b4571b72d', 'combined': {'enc_queue_time_ms': 0.002739951014518738, 'enc_compute_time_ms': 1002.8023803606629, 'emb_cache_transfer_time_ms': 2.2039590403437614, 'prefill_queue_time_ms': 0.0, 'prefill_compute_time_ms': 207.5677178800106, 't_request_ingress_ns': None, 'enc_ingress_wait_ms': None, 'ttft_ms': 1212.5767972320318}}
INFO:__main__:[TTFT] report update for 15879f51-130d-4250-9771-1d8389ae3fd3: {'request_id': '15879f51-130d-4250-9771-1d8389ae3fd3', 'combined': {'enc_queue_time_ms': 0.0034300610423088074, 'enc_compute_time_ms': 56.724236346781254, 'emb_cache_transfer_time_ms': 1.6047926619648933, 'prefill_queue_time_ms': 0.0, 'prefill_compute_time_ms': 28.90088688582182, 't_request_ingress_ns': None, 'enc_ingress_wait_ms': None, 'ttft_ms': 87.23334595561028}}