@LJH-LBJ LJH-LBJ commented Sep 29, 2025

What this PR does / why we need it?

This PR adds an optional Time‑to‑First‑Token (TTFT) pipeline. When enabled, the Encoder (E) and Prefill/Decode (PD) components timestamp key latency stages and asynchronously report them to the proxy, which merges per‑request partials.

Environment Variables

TTFT_ENABLED=1 (default off). When unset or 0, the pipeline adds no overhead and no network traffic.
VLLM_TTFT_REPORT_URL=http://<host>:<port>/ttft_report — target HTTP endpoint (the proxy). If unset while TTFT is disabled, it has no effect; if unset while TTFT is enabled, it falls back to http://127.0.0.1:28886/ttft_report (a default that can be documented / adjusted).
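The resolution order described above (disabled by default, URL fallback applied only when the feature is on) could be sketched as follows; the helper name and structure are illustrative, not the PR's actual code:

```python
import os

DEFAULT_REPORT_URL = "http://127.0.0.1:28886/ttft_report"  # documented fallback

def ttft_config():
    """Resolve the two TTFT variables described above (illustrative helper)."""
    enabled = os.environ.get("TTFT_ENABLED", "0").lower() in ("1", "true")
    url = os.environ.get("VLLM_TTFT_REPORT_URL") or DEFAULT_REPORT_URL
    # When disabled, the URL is irrelevant: nothing is ever sent.
    return enabled, (url if enabled else None)
```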

Collected Stage Metrics (all ms):

enc_queue_time_ms
enc_compute_time_ms
emb_cache_transfer_time_ms (optional)
prefill_queue_time_ms
prefill_compute_time_ms
Derived when all present: ttft_ms = sum(stage_times).

Transport

Each stage is emitted exactly once as a tiny JSON payload:
{"role": "meta|encoder|pd", "request_id": "<id>", "<metric_name>": <value_ms>}
Sent via a fire-and-forget TCP socket (raw HTTP/1.1 POST, no response wait) to minimize tail-latency impact; failures are silently dropped.
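A sketch of what such a fire-and-forget sender might look like — the PR's actual send_ttft_report may differ; this version hand-writes the HTTP/1.1 POST, never reads the response, and swallows all socket errors, matching the behavior described above:

```python
import json
import socket
from urllib.parse import urlparse

def send_ttft_report(url: str, payload: dict, timeout: float = 0.2) -> None:
    """Fire-and-forget POST: write the request, never wait for the response.
    Failures are silently dropped so the hot path is never slowed down."""
    try:
        u = urlparse(url)
        body = json.dumps(payload).encode()
        req = (
            f"POST {u.path} HTTP/1.1\r\n"
            f"Host: {u.hostname}:{u.port}\r\n"
            "Content-Type: application/json\r\n"
            f"Content-Length: {len(body)}\r\n"
            "Connection: close\r\n\r\n"
        ).encode() + body
        with socket.create_connection((u.hostname, u.port), timeout=timeout) as s:
            s.sendall(req)  # no recv(): we do not wait for the response
    except OSError:
        pass  # silently dropped, per the design above
```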

Proxy Aggregation

Proxy keeps an in‑memory map keyed by request_id, merges first‑write values, and (optionally) logs only on new stage arrival or completion. Query: GET /ttft_report/<request_id> returns merged by_role + combined.

Usage

Set TTFT_ENABLED=1 and VLLM_TTFT_REPORT_URL.
Start services (E, PD, proxy).
Inspect /ttft_report/<request_id> for structured TTFT breakdown.

Does this PR introduce any user-facing change?

Yes: two new opt-in environment variables, TTFT_ENABLED and VLLM_TTFT_REPORT_URL. With TTFT_ENABLED unset or 0, behavior is unchanged.

How was this patch tested?

Details

#!/usr/bin/env bash
set -euo pipefail

wait_for_server() {
  local port=$1
  timeout 12000 bash -c '
    until curl -s "http://localhost:'"$port"'/v1/chat/completions" > /dev/null; do
      sleep 1
    done
  ' && return 0 || return 1
}

export VLLM_USE_V1=1

LOG_PATH=${LOG_PATH:-./logs}
mkdir -p "$LOG_PATH"

ENCODE_PORT=19534
PREFILL_DECODE_PORT=19535
PROXY_PORT=28886

GPU_E=4
GPU_PD=5

START_TIME=$(date +"%Y%m%d_%H%M%S")
ENC_LOG="$LOG_PATH/encoder.log"
PD_LOG="$LOG_PATH/pd.log"
PROXY_LOG="$LOG_PATH/proxy.log"
PID_FILE="./pid.txt"

SHARED_STORAGE_PATH="/workspace/l00807937/EPD_count/vllm-LJH-for-jiaofu/examples/online_serving/epd/storage"
export VLLM_TTFT_REPORT_URL=http://127.0.0.1:${PROXY_PORT}/ttft_report
export TTFT_ENABLED=True

export VLLM_HTTP_TIMEOUT_KEEP_ALIVE="${VLLM_HTTP_TIMEOUT_KEEP_ALIVE:-70}"
export CLIENT_HTTP_TIMEOUT_KEEP_ALIVE="${CLIENT_HTTP_TIMEOUT_KEEP_ALIVE:-60}"

if (( $CLIENT_HTTP_TIMEOUT_KEEP_ALIVE >= $VLLM_HTTP_TIMEOUT_KEEP_ALIVE )); then
  echo "Error: Client keep alive timeout ($CLIENT_HTTP_TIMEOUT_KEEP_ALIVE) should be" \
    "< server keep alive timeout ($VLLM_HTTP_TIMEOUT_KEEP_ALIVE)."
  exit 1
fi

ASCEND_RT_VISIBLE_DEVICES="$GPU_E" vllm serve "/workspace/models/Qwen2.5-VL-7B-Instruct" \
  --gpu-memory-utilization 0.0 \
  --port "$ENCODE_PORT" \
  --enable-request-id-headers \
  --no-enable-prefix-caching \
  --max-num-seqs 4 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --enforce-eager \
  --allowed-local-media-path /workspace/l00807937/ \
  --ec-transfer-config '{
    "ec_connector": "ECSharedStorageConnector",
    "ec_role": "ec_producer",
    "ec_connector_extra_config": {
      "shared_storage_path": "'"$SHARED_STORAGE_PATH"'",
      "ec_max_num_scheduled_tokens": "4096"
    }
  }' \
  >"$ENC_LOG" 2>&1 &

echo $! >> "$PID_FILE"

ASCEND_RT_VISIBLE_DEVICES="$GPU_PD" vllm serve "/workspace/models/Qwen2.5-VL-7B-Instruct" \
  --gpu-memory-utilization 0.9 \
  --port "$PREFILL_DECODE_PORT" \
  --enable-request-id-headers \
  --max-num-seqs 128 \
  --enforce-eager \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --allowed-local-media-path /workspace/l00807937/ \
  --ec-transfer-config '{
    "ec_connector": "ECSharedStorageConnector",
    "ec_role": "ec_consumer",
    "ec_connector_extra_config": {
      "shared_storage_path": "'"$SHARED_STORAGE_PATH"'"
    }
  }' \
  >"$PD_LOG" 2>&1 &

echo $! >> "$PID_FILE"

wait_for_server "$ENCODE_PORT"
wait_for_server "$PREFILL_DECODE_PORT"

python /workspace/l00807937/EPD_count/vllm-LJH-for-jiaofu/examples/online_serving/epd/proxy.py \
  --host "127.0.0.1" \
  --port "$PROXY_PORT" \
  --encode-servers-urls "http://localhost:$ENCODE_PORT" \
  --prefill-decode-servers-urls "http://localhost:$PREFILL_DECODE_PORT" \
  >"$PROXY_LOG" 2>&1 &

echo $! >> "$PID_FILE"

echo "All services are up!"

Test Result

INFO:__main__:[TTFT] report update for 3a8713d1-494c-4ec2-8848-672b4571b72d: {'request_id': '3a8713d1-494c-4ec2-8848-672b4571b72d', 'combined': {'enc_queue_time_ms': 0.002739951014518738, 'enc_compute_time_ms': 1002.8023803606629, 'emb_cache_transfer_time_ms': 2.2039590403437614, 'prefill_queue_time_ms': 0.0, 'prefill_compute_time_ms': 207.5677178800106, 't_request_ingress_ns': None, 'enc_ingress_wait_ms': None, 'ttft_ms': 1212.5767972320318}}
INFO:__main__:[TTFT] report update for 15879f51-130d-4250-9771-1d8389ae3fd3: {'request_id': '15879f51-130d-4250-9771-1d8389ae3fd3', 'combined': {'enc_queue_time_ms': 0.0034300610423088074, 'enc_compute_time_ms': 56.724236346781254, 'emb_cache_transfer_time_ms': 1.6047926619648933, 'prefill_queue_time_ms': 0.0, 'prefill_compute_time_ms': 28.90088688582182, 't_request_ingress_ns': None, 'enc_ingress_wait_ms': None, 'ttft_ms': 87.23334595561028}}

Signed-off-by: Junhong <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a Time-to-First-Token (TTFT) reporting pipeline, which is a valuable addition for performance monitoring. The implementation correctly adds timing for key stages and reports them. However, I've identified a critical issue in how the feature is enabled via an environment variable, which causes it to be active by default, contrary to the documentation. Additionally, there's a high-severity issue regarding the accuracy of the prefill_compute_time_ms metric in chunked prefill scenarios. Addressing these points will significantly improve the robustness and correctness of this new feature.

self.in_profile_run = False

# TTFT timecount
self.TTFT_ENABLED = os.environ.get("TTFT_ENABLED", "0")


critical

The environment variable TTFT_ENABLED is read as a string. When not set, it defaults to "0". In Python, any non-empty string is truthy, so if "0": evaluates to True. This causes the TTFT feature to be enabled by default, which contradicts the PR description ("default off") and can lead to unexpected overhead and network traffic. The check should be more explicit to only enable the feature for values like "1" or "true".

Suggested change
self.TTFT_ENABLED = os.environ.get("TTFT_ENABLED", "0")
self.TTFT_ENABLED = os.environ.get("TTFT_ENABLED", "0").lower() in ("1", "true")
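The pitfall the bot describes is easy to demonstrate in isolation: any non-empty string is truthy in Python, so the unfixed check enables the feature even for the default value "0".

```python
# Why the original check misfires: non-empty strings are truthy.
raw = "0"                 # value used when TTFT_ENABLED is unset (the default)
assert bool(raw) is True  # so a plain `if raw:` enables the feature anyway

# The suggested fix compares against an explicit allow-list instead:
fixed = raw.lower() in ("1", "true")
assert fixed is False                       # default is now genuinely "off"
assert "True".lower() in ("1", "true")      # common spellings still opt in
```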

Comment on lines 1236 to 1248
for req_id in scheduler_output.num_scheduled_tokens.keys():
if req_id not in self._ttft_prefill_compute_reported:
rid = req_id
if rid.startswith("chatcmpl-"):
rid = rid[len("chatcmpl-"):]
payload = {
"role": "pd",
"request_id": rid,
"emb_cache_transfer_time_ms": xfer_ms if mm_embeds else 0.0,
"prefill_compute_time_ms": prefill_ms,
}
self._ttft_prefill_compute_reported.add(req_id)
send_ttft_report(payload)


high

The prefill_compute_time_ms metric currently reports the compute time of the batch that contains the first prefill chunk for a request. If chunked prefill is used, this value does not represent the total prefill compute time for the request, which can be misleading. Since the proxy uses a "first-write wins" merge strategy, subsequent reports for other chunks of the same request would be ignored.

To report the correct total prefill time, you could consider one of the following approaches:

  1. Accumulate the prefill time for each request across all its chunks within the worker and send a single report only when the prefill phase is complete for that request.
  2. Modify the proxy to sum values for the same metric instead of only keeping the first value. This would allow the worker to report the time for each chunk as a partial value.

As it stands, the reported metric is inaccurate for chunked prefill scenarios.
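Option 2 could look roughly like this in the proxy's merge step; SUMMED_METRICS and merge_partial are illustrative names, not code from the PR:

```python
# Metrics that should accumulate across chunked-prefill reports rather
# than keep only the first value (illustrative choice).
SUMMED_METRICS = {"prefill_compute_time_ms"}

def merge_partial(entry: dict, partial: dict) -> None:
    """Merge one partial report into the per-request entry."""
    for k, v in partial.items():
        if k in SUMMED_METRICS and k in entry:
            entry[k] += v      # accumulate across prefill chunks
        elif k not in entry:
            entry[k] = v       # everything else stays first-write-wins
```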

