@LJH-LBJ LJH-LBJ commented Sep 29, 2025

What this PR does / why we need it?

This PR adds an optional Time‑to‑First‑Token (TTFT) pipeline. When enabled, the Encoder (E) and Prefill/Decode (PD) components timestamp key latency stages and asynchronously report them to the proxy, which merges per‑request partials.

Environment Variables

TTFT_ENABLED=1 (default off). When unset or 0, the pipeline adds no overhead and no network traffic.
VLLM_TTFT_REPORT_URL=http://<host>:<port>/ttft_report — target HTTP endpoint (the proxy). If unset while TTFT is disabled, it has no effect; if unset while TTFT is enabled, it falls back to http://127.0.0.1:28886/ttft_report (a default that can be documented / adjusted).
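The resolution order described above (disabled by default, URL fallback applied only when the feature is on) could be sketched as follows; the helper name and structure are illustrative, not the PR's actual code:

```python
import os

DEFAULT_REPORT_URL = "http://127.0.0.1:28886/ttft_report"  # documented fallback

def ttft_config():
    """Resolve the two TTFT variables described above (illustrative helper)."""
    enabled = os.environ.get("TTFT_ENABLED", "0").lower() in ("1", "true")
    url = os.environ.get("VLLM_TTFT_REPORT_URL") or DEFAULT_REPORT_URL
    # When disabled, the URL is irrelevant: nothing is ever sent.
    return enabled, (url if enabled else None)
```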

Collected Stage Metrics (all ms):

enc_queue_time_ms
enc_compute_time_ms
emb_cache_transfer_time_ms (optional)
prefill_queue_time_ms
prefill_compute_time_ms
Derived when all present: ttft_ms = sum(stage_times).

Transport

Each stage is emitted exactly once as a tiny JSON payload:
{"role": "meta|encoder|pd", "request_id": "<id>", "<metric_name>": <value_ms>}
Sent via a fire-and-forget TCP socket (raw HTTP/1.1 POST, no response wait) to minimize tail-latency impact; failures are silently dropped.
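A sketch of what such a fire-and-forget sender might look like — the PR's actual send_ttft_report may differ; this version hand-writes the HTTP/1.1 POST, never reads the response, and swallows all socket errors, matching the behavior described above:

```python
import json
import socket
from urllib.parse import urlparse

def send_ttft_report(url: str, payload: dict, timeout: float = 0.2) -> None:
    """Fire-and-forget POST: write the request, never wait for the response.
    Failures are silently dropped so the hot path is never slowed down."""
    try:
        u = urlparse(url)
        body = json.dumps(payload).encode()
        req = (
            f"POST {u.path} HTTP/1.1\r\n"
            f"Host: {u.hostname}:{u.port}\r\n"
            "Content-Type: application/json\r\n"
            f"Content-Length: {len(body)}\r\n"
            "Connection: close\r\n\r\n"
        ).encode() + body
        with socket.create_connection((u.hostname, u.port), timeout=timeout) as s:
            s.sendall(req)  # no recv(): we do not wait for the response
    except OSError:
        pass  # silently dropped, per the design above
```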

Proxy Aggregation

Proxy keeps an in‑memory map keyed by request_id, merges first‑write values, and (optionally) logs only on new stage arrival or completion. Query: GET /ttft_report/<request_id> returns merged by_role + combined.

Usage

Set TTFT_ENABLED=1 and VLLM_TTFT_REPORT_URL.
Start services (E, PD, proxy).
Inspect /ttft_report/<request_id> for structured TTFT breakdown.

Does this PR introduce any user-facing change?

Yes: two new opt-in environment variables, TTFT_ENABLED and VLLM_TTFT_REPORT_URL. With TTFT_ENABLED unset or 0, behavior is unchanged.

How was this patch tested?

Details

#!/usr/bin/env bash
set -euo pipefail

wait_for_server() {
  local port=$1
  timeout 12000 bash -c '
    until curl -s "http://localhost:'"$port"'/v1/chat/completions" > /dev/null; do
      sleep 1
    done
  ' && return 0 || return 1
}

export VLLM_USE_V1=1

LOG_PATH=${LOG_PATH:-./logs}
mkdir -p "$LOG_PATH"

ENCODE_PORT=19534
PREFILL_DECODE_PORT=19535
PROXY_PORT=28886

GPU_E=4
GPU_PD=5

START_TIME=$(date +"%Y%m%d_%H%M%S")
ENC_LOG="$LOG_PATH/encoder.log"
PD_LOG="$LOG_PATH/pd.log"
PROXY_LOG="$LOG_PATH/proxy.log"
PID_FILE="./pid.txt"

SHARED_STORAGE_PATH="/workspace/l00807937/EPD_count/vllm-LJH-for-jiaofu/examples/online_serving/epd/storage"
export VLLM_TTFT_REPORT_URL=http://127.0.0.1:${PROXY_PORT}/ttft_report
export TTFT_ENABLED=True

export VLLM_HTTP_TIMEOUT_KEEP_ALIVE="${VLLM_HTTP_TIMEOUT_KEEP_ALIVE:-70}"
export CLIENT_HTTP_TIMEOUT_KEEP_ALIVE="${CLIENT_HTTP_TIMEOUT_KEEP_ALIVE:-60}"

if (( $CLIENT_HTTP_TIMEOUT_KEEP_ALIVE >= $VLLM_HTTP_TIMEOUT_KEEP_ALIVE )); then
  echo "Error: Client keep alive timeout ($CLIENT_HTTP_TIMEOUT_KEEP_ALIVE) should be" \
    "< server keep alive timeout ($VLLM_HTTP_TIMEOUT_KEEP_ALIVE)."
  exit 1
fi

ASCEND_RT_VISIBLE_DEVICES="$GPU_E" vllm serve "/workspace/models/Qwen2.5-VL-7B-Instruct" \
  --gpu-memory-utilization 0.0 \
  --port "$ENCODE_PORT" \
  --enable-request-id-headers \
  --no-enable-prefix-caching \
  --max-num-seqs 4 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --enforce-eager \
  --allowed-local-media-path /workspace/l00807937/ \
  --ec-transfer-config '{
    "ec_connector": "ECSharedStorageConnector",
    "ec_role": "ec_producer",
    "ec_connector_extra_config": {
      "shared_storage_path": "'"$SHARED_STORAGE_PATH"'",
      "ec_max_num_scheduled_tokens": "4096"
    }
  }' \
  >"$ENC_LOG" 2>&1 &

echo $! >> "$PID_FILE"

ASCEND_RT_VISIBLE_DEVICES="$GPU_PD" vllm serve "/workspace/models/Qwen2.5-VL-7B-Instruct" \
  --gpu-memory-utilization 0.9 \
  --port "$PREFILL_DECODE_PORT" \
  --enable-request-id-headers \
  --max-num-seqs 128 \
  --enforce-eager \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --allowed-local-media-path /workspace/l00807937/ \
  --ec-transfer-config '{
    "ec_connector": "ECSharedStorageConnector",
    "ec_role": "ec_consumer",
    "ec_connector_extra_config": {
      "shared_storage_path": "'"$SHARED_STORAGE_PATH"'"
    }
  }' \
  >"$PD_LOG" 2>&1 &

echo $! >> "$PID_FILE"

wait_for_server "$ENCODE_PORT"
wait_for_server "$PREFILL_DECODE_PORT"

python /workspace/l00807937/EPD_count/vllm-LJH-for-jiaofu/examples/online_serving/epd/proxy.py \
  --host "127.0.0.1" \
  --port "$PROXY_PORT" \
  --encode-servers-urls "http://localhost:$ENCODE_PORT" \
  --prefill-decode-servers-urls "http://localhost:$PREFILL_DECODE_PORT" \
  >"$PROXY_LOG" 2>&1 &

echo $! >> "$PID_FILE"

echo "All services are up!"

Test Result

INFO:__main__:[TTFT] report update for 3a8713d1-494c-4ec2-8848-672b4571b72d: {'request_id': '3a8713d1-494c-4ec2-8848-672b4571b72d', 'combined': {'enc_queue_time_ms': 0.002739951014518738, 'enc_compute_time_ms': 1002.8023803606629, 'emb_cache_transfer_time_ms': 2.2039590403437614, 'prefill_queue_time_ms': 0.0, 'prefill_compute_time_ms': 207.5677178800106, 't_request_ingress_ns': None, 'enc_ingress_wait_ms': None, 'ttft_ms': 1212.5767972320318}}
INFO:__main__:[TTFT] report update for 15879f51-130d-4250-9771-1d8389ae3fd3: {'request_id': '15879f51-130d-4250-9771-1d8389ae3fd3', 'combined': {'enc_queue_time_ms': 0.0034300610423088074, 'enc_compute_time_ms': 56.724236346781254, 'emb_cache_transfer_time_ms': 1.6047926619648933, 'prefill_queue_time_ms': 0.0, 'prefill_compute_time_ms': 28.90088688582182, 't_request_ingress_ns': None, 'enc_ingress_wait_ms': None, 'ttft_ms': 87.23334595561028}}

Signed-off-by: Junhong <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a Time-to-First-Token (TTFT) reporting pipeline, which is a valuable addition for performance monitoring. The implementation correctly adds timing for key stages and reports them. However, I've identified a critical issue in how the feature is enabled via an environment variable, which causes it to be active by default, contrary to the documentation. Additionally, there's a high-severity issue regarding the accuracy of the prefill_compute_time_ms metric in chunked prefill scenarios. Addressing these points will significantly improve the robustness and correctness of this new feature.

self.in_profile_run = False

# TTFT timecount
self.TTFT_ENABLED = os.environ.get("TTFT_ENABLED", "0")


critical

The environment variable TTFT_ENABLED is read as a string. When not set, it defaults to "0". In Python, any non-empty string is truthy, so if "0": evaluates to True. This causes the TTFT feature to be enabled by default, which contradicts the PR description ("default off") and can lead to unexpected overhead and network traffic. The check should be more explicit to only enable the feature for values like "1" or "true".

Suggested change
self.TTFT_ENABLED = os.environ.get("TTFT_ENABLED", "0")
self.TTFT_ENABLED = os.environ.get("TTFT_ENABLED", "0").lower() in ("1", "true")
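The pitfall the bot describes is easy to demonstrate in isolation: any non-empty string is truthy in Python, so the unfixed check enables the feature even for the default value "0".

```python
# Why the original check misfires: non-empty strings are truthy.
raw = "0"                 # value used when TTFT_ENABLED is unset (the default)
assert bool(raw) is True  # so a plain `if raw:` enables the feature anyway

# The suggested fix compares against an explicit allow-list instead:
fixed = raw.lower() in ("1", "true")
assert fixed is False                       # default is now genuinely "off"
assert "True".lower() in ("1", "true")      # common spellings still opt in
```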

Comment on lines 1236 to 1248
for req_id in scheduler_output.num_scheduled_tokens.keys():
if req_id not in self._ttft_prefill_compute_reported:
rid = req_id
if rid.startswith("chatcmpl-"):
rid = rid[len("chatcmpl-"):]
payload = {
"role": "pd",
"request_id": rid,
"emb_cache_transfer_time_ms": xfer_ms if mm_embeds else 0.0,
"prefill_compute_time_ms": prefill_ms,
}
self._ttft_prefill_compute_reported.add(req_id)
send_ttft_report(payload)


high

The prefill_compute_time_ms metric currently reports the compute time of the batch that contains the first prefill chunk for a request. If chunked prefill is used, this value does not represent the total prefill compute time for the request, which can be misleading. Since the proxy uses a "first-write wins" merge strategy, subsequent reports for other chunks of the same request would be ignored.

To report the correct total prefill time, you could consider one of the following approaches:

  1. Accumulate the prefill time for each request across all its chunks within the worker and send a single report only when the prefill phase is complete for that request.
  2. Modify the proxy to sum values for the same metric instead of only keeping the first value. This would allow the worker to report the time for each chunk as a partial value.

As it stands, the reported metric is inaccurate for chunked prefill scenarios.
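Option 2 could look roughly like this in the proxy's merge step; SUMMED_METRICS and merge_partial are illustrative names, not code from the PR:

```python
# Metrics that should accumulate across chunked-prefill reports rather
# than keep only the first value (illustrative choice).
SUMMED_METRICS = {"prefill_compute_time_ms"}

def merge_partial(entry: dict, partial: dict) -> None:
    """Merge one partial report into the per-request entry."""
    for k, v in partial.items():
        if k in SUMMED_METRICS and k in entry:
            entry[k] += v      # accumulate across prefill chunks
        elif k not in entry:
            entry[k] = v       # everything else stays first-write-wins
```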

