adding Context Length Specialization (CCL) #565

vjanfaza · 2025-09-19T06:02:58Z

The Compute Context Length (CCL) technique improves the throughput of large language models (LLMs) on Qualcomm devices when handling very large context lengths. The current ahead-of-time (AOT) compilation flow on Qualcomm devices cannot predict how many tokens a request will need, so attention is always computed over the full, large context length. This causes significant throughput drops during both the prefill and decode phases. To address this, we introduce Compute Context Length (CCL), an additional ONNX variable that enables dynamic context-length specialization. Generating tokens against smaller, more manageable context lengths reduces memory reads and attention computation, thereby improving throughput.
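To make the idea concrete, here is a minimal sketch of how a runtime might pick a CCL value for the current token position. It is an illustration under stated assumptions, not the code in this PR: the function name `select_ccl`, the `ccl_buckets` list, and the bucket sizes used below are all hypothetical. The sketch assumes the model was specialized for a handful of CCL values and that the full context length is always available as a fallback.

```python
from bisect import bisect_left


def select_ccl(position: int, ccl_buckets: list[int], ctx_len: int) -> int:
    """Return the smallest compiled CCL that covers the tokens cached so far.

    `position` is the 0-based index of the token being processed, so
    `position + 1` key/value entries must be visible to attention.
    All names here are hypothetical and only illustrate the selection step.
    """
    if not 0 <= position < ctx_len:
        raise ValueError(f"position {position} is outside [0, {ctx_len})")

    # Keep only usable buckets, deduplicated and sorted ascending.
    buckets = sorted({b for b in ccl_buckets if 0 < b <= ctx_len})

    # The full context length always acts as the last-resort bucket.
    if not buckets or buckets[-1] != ctx_len:
        buckets.append(ctx_len)

    needed = position + 1
    # First bucket whose size is >= the number of tokens attention must read.
    return buckets[bisect_left(buckets, needed)]


if __name__ == "__main__":
    ctx_len = 32768
    ccl_buckets = [4096, 8192, 16384]  # assumed specializations
    for pos in (100, 4095, 4096, 20000):
        print(pos, select_ccl(pos, ccl_buckets, ctx_len))
```

With this kind of selection, a request that is only a few thousand tokens long reads and attends over a 4096-entry window instead of the full 32768-entry context, which is where the reduction in memory reads comes from.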

Signed-off-by: Vahid Janfaza <[email protected]>

adding Context Length Specialization (CCL)

b9b2f54

Signed-off-by: Vahid Janfaza <[email protected]>

vjanfaza requested review from quic-rishinr, ochougul, quic-hemagnih and quic-amitraj as code owners September 19, 2025 06:02

adding Context Length Specialization (CCL)

3018feb

Signed-off-by: Vahid Janfaza <[email protected]>

