Currently, FlexPrefill is only implemented during the prefilling phase. For a detailed explanation of why it isn’t applied to the decoding stage, please refer to this issue: #1.
Regarding kernel benchmarking: the sparsity in FlexPrefill depends on the actual input to the LLM, so benchmarking with random inputs is not meaningful. You can run the script tests/test_llm.py to perform a real model test, which reports the end-to-end time for the full generation process. For a more controlled test, we recently uploaded a new file for benchmarking the kernel at a fixed sparsity ratio. You can use the script tests/kernel_benchmark.py to assess the kernel's performance in this setting.
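To make the "fixed sparsity ratio" idea concrete, here is a small NumPy sketch of what such a benchmark measures. This is not the repository's kernel (the real one is a GPU kernel); the function names (`make_block_mask`, `masked_attention`, `bench`) and the block-sparse layout are illustrative assumptions. The point is that you fix the fraction of inactive query/key block pairs up front, rather than letting it depend on model inputs.

```python
import time
import numpy as np

def make_block_mask(num_blocks, sparsity, seed=0):
    # Hypothetical helper: keep a fixed fraction (1 - sparsity) of
    # (query-block, key-block) pairs active, independent of any input.
    rng = np.random.default_rng(seed)
    mask = rng.random((num_blocks, num_blocks)) < (1.0 - sparsity)
    np.fill_diagonal(mask, True)  # always keep the diagonal blocks
    return mask

def masked_attention(q, k, v, mask, block):
    # Reference implementation that only computes the active blocks;
    # a real kernel would skip the masked blocks on the GPU instead.
    n, d = q.shape
    out = np.zeros_like(v)
    nb = n // block
    for i in range(nb):
        qs = slice(i * block, (i + 1) * block)
        score_cols, key_slices = [], []
        for j in range(nb):
            if mask[i, j]:
                ks = slice(j * block, (j + 1) * block)
                score_cols.append(q[qs] @ k[ks].T)
                key_slices.append(ks)
        s = np.concatenate(score_cols, axis=1) / np.sqrt(d)
        p = np.exp(s - s.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        out[qs] = p @ np.concatenate([v[ks] for ks in key_slices], axis=0)
    return out

def bench(fn, *args, iters=10):
    # Average wall-clock time per call after one warm-up run.
    fn(*args)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters
```

Timing `masked_attention` with sparsity 0.0 versus, say, 0.9 then shows the kernel-level speedup attainable at that fixed ratio, decoupled from whatever sparsity a real prompt would induce.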
If you have any further questions or need additional assistance, please feel free to reach out.
Thanks for your great work. I wonder how this method performs in the decoding stage. And is there any benchmark for the kernel?