
Any analysis or speedup test for the decoding stage? #2

Open
foreverpiano opened this issue Mar 3, 2025 · 1 comment

Comments


foreverpiano commented Mar 3, 2025

Thanks for your great work. I wonder how this method performs in the decoding stage. Is there any benchmark for the kernel?

@XunhaoLai (Collaborator)

Hi @foreverpiano , thank you for your interest.

Currently, FlexPrefill is only implemented during the prefilling phase. For a detailed explanation of why it isn’t applied to the decoding stage, please refer to this issue: #1.

Regarding kernel benchmarking: the sparsity in FlexPrefill depends on the actual input to the LLM, so benchmarking with random inputs is not meaningful. You can run tests/test_llm.py to test on a real model, which reports the end-to-end generation time. For a more controlled measurement, we recently added tests/kernel_benchmark.py, which benchmarks the kernel at a fixed sparsity ratio.
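For reference, here is a minimal sketch of what such an end-to-end timing run can look like. This is not the repo's harness (tests/test_llm.py is the authoritative version): the model id is a placeholder, the input file name is hypothetical, and the step that actually enables FlexPrefill is omitted since it depends on the repo's own patching code.

```python
# Sketch: wall-clock generation timing with a HuggingFace-style causal LM.
# Assumptions: a CUDA GPU is available, the placeholder model id is
# accessible, and enabling FlexPrefill itself is omitted here
# (see tests/test_llm.py in the repo for the real setup).
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16
).cuda()

# Use a genuinely long prompt: FlexPrefill's sparsity (and therefore its
# speedup) depends on the actual input, so short or random prompts are
# not representative.
prompt = open("long_document.txt").read()  # hypothetical input file
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Warm-up run so kernel compilation/caching is excluded from the timing.
model.generate(**inputs, max_new_tokens=8)

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} new tokens in {elapsed:.2f}s "
      f"({new_tokens / elapsed:.1f} tok/s)")
```

Note that this measures prefill and decode together; since FlexPrefill only accelerates the prefill phase, the longer the prompt is relative to max_new_tokens, the more visible its effect will be.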

If you have any further questions or need additional assistance, please feel free to reach out.
