Currently, FlexPrefill is only implemented during the prefilling phase. For a detailed explanation of why it isn’t applied to the decoding stage, please refer to this issue: #1.
Regarding kernel benchmarking: the sparsity in FlexPrefill depends on the actual input to the LLM, so benchmarking with random inputs is not meaningful. You can run the script tests/test_llm.py to perform a real model test, which reports the end-to-end time for the full generation process. For a more controlled test, we recently uploaded a new file for benchmarking the kernel at a fixed sparsity ratio. You can use the script tests/kernel_benchmark.py to assess the kernel's performance in this setting.
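To make the "fixed sparsity ratio" idea concrete, here is a small NumPy sketch of what such a benchmark measures. This is not the repository's kernel (the real one is a GPU kernel); the function names (`make_block_mask`, `masked_attention`, `bench`) and the block-sparse layout are illustrative assumptions. The point is that you fix the fraction of inactive query/key block pairs up front, rather than letting it depend on model inputs.

```python
import time
import numpy as np

def make_block_mask(num_blocks, sparsity, seed=0):
    # Hypothetical helper: keep a fixed fraction (1 - sparsity) of
    # (query-block, key-block) pairs active, independent of any input.
    rng = np.random.default_rng(seed)
    mask = rng.random((num_blocks, num_blocks)) < (1.0 - sparsity)
    np.fill_diagonal(mask, True)  # always keep the diagonal blocks
    return mask

def masked_attention(q, k, v, mask, block):
    # Reference implementation that only computes the active blocks;
    # a real kernel would skip the masked blocks on the GPU instead.
    n, d = q.shape
    out = np.zeros_like(v)
    nb = n // block
    for i in range(nb):
        qs = slice(i * block, (i + 1) * block)
        score_cols, key_slices = [], []
        for j in range(nb):
            if mask[i, j]:
                ks = slice(j * block, (j + 1) * block)
                score_cols.append(q[qs] @ k[ks].T)
                key_slices.append(ks)
        s = np.concatenate(score_cols, axis=1) / np.sqrt(d)
        p = np.exp(s - s.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        out[qs] = p @ np.concatenate([v[ks] for ks in key_slices], axis=0)
    return out

def bench(fn, *args, iters=10):
    # Average wall-clock time per call after one warm-up run.
    fn(*args)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters
```

Timing `masked_attention` with sparsity 0.0 versus, say, 0.9 then shows the kernel-level speedup attainable at that fixed ratio, decoupled from whatever sparsity a real prompt would induce.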
If you have any further questions or need additional assistance, please feel free to reach out.
Thanks for your great work. I wonder how this method performs in the decoding stage. And is there any benchmark for the kernel?