Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
[NeurIPS 2024] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud or AI HW.
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
DepthStream Accelerator: A TensorRT-optimized monocular depth estimation tool with ROS2 integration for C++. It offers high-speed, accurate depth perception, perfect for real-time applications in robotics, autonomous vehicles, and interactive 3D environments.
Convert and run scikit-learn MLPs on Rockchip NPU.
Code for paper "TLEE: Temporal-wise and Layer-wise Early Exiting Network for Efficient Video Recognition on Edge Devices"
Modified inference engine for quantized convolution using product quantization
Code for paper "Deep Reinforcement Learning based Multi-task Automated Channel Pruning for DNNs"