Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
[NeurIPS 2024] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud or AI HW.
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
DepthStream Accelerator: A TensorRT-optimized monocular depth estimation tool with ROS2 integration for C++. It offers high-speed, accurate depth perception, perfect for real-time applications in robotics, autonomous vehicles, and interactive 3D environments.
Convert and run scikit-learn MLPs on Rockchip NPU.
Code for paper "TLEE: Temporal-wise and Layer-wise Early Exiting Network for Efficient Video Recognition on Edge Devices"
Modified inference engine for quantized convolution using product quantization
Code for paper "Deep Reinforcement Learning based Multi-task Automated Channel Pruning for DNNs"