Whisper - Missing parameters for triton deployment using tensorrt_llm backend #672

Open
eleapttn opened this issue Jan 2, 2025 · 0 comments
Labels: bug (Something isn't working)
eleapttn commented Jan 2, 2025

System Info

Hello,

I'm trying to deploy Whisper large-v3 with Triton and the tensorrtllm backend, following this readme: https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.16.0/docs/whisper.md

Context

  • hardware: L40S
  • version of tensorrtllm_backend: v0.16.0
  • checkpoint conversion done (success)
  • TensorRT-LLM engines building done (success)

Issues

However, I hit issues when I get to step 3 (Prepare Tritonserver configs), because parameters are missing when filling in the config file with the following script:

python3 tools/fill_template.py -i model_repo_whisper/tensorrt_llm/config.pbtxt triton_backend:${BACKEND},engine_dir:${DECODER_ENGINE_PATH},encoder_engine_dir:${ENCODER_ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},max_queue_size:${MAX_QUEUE_SIZE},enable_context_fmha_fp32_acc:${ENABLE_CONTEXT_FMHA_FP32_ACC},cross_kv_cache_fraction:${CROSS_KV_CACHE_FRACTION},encoder_input_features_data_type:TYPE_FP16
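
As far as I understand, fill_template.py simply substitutes each name:value pair into the matching ${name} placeholder of config.pbtxt, so any placeholder that is not given a value stays in the file as a literal ${...}. Roughly what it does for a single parameter (illustration only, the real script is Python-based):

# hypothetical sed equivalent of filling one placeholder in place
sed -i 's/\${triton_backend}/tensorrtllm/g' model_repo_whisper/tensorrt_llm/config.pbtxt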

My questions are:

  • Why do we need a tensorrt_llm "model" to run the triton server for whisper_bls?
  • If it is required, how do we set up these parameters for a Whisper model?

Thank you 🙂

Who can help?

@juney-nvidia

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

In https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.16.0/docs/whisper.md, at step 3:

BACKEND=tensorrtllm
DECOUPLED_MODE=false
DECODER_ENGINE_PATH=${output_dir}/decoder
ENCODER_ENGINE_PATH=${output_dir}/encoder
MAX_TOKENS_IN_KV_CACHE=24000
BATCHING_STRATEGY=inflight_fused_batching
KV_CACHE_FREE_GPU_MEM_FRACTION=0.5
EXCLUDE_INPUT_IN_OUTPUT=True
TRITON_MAX_BATCH_SIZE=8
MAX_QUEUE_DELAY_MICROSECONDS=0
MAX_BEAM_WIDTH=1
MAX_QUEUE_SIZE="0"
ENABLE_KV_CACHE_REUSE=false
ENABLE_CHUNKED_CONTEXT=false
CROSS_KV_CACHE_FRACTION="0.5"
n_mels=128
zero_pad=false

python3 tools/fill_template.py -i model_repo_whisper/tensorrt_llm/config.pbtxt triton_backend:${BACKEND},engine_dir:${DECODER_ENGINE_PATH},encoder_engine_dir:${ENCODER_ENGINE_PATH},decoupled_mode:${DECOUPLED_MODE},max_tokens_in_paged_kv_cache:${MAX_TOKENS_IN_KV_CACHE},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},batch_scheduler_policy:${BATCH_SCHEDULER_POLICY},batching_strategy:${BATCHING_STRATEGY},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MICROSECONDS},max_beam_width:${MAX_BEAM_WIDTH},enable_kv_cache_reuse:${ENABLE_KV_CACHE_REUSE},normalize_log_probs:${NORMALIZE_LOG_PROBS},enable_chunked_context:${ENABLE_CHUNKED_CONTEXT},gpu_device_ids:${GPU_DEVICE_IDS},decoding_mode:${DECODING_MODE},max_queue_size:${MAX_QUEUE_SIZE},enable_context_fmha_fp32_acc:${ENABLE_CONTEXT_FMHA_FP32_ACC},cross_kv_cache_fraction:${CROSS_KV_CACHE_FRACTION},encoder_input_features_data_type:TYPE_FP16
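
Before launching tritonserver, a quick sanity check (my own addition, not from the docs) is to look for placeholders that were left unfilled, since those are what trigger the protobuf parse error below:

# list remaining ${...} placeholders, with line numbers
grep -n '\${' model_repo_whisper/tensorrt_llm/config.pbtxt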

Expected behavior

Variable not found when running the script:

python3 tools/fill_template.py -i model_repo_whisper/tensorrt_llm/config.pbtxt ...

Or in tritonserver logs:

[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:337] Error parsing text-format inference.ModelConfig: 105:16: Expected integer or identifier, got: $
E0102 18:16:16.688605 46342 model_repository_manager.cc:1460] "Poll failed for model directory 'tensorrt_llm': failed to read text proto from /workspace/model_repo/l40s/openai_whisper-large-v3_int8/tensorrt_llm/config.pbtxt"
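
The parser error points at line 105, column 16 of the generated config, so that line presumably still contains a literal ${...} placeholder; it can be checked directly, for example:

# print the line the protobuf parser is complaining about
sed -n '105p' /workspace/model_repo/l40s/openai_whisper-large-v3_int8/tensorrt_llm/config.pbtxt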

Actual behavior

Missing parameters to fill the config.pbtxt.

Additional notes

I tried to add the following parameters, but other parameters are still missing:

MAX_ATTENTION_WINDOW_SIZE=448
BATCH_SCHEDULER_POLICY=max_utilization
NORMALIZE_LOG_PROBS=false
GPU_DEVICE_IDS=""
DECODING_MODE=""
ENABLE_CONTEXT_FMHA_FP32_ACC=true
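
To see exactly which parameters are still missing after this, I enumerate the unique placeholder names left in the file (again just a local check, not from the docs):

# enumerate the template variables that are still unfilled
grep -o '\${[a-zA-Z0-9_]*}' model_repo_whisper/tensorrt_llm/config.pbtxt | sort -u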