DeepSeek-R1-Distill-Qwen-32B FP16 model does not work with Triton server + tensorrtllm_backend (but it works with just TensorRT-LLM) #685

Open · 2 of 4 tasks
kelkarn opened this issue Jan 30, 2025 · 2 comments
Labels: bug (Something isn't working)

kelkarn commented Jan 30, 2025

System Info

Environment

CPU architecture: x86_64
CPU/Host memory size: 220 GiB

GPU properties

GPU name: A100
GPU memory size: 80GB
I am using the Azure offering of this GPU: Standard NC24ads A100 v4 (24 vcpus, 220 GiB memory)

Libraries

TensorRT-LLM branch or tag: v0.16.0
Container used: nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 (following the support matrix)

NVIDIA driver version: 535.54.03

OS:

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

Who can help?

@byshiue @schetlur-nv

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Convert checkpoint after downloading it from HuggingFace (this works fine):
python3 convert_checkpoint.py --model_dir ./tmp/Qwen/32B/ --output_dir ./tllm_checkpoint_1gpu_fp16 --dtype float16 --workers 4

[TensorRT-LLM] TensorRT-LLM version: 0.16.0
0.16.0
518it [00:31, 16.51it/s] 
Total time of converting checkpoints: 00:06:46
  2. Build the TRT-LLM engine file (this works fine):
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
--output_dir ./tmp/qwen/32B/trt_engines/fp16/1-gpu \
--gemm_plugin float16 --max_input_len 16384 --reduce_fusion enable \
--use_paged_context_fmha enable --multiple_profiles enable
  3. Copy the engine into the models directory:
mkdir -p /engines/

# Copy the TRT-LLM engine bits to a common folder.
# Run within /tensorrtllm_backend/tensorrt_llm/examples/qwen
cp -r ./tmp/qwen/32B/trt_engines/fp16/1-gpu /engines/.

# Set up other moving parts for pre/post processing and ensemble config...
mkdir /triton_model_repo
cp -r /tensorrtllm_backend/all_models/inflight_batcher_llm/* /triton_model_repo/
  4. Run fill_template.py to fill in the various config values:
ENGINE_DIR=/engines/DeepSeek-R1-Distill-Qwen-32B
TOKENIZER_DIR=/tensorrtllm_backend/tensorrt_llm/examples/qwen/tmp/Qwen/32B
MODEL_FOLDER=/triton_model_repo
TRITON_MAX_BATCH_SIZE=4
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=0
MAX_QUEUE_SIZE=0
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
DECOUPLED_MODE=false

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
  5. Run Triton Server 24.12 and point it to the model repo:
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/triton_model_repo

I followed the instructions here: https://github.com/triton-inference-server/tensorrtllm_backend/tree/v0.16.0
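
For completeness, the build parameters baked into the engine can be checked directly; the snippet below is a sketch (the path is the engine output directory from step 2 and the JSON keys follow the 0.16 builder output, so verify against your own engine):

# Sketch: print the max_batch_size / max_num_tokens recorded by trtllm-build in config.json.
python3 -c "import json; c = json.load(open('./tmp/qwen/32B/trt_engines/fp16/1-gpu/config.json')); print(c['build_config']['max_batch_size'], c['build_config']['max_num_tokens'])"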

Expected behavior

I expect Triton Server to start successfully, show the DeepSeek-R1-Distill-Qwen-32B model in the READY state, and listen on ports 8000 and 8001 for HTTP and gRPC requests respectively.

Actual behavior

I get a CUDA out-of-memory error:

root@nishant-a100-test3:/tensorrtllm_backend/tensorrt_llm/examples/qwen# python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/triton_model_repo
root@nishant-a100-test3:/tensorrtllm_backend/tensorrt_llm/examples/qwen# I0130 02:01:55.022200 527 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7f8364000000' with size 268435456"
I0130 02:01:55.024662 527 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0130 02:01:55.030167 527 model_lifecycle.cc:473] "loading: postprocessing:1"
I0130 02:01:55.030220 527 model_lifecycle.cc:473] "loading: preprocessing:1"
I0130 02:01:55.030306 527 model_lifecycle.cc:473] "loading: tensorrt_llm:1"
I0130 02:01:55.030401 527 model_lifecycle.cc:473] "loading: tensorrt_llm_bls:1"
I0130 02:01:55.111984 527 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I0130 02:01:55.112088 527 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I0130 02:01:55.145647 527 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I0130 02:01:55.145686 527 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I0130 02:01:55.145691 527 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I0130 02:01:55.145694 527 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"false\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I0130 02:01:55.151095 527 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] participant_ids is not specified, will be automatically set
I0130 02:01:55.151129 527 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
[TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa, redrafter, lookahead, eagle}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (131072) * 64
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
I0130 02:01:56.571236 527 model_lifecycle.cc:849] "successfully loaded 'tensorrt_llm_bls'"
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
I0130 02:01:57.379949 527 model_lifecycle.cc:849] "successfully loaded 'postprocessing'"
[TensorRT-LLM][WARNING] 'max_num_images' parameter is not set correctly (value is ${max_num_images}). Will be set to None
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
I0130 02:01:58.262482 527 model_lifecycle.cc:849] "successfully loaded 'preprocessing'"
[TensorRT-LLM][INFO] Loaded engine size: 62651 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1476.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 62556 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 62556 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 62556 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 2. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +0, now: CPU 1, GPU 62556 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 3. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 1, GPU 62556 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 4. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 1, GPU 62556 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 5. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 7.16 GB GPU memory for runtime buffers.
E0130 02:02:39.662870 527 backend_model.cc:692] "ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mMemPool->getPool(), mCudaStream->get()): out of memory (/workspace/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:125)\n1       0x7f82d92ba69a void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 138\n2       0x7f82db02ae96 tensorrt_llm::runtime::BufferManager::gpu(unsigned long, nvinfer1::DataType) const + 534\n3       0x7f82db0396af tensorrt_llm::runtime::DecodingLayerWorkspace::DecodingLayerWorkspace(std::shared_ptr<tensorrt_llm::runtime::BufferManager>, tensorrt_llm::layers::DecoderDomain const&, nvinfer1::DataType, unsigned long) + 1391\n4       0x7f82db083481 tensorrt_llm::runtime::GptDecoder<float>::GptDecoder(tensorrt_llm::executor::DecodingMode const&, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, std::shared_ptr<tensorrt_llm::runtime::CudaStream> const&, std::shared_ptr<tensorrt_llm::runtime::SpeculativeDecodingModule const>) + 705\n5       0x7f82db08fcfb tensorrt_llm::runtime::GptDecoderBatched::setup(tensorrt_llm::executor::DecodingMode const&, int, int, int, int, int, int, nvinfer1::DataType, tensorrt_llm::runtime::ModelConfig const&) + 4123\n6       0x7f82db578af9 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::createDecoder(std::optional<tensorrt_llm::executor::DecodingMode> const&) + 825\n7       0x7f82db58dacf tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 3007\n8       0x7f82db51151e tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 526\n9       0x7f82db628029 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185\n10      0x7f82db6286bd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229\n11      0x7f82db62990a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474\n12      0x7f82db60f757 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, 
tensorrt_llm::executor::ExecutorConfig const&) + 87\n13      0x7f83a43d538e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3238e) [0x7f83a43d538e]\n14      0x7f83a43d1c39 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185\n15      0x7f83a43d2182 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n16      0x7f83a43bf319 TRITONBACKEND_ModelInstanceInitialize + 153\n17      0x7f83afbd8619 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1619) [0x7f83afbd8619]\n18      0x7f83afbd90a2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20a2) [0x7f83afbd90a2]\n19      0x7f83afbbecc3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cc3) [0x7f83afbbecc3]\n20      0x7f83afbbf074 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188074) [0x7f83afbbf074]\n21      0x7f83afbc865d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19165d) [0x7f83afbc865d]\n22      0x7f83af04cec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7f83af04cec3]\n23      0x7f83afbb5ee2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17eee2) [0x7f83afbb5ee2]\n24      0x7f83afbc3dac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cdac) [0x7f83afbc3dac]\n25      0x7f83afbc7de2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190de2) [0x7f83afbc7de2]\n26      0x7f83afcc7ca1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x290ca1) [0x7f83afcc7ca1]\n27      0x7f83afccaffc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293ffc) [0x7f83afccaffc]\n28      0x7f83afe276f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f06f5) [0x7f83afe276f5]\n29      0x7f83af392db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f83af392db4]\n30      0x7f83af047a94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7f83af047a94]\n31      0x7f83af0d4a34 __clone + 68"
E0130 02:02:39.663010 527 model_lifecycle.cc:654] "failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mMemPool->getPool(), mCudaStream->get()): out of memory (/workspace/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:125)\n1       0x7f82d92ba69a void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 138\n2       0x7f82db02ae96 tensorrt_llm::runtime::BufferManager::gpu(unsigned long, nvinfer1::DataType) const + 534\n3       0x7f82db0396af tensorrt_llm::runtime::DecodingLayerWorkspace::DecodingLayerWorkspace(std::shared_ptr<tensorrt_llm::runtime::BufferManager>, tensorrt_llm::layers::DecoderDomain const&, nvinfer1::DataType, unsigned long) + 1391\n4       0x7f82db083481 tensorrt_llm::runtime::GptDecoder<float>::GptDecoder(tensorrt_llm::executor::DecodingMode const&, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, std::shared_ptr<tensorrt_llm::runtime::CudaStream> const&, std::shared_ptr<tensorrt_llm::runtime::SpeculativeDecodingModule const>) + 705\n5       0x7f82db08fcfb tensorrt_llm::runtime::GptDecoderBatched::setup(tensorrt_llm::executor::DecodingMode const&, int, int, int, int, int, int, nvinfer1::DataType, tensorrt_llm::runtime::ModelConfig const&) + 4123\n6       0x7f82db578af9 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::createDecoder(std::optional<tensorrt_llm::executor::DecodingMode> const&) + 825\n7       0x7f82db58dacf tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 3007\n8       0x7f82db51151e tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 526\n9       0x7f82db628029 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185\n10      0x7f82db6286bd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229\n11      0x7f82db62990a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474\n12      0x7f82db60f757 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, 
tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87\n13      0x7f83a43d538e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3238e) [0x7f83a43d538e]\n14      0x7f83a43d1c39 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185\n15      0x7f83a43d2182 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n16      0x7f83a43bf319 TRITONBACKEND_ModelInstanceInitialize + 153\n17      0x7f83afbd8619 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1619) [0x7f83afbd8619]\n18      0x7f83afbd90a2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20a2) [0x7f83afbd90a2]\n19      0x7f83afbbecc3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cc3) [0x7f83afbbecc3]\n20      0x7f83afbbf074 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188074) [0x7f83afbbf074]\n21      0x7f83afbc865d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19165d) [0x7f83afbc865d]\n22      0x7f83af04cec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7f83af04cec3]\n23      0x7f83afbb5ee2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17eee2) [0x7f83afbb5ee2]\n24      0x7f83afbc3dac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cdac) [0x7f83afbc3dac]\n25      0x7f83afbc7de2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190de2) [0x7f83afbc7de2]\n26      0x7f83afcc7ca1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x290ca1) [0x7f83afcc7ca1]\n27      0x7f83afccaffc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293ffc) [0x7f83afccaffc]\n28      0x7f83afe276f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f06f5) [0x7f83afe276f5]\n29      0x7f83af392db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f83af392db4]\n30      0x7f83af047a94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7f83af047a94]\n31      0x7f83af0d4a34 __clone + 68"
I0130 02:02:39.663079 527 model_lifecycle.cc:789] "failed to load 'tensorrt_llm'"
E0130 02:02:39.663208 527 model_repository_manager.cc:703] "Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mMemPool->getPool(), mCudaStream->get()): out of memory (/workspace/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:125)\n1       0x7f82d92ba69a void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 138\n2       0x7f82db02ae96 tensorrt_llm::runtime::BufferManager::gpu(unsigned long, nvinfer1::DataType) const + 534\n3       0x7f82db0396af tensorrt_llm::runtime::DecodingLayerWorkspace::DecodingLayerWorkspace(std::shared_ptr<tensorrt_llm::runtime::BufferManager>, tensorrt_llm::layers::DecoderDomain const&, nvinfer1::DataType, unsigned long) + 1391\n4       0x7f82db083481 tensorrt_llm::runtime::GptDecoder<float>::GptDecoder(tensorrt_llm::executor::DecodingMode const&, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, std::shared_ptr<tensorrt_llm::runtime::CudaStream> const&, std::shared_ptr<tensorrt_llm::runtime::SpeculativeDecodingModule const>) + 705\n5       0x7f82db08fcfb tensorrt_llm::runtime::GptDecoderBatched::setup(tensorrt_llm::executor::DecodingMode const&, int, int, int, int, int, int, nvinfer1::DataType, tensorrt_llm::runtime::ModelConfig const&) + 4123\n6       0x7f82db578af9 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::createDecoder(std::optional<tensorrt_llm::executor::DecodingMode> const&) + 825\n7       0x7f82db58dacf tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 3007\n8       0x7f82db51151e tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 526\n9       0x7f82db628029 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185\n10      0x7f82db6286bd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229\n11      0x7f82db62990a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig 
const&) + 2474\n12      0x7f82db60f757 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87\n13      0x7f83a43d538e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3238e) [0x7f83a43d538e]\n14      0x7f83a43d1c39 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185\n15      0x7f83a43d2182 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n16      0x7f83a43bf319 TRITONBACKEND_ModelInstanceInitialize + 153\n17      0x7f83afbd8619 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1619) [0x7f83afbd8619]\n18      0x7f83afbd90a2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20a2) [0x7f83afbd90a2]\n19      0x7f83afbbecc3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cc3) [0x7f83afbbecc3]\n20      0x7f83afbbf074 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188074) [0x7f83afbbf074]\n21      0x7f83afbc865d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19165d) [0x7f83afbc865d]\n22      0x7f83af04cec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7f83af04cec3]\n23      0x7f83afbb5ee2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17eee2) [0x7f83afbb5ee2]\n24      0x7f83afbc3dac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cdac) [0x7f83afbc3dac]\n25      0x7f83afbc7de2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190de2) [0x7f83afbc7de2]\n26      0x7f83afcc7ca1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x290ca1) [0x7f83afcc7ca1]\n27      0x7f83afccaffc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293ffc) [0x7f83afccaffc]\n28      0x7f83afe276f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f06f5) [0x7f83afe276f5]\n29      0x7f83af392db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f83af392db4]\n30      0x7f83af047a94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7f83af047a94]\n31      0x7f83af0d4a34 __clone + 68;"
I0130 02:02:39.663334 527 server.cc:604] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0130 02:02:39.663366 527 server.cc:631] 
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                    |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/trit |
|             |                                                                 | onserver/backends","min-compute-capability":"6.000000","shm-region-prefix |
|             |                                                                 | -name":"prefix0_","default-max-batch-size":"4"}}                          |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/trit |
|             |                                                                 | onserver/backends","min-compute-capability":"6.000000","default-max-batch |
|             |                                                                 | -size":"4"}}                                                              |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------+

I0130 02:02:39.663434 527 server.cc:674] 
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------+
| Model            | Version | Status                                                                                                                       |
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------+
| postprocessing   | 1       | READY                                                                                                                        |
| preprocessing    | 1       | READY                                                                                                                        |
| tensorrt_llm     | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in ::cuda |
|                  |         | MallocAsync(ptr, n, mMemPool->getPool(), mCudaStream->get()): out of memory (/workspace/tensorrt_llm/cpp/tensorrt_llm/runtim |
|                  |         | e/tllmBuffers.h:125)                                                                                                         |
|                  |         | 3       0x7f82db0396af tensorrt_llm::runtime::DecodingLayerWorkspace::DecodingLayerWorkspace(std::shared_ptr<tensorrt_llm::r |
|                  |         | untime::BufferManager>, tensorrt_llm::layers::DecoderDomain const&, nvinfer1::DataType, unsigned long) + 1391                |
|                  |         | 5       0x7f82db08fcfb tensorrt_llm::runtime::GptDecoderBatched::setup(tensorrt_llm::executor::DecodingMode const&, int, int |
|                  |         | , int, int, int, int, nvinfer1::DataType, tensorrt_llm::runtime::ModelConfig const&) + 4123                                  |
|                  |         | 7       0x7f82db58dacf tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr |
|                  |         | <nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::run |
|                  |         | time::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 3007                          |
|                  |         | 10      0x7f82db6286bd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const |
|                  |         | &, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::Gp |
|                  |         | tJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<c |
|                  |         | har, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<ch |
|                  |         | ar, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_tr |
|                  |         | aits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229                               |
|                  |         | 5       0x7f82db08fcfb tensorrt_llm::runtime::GptDecoderBatched::setup(tensorrt_llm::executor::DecodingMode const&, int, int, int, int, int, int, nvinfer1::DataType, tensorrt_llm::runtime::ModelConfig const&) + 4123 |
|                  |         | 6       0x7f82db578af9 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::createDecoder(std::optional<tensorrt_llm::executor::DecodingMode> const&) + 825 |
|                  |         | 7       0x7f82db58dacf tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 3007 |
|                  |         | 8       0x7f82db51151e tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 526 |
|                  |         | 9       0x7f82db628029 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185 |
|                  |         | 10      0x7f82db6286bd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229 |
|                  |         | 11      0x7f82db62990a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474 |
|                  |         | 12      0x7f82db60f757 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87 |
|                  |         | 13      0x7f83a43d538e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3238e) [0x7f83a43d538e]            |
|                  |         | 14      0x7f83a43d1c39 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185 |
|                  |         | 15      0x7f83a43d2182 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66 |
|                  |         | 16      0x7f83a43bf319 TRITONBACKEND_ModelInstanceInitialize + 153                                                           |
|                  |         | 17      0x7f83afbd8619 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1619) [0x7f83afbd8619]                           |
|                  |         | 18      0x7f83afbd90a2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20a2) [0x7f83afbd90a2]                           |
|                  |         | 19      0x7f83afbbecc3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cc3) [0x7f83afbbecc3]                           |
|                  |         | 20      0x7f83afbbf074 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188074) [0x7f83afbbf074]                           |
|                  |         | 21      0x7f83afbc865d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19165d) [0x7f83afbc865d]                           |
|                  |         | 22      0x7f83af04cec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7f83af04cec3]                                        |
|                  |         | 23      0x7f83afbb5ee2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17eee2) [0x7f83afbb5ee2]                           |
|                  |         | 24      0x7f83afbc3dac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cdac) [0x7f83afbc3dac]                           |
|                  |         | 25      0x7f83afbc7de2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190de2) [0x7f83afbc7de2]                           |
|                  |         | 26      0x7f83afcc7ca1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x290ca1) [0x7f83afcc7ca1]                           |
|                  |         | 27      0x7f83afccaffc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293ffc) [0x7f83afccaffc]                           |
| tensorrt_llm_bls | 1       | READY                                                                                                                        |
+------------------+---------+------------------------------------------------------------------------------------------------------------------------------+

I0130 02:02:39.690771 527 metrics.cc:890] "Collecting metrics for GPU 0: NVIDIA A100 80GB PCIe"
I0130 02:02:39.698015 527 metrics.cc:783] "Collecting CPU metrics"
I0130 02:02:39.698329 527 tritonserver.cc:2598] 
+----------------------------------+------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                  |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                 |
| server_version                   | 2.53.0                                                                                                                 |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration syste |
|                                  | m_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging                              |
| model_repository_path[0]         | /triton_model_repo                                                                                                     |
| model_control_mode               | MODE_NONE                                                                                                              |
| strict_model_config              | 1                                                                                                                      |
| model_config_name                |                                                                                                                        |
| rate_limit                       | OFF                                                                                                                    |
| pinned_memory_pool_byte_size     | 268435456                                                                                                              |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                               |
| min_supported_compute_capability | 6.0                                                                                                                    |
| strict_readiness                 | 1                                                                                                                      |
| exit_timeout                     | 30                                                                                                                     |
| cache_enabled                    | 0                                                                                                                      |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------+

I0130 02:02:39.698374 527 server.cc:305] "Waiting for in-flight requests to complete."
I0130 02:02:39.698388 527 server.cc:321] "Timeout 30: Found 0 model versions that have in-flight inferences"
I0130 02:02:39.699085 527 server.cc:336] "All models are stopped, unloading models"
I0130 02:02:39.699099 527 server.cc:345] "Timeout 30: Found 3 live models and 0 in-flight non-inference requests"
I0130 02:02:40.699248 527 server.cc:345] "Timeout 29: Found 3 live models and 0 in-flight non-inference requests"
Cleaning up...
Cleaning up...
Cleaning up...
I0130 02:02:41.059079 527 model_lifecycle.cc:636] "successfully unloaded 'tensorrt_llm_bls' version 1"
I0130 02:02:41.116831 527 model_lifecycle.cc:636] "successfully unloaded 'postprocessing' version 1"
I0130 02:02:41.246758 527 model_lifecycle.cc:636] "successfully unloaded 'preprocessing' version 1"
I0130 02:02:41.699404 527 server.cc:345] "Timeout 28: Found 0 live models and 0 in-flight non-inference requests"
error: creating server: Internal - failed to load all models
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[23670,1],0]
  Exit code:    1
--------------------------------------------------------------------------

When I use the basic run command from https://github.com/NVIDIA/TensorRT-LLM/blob/v0.16.0/examples/qwen/README.md#run on the same machine, it works:

root@nishant-a100-test3:/tensorrtllm_backend/tensorrt_llm/examples/qwen# python3 ../run.py --input_text "why is the sky blue?" --max_output_len=100 --tokenizer_dir ./tmp/Qwen/32B/ --engine_dir=/engines/DeepSeek-R1-Distill-Qwen-32B
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[01/30/2025-02:28:43] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (131072) * 64
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 62651 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1476.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 62556 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 62556 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 62556 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 2. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +0, now: CPU 1, GPU 62556 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 3. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 1, GPU 62556 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 4. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 1, GPU 62556 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 5. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 3.61 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.38 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.15 GiB, available: 15.22 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 877
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][WARNING] maxAttentionWindow and maxSequenceLen are too large for at least one sequence to fit in kvCache. they are reduced to 56128
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 877
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 13.70 GiB for max tokens in paged KV cache (56128).
[01/30/2025-02:29:22] [TRT-LLM] [I] Load engine takes: 39.42486357688904 sec
Input [Text 0]: "<|begin▁of▁sentence|><|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
why is the sky blue?<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: "The sky appears blue because of a phenomenon called Rayleigh scattering. When sunlight reaches Earth's atmosphere, it interacts with molecules and small particles in the air. Sunlight is made up of different colors, each with its own wavelength. Blue light has a shorter wavelength compared to other colors like red or orange.

As sunlight passes through the atmosphere, the shorter blue wavelengths are scattered in all directions by the molecules and particles, primarily nitrogen and oxygen. This scattering is much more effective for blue light than for the"
[TensorRT-LLM][INFO] Refreshed the MPI local session

Why does the model run out of CUDA memory and fail to load under Triton Server, but work with standalone TensorRT-LLM? I checked the model size: it is ~65 GB, so it should fit on an A100 GPU with 80 GB of memory.
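
(To see where the memory goes while the model is loading, GPU memory can be polled from a second shell; a minimal sketch using standard nvidia-smi query flags:)

# Poll total/used/free GPU memory once per second while launch_triton_server.py is loading.
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv -l 1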

Additional notes

I have read in a few places that setting kv_cache_free_gpu_mem_fraction helps limit overall memory usage, and that it defaults to a high value (0.9). I changed it to 0.1 and retried, but I get the same error as above.
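
For reference, that value can be set through the same fill_template.py used above; this is a sketch, and the parameter name should be checked against the config.pbtxt template in your tensorrtllm_backend checkout:

# Sketch: set the KV-cache fraction in the tensorrt_llm model config
# (parameter name assumed from the v0.16 inflight_batcher_llm template).
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt kv_cache_free_gpu_mem_fraction:0.1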

kelkarn added the bug (Something isn't working) label Jan 30, 2025
kelkarn changed the title from "DeepSeek-R1-Distill-Qwen-32B FP16 model does not work with Triton server + tensorrtllm_backend (but works with just TensorRT-LLM)" to "DeepSeek-R1-Distill-Qwen-32B FP16 model does not work with Triton server + tensorrtllm_backend (but it works with just TensorRT-LLM)" Jan 30, 2025
kelkarn commented Jan 30, 2025

Also, as a note: I tried an INT4-quantized version of the same model; it comes to ~19 GB, and I was able to serve that model successfully from Triton 24.12.
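
(For anyone following along: one way to produce such an INT4 engine is TensorRT-LLM's AWQ quantization example. The commands below are a sketch with assumed paths and flags, not necessarily the exact ones I ran; see examples/quantization in the v0.16.0 tree.)

# Sketch: produce an INT4-AWQ checkpoint, then build the engine as before (paths are assumptions).
python3 ../quantization/quantize.py --model_dir ./tmp/Qwen/32B/ --dtype float16 --qformat int4_awq --output_dir ./tllm_checkpoint_1gpu_int4_awq
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int4_awq --output_dir ./tmp/qwen/32B/trt_engines/int4_awq/1-gpu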

kelkarn commented Jan 30, 2025

It looks like the issue is the one described in NVIDIA/TensorRT-LLM#260.

I reduced max_batch_size to 1 (the default is 2048), and I am now able to load and use the full FP16, non-quantized model in Triton.

Here is the trtllm-build command I used (the checkpoint conversion command is the same as above):

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 --output_dir ./tmp/qwen/32B/trt_engines/fp16/1-gpu --gemm_plugin float16 --max_input_len 16384 --max_batch_size 1 --max_beam_width 3 --reduce_fusion enable --use_paged_context_fmha enable --multiple_profiles enable

@byshiue @juney-nvidia - could we document, in the tensorrtllm_backend README, the effect of max_batch_size on the overall working memory Triton uses for these models? It seems to be an important consideration and a key difference between running the model in Triton and running it via run.py. I have seen this OOM error before, and it has always been confusing when models that are expected to fit on an A100 fail to load.
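
A rough back-of-envelope from the logs above illustrates the effect (numbers are taken from this issue's output and rounded; the exact allocator behaviour is my assumption):

# Approximate memory picture before the decoder / KV-cache allocations (MiB):
TOTAL_MIB=81920      # A100 80GB
ENGINE_MIB=62651     # "Loaded engine size: 62651 MiB"
CONTEXT_MIB=1476     # execution context memory
BUFFERS_MIB=6828     # ~7.16 GB runtime buffers at the engine default max_batch_size=2048
echo "headroom: $(( TOTAL_MIB - ENGINE_MIB - CONTEXT_MIB - BUFFERS_MIB )) MiB"   # ~10965 MiB (~10.7 GiB)
# The decoder sampling workspaces created next also scale with max_batch_size
# (times vocabulary size), which appears to be where cudaMallocAsync fails; at
# max_batch_size=1 the runtime buffers drop to ~3.6 MB and the remaining memory
# is left for the paged KV cache.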
