use guidellm to test vllm benchmark (with litellm proxy), failed #312
-
Trying to use guidellm to benchmark vLLM through a LiteLLM proxy; it fails.

1. Set up the vLLM service.
2. Set up the LiteLLM proxy with `litellm_config.yaml` (a sample config sketch is shown below) and start it: `litellm --config config.yaml --port 4000`
3. Use curl to access the vLLM service and the LiteLLM proxy directly; both work fine.
4. Run guidellm; it fails: `guidellm benchmark --target "http://localhost:4000" --rate-type sweep --max-seconds 30 --data "prompt_tokens=5,output_tokens=2"`
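For reference, a minimal sketch of what the `litellm_config.yaml` might look like; the vLLM base URL and the `hosted_vllm/` provider prefix are assumptions, and the model group name is taken from the curl tests later in this thread:

```yaml
# Hypothetical litellm_config.yaml - adjust model name and api_base to match your vLLM server
model_list:
  - model_name: vllm-model-group-1           # name clients send in the "model" field
    litellm_params:
      model: hosted_vllm/my-served-model     # assumed provider prefix for a self-hosted vLLM endpoint
      api_base: http://localhost:8000/v1     # assumed vLLM OpenAI-compatible URL
      api_key: "dummy"                       # vLLM typically ignores the key unless started with --api-key
```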
-
Tried asking RunLLM; the answer was not helpful.
-
Hi
-
Get detailed debug logs from LiteLLM.
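If it helps, one way to turn on verbose proxy logging (a sketch, assuming the LiteLLM CLI's `--detailed_debug` flag; the lighter `--debug` flag is another option):

```bash
# Restart the proxy with detailed debug output so each incoming/outgoing request is logged
litellm --config config.yaml --port 4000 --detailed_debug
```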
-
Not sure where `max_completion_tokens` comes from? Nothing in the request setup specifies `max_completion_tokens`.
-
LiteLLM supports `max_completion_tokens` - try adding `litellm.drop_params=True`. Ref: https://docs.litellm.ai/docs/completion/input
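`litellm.drop_params=True` is the Python SDK form; when running LiteLLM as a proxy, the equivalent lives in the config file. A sketch, assuming the documented `litellm_settings.drop_params` option:

```yaml
# In config.yaml: ask LiteLLM to silently drop params the backend doesn't accept
litellm_settings:
  drop_params: true
```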
-
Yeah, due to this mess we have to emit both (`max_tokens` and `max_completion_tokens`). See also #210 (comment)
-
Tried curl against LiteLLM with different parameters and get an error:

`curl http://localhost:4000/v1/completions -H "Content-Type: application/json" -d '{"model": "vllm-model-group-1", "prompt": "Test connection", "max_tokens": 2}'`

`curl http://localhost:4000/v1/completions -H "Content-Type: application/json" -d '{"model": "vllm-model-group-1", "prompt": "Test connection", "max_completion_tokens": 2}'`
-
Referred to this doc to set up config.yaml to ignore the parameters; still get the same error.