Describe the bug
Long-running requests do not work on the RunPod platform when using the vLLM worker.
To Reproduce
Deploy a model and try to generate a long output. Once the request passes the 60-second mark, the stream always stops, even though the request is still running and the worker is still generating data.
import time

# `client` is an AsyncOpenAI instance pointed at the RunPod endpoint and
# `conversation` is a list of chat messages (see the sketch below).

# Call the vLLM server using the AsyncOpenAI client
response = await client.chat.completions.create(
    model="<model_id>",
    messages=conversation,
    max_tokens=4096 * 8,
    temperature=0.0,
    stream=True,
)

# Stream the response and time how long it runs before it stops.
start_time = time.time()
async for partial in response:
    print(partial.choices[0].delta.content, end="")
print(f"\n\nelapsed: {time.time() - start_time}")
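For reference, a minimal sketch of how the client and conversation above can be set up; the endpoint ID, API key, and prompt are placeholders, and the base URL assumes the OpenAI-compatible route exposed by the RunPod vLLM worker:

from openai import AsyncOpenAI

# Placeholders: substitute your own endpoint ID, API key, and prompt.
client = AsyncOpenAI(
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",
    api_key="<runpod_api_key>",
)
conversation = [
    {"role": "user", "content": "Write a very long story."},
]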
Expected behavior
The stream should not stop after 60 seconds; it should continue until the model finishes generating or max_tokens is reached.