
Major bug: Request always returns after 60 seconds even though it's not done generating #396

@casper-hansen

Description

Describe the bug
Long-running requests do not work on the RunPod platform when using the vLLM worker.

To Reproduce
Deploy a model and try to generate a long output. Once the request passes the 60-second mark, the stream always stops, even though the request is still running and generating more data.

    import asyncio
    import time

    from openai import AsyncOpenAI

    # Placeholder client setup; point base_url at the deployed endpoint.
    client = AsyncOpenAI(base_url="<endpoint_url>", api_key="<api_key>")

    # Placeholder prompt that produces a long output.
    conversation = [{"role": "user", "content": "Write a very long story."}]

    async def main():
        # Call the vLLM server using the AsyncOpenAI client
        response = await client.chat.completions.create(
            model="<model_id>",
            messages=conversation,
            max_tokens=4096 * 8,
            temperature=0.0,
            stream=True,
        )

        start_time = time.time()

        async for partial in response:
            if partial.choices:  # guard empty chunks; delta.content is None on the final chunk
                print(partial.choices[0].delta.content or "", end="")

        print(f"\n\nelapsed: {time.time() - start_time}")

    asyncio.run(main())
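To confirm the stream is being cut off rather than finishing normally, one diagnostic (not part of the original report, a minimal sketch) is to track finish_reason on the streamed chunks. The OpenAI-compatible streaming API sets finish_reason to "stop" or "length" on the final chunk of a completed request, so a stream that ends with finish_reason still None was likely dropped mid-generation. Replacing the loop above:

    finish_reason = None

    async for partial in response:
        if not partial.choices:
            continue  # skip chunks without choices
        choice = partial.choices[0]
        print(choice.delta.content or "", end="")
        if choice.finish_reason is not None:
            finish_reason = choice.finish_reason

    # "stop" or "length" means the server ended the request itself;
    # None suggests the connection was cut mid-generation.
    print(f"\n\nfinish_reason: {finish_reason!r}")

If the bug is present, this should print finish_reason: None at roughly the 60-second mark.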

Expected behavior
The stream should not stop after 60 seconds; it should continue until the model finishes generating or max_tokens is reached.
