Describe the bug
Long-running requests do not work on the RunPod platform when using the vLLM worker.
To Reproduce
Deploy a model and try to generate a long output. Once the request passes the 60-second mark, the stream always stops, even though the request is still running and the worker is still generating data.
import time

# `client` is an AsyncOpenAI instance pointed at the RunPod endpoint and
# `conversation` is a list of chat messages (see the sketch below).

# Call the vLLM server using the AsyncOpenAI client
response = await client.chat.completions.create(
    model="<model_id>",
    messages=conversation,
    max_tokens=4096 * 8,
    temperature=0.0,
    stream=True,
)

# Stream the response and time how long it runs before it stops.
start_time = time.time()
async for partial in response:
    print(partial.choices[0].delta.content, end="")
print(f"\n\nelapsed: {time.time() - start_time}")
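For reference, a minimal sketch of how the client and conversation above can be set up; the endpoint ID, API key, and prompt are placeholders, and the base URL assumes the OpenAI-compatible route exposed by the RunPod vLLM worker:

from openai import AsyncOpenAI

# Placeholders: substitute your own endpoint ID, API key, and prompt.
client = AsyncOpenAI(
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",
    api_key="<runpod_api_key>",
)
conversation = [
    {"role": "user", "content": "Write a very long story."},
]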
Expected behavior
The stream should not stop after 60 seconds; it should continue until the model finishes generating or max_tokens is reached.