test: intermittent failures from vllm tests on LSF cluster #699

@planetf1

Description

I'm seeing intermittent failures from the vllm tests on an LSF cluster when run with:

uv run --all-extras --all-groups pytest --isolate-heavy -v

For example:

==== 723 passed, 142 skipped, 2 xfailed, 90 warnings in 1572.83s (0:26:12) =====

when everything works, and

FAILED test/backends/test_openai_vllm.py::test_instruct - openai.NotFoundErro...
FAILED test/backends/test_openai_vllm.py::test_multiturn - openai.NotFoundErr...
FAILED test/backends/test_openai_vllm.py::test_chat - openai.NotFoundError: E...
FAILED test/backends/test_openai_vllm.py::test_chat_stream - openai.NotFoundE...
FAILED test/backends/test_openai_vllm.py::test_format - openai.NotFoundError:...
FAILED test/backends/test_openai_vllm.py::test_generate_from_raw - openai.Not...
FAILED test/backends/test_openai_vllm.py::test_generate_from_raw_with_format
= 7 failed, 716 passed, 142 skipped, 2 xfailed, 90 warnings in 1409.38s (0:23:29) =

at other times.

From repeated runs, the failure rate appears to be roughly 50-75%.

On further investigation, the underlying error in all of these cases is:

E               openai.NotFoundError: Error code: 404 - {'error': {'message': 'The model `ibm-granite/granite-4.0-micro` does not exist.', 'type': 'NotFoundError', 'param': None, 'code': 404}}

Question to pursue -- How is the vllm server initialized when tests are run with uv on a GPU-enabled cluster? Clearly we sometimes get access to a vllm environment with the right model, and other times we don't.
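One way to narrow this down would be to ask the server what it is actually serving before the tests run. Below is a minimal sketch assuming vLLM's OpenAI-compatible /v1/models endpoint; the base URL and helper names are illustrative, not taken from the test suite:

```python
import json
from urllib.request import urlopen


def served_models(base_url: str) -> set[str]:
    """Query the OpenAI-compatible /v1/models endpoint and return served model ids."""
    with urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        payload = json.load(resp)
    return {m["id"] for m in payload.get("data", [])}


def model_available(payload: dict, model_id: str) -> bool:
    """Check a /v1/models response payload for a specific model id."""
    return any(m["id"] == model_id for m in payload.get("data", []))


# Example /v1/models response shape (hypothetical, for illustration):
sample = {"object": "list",
          "data": [{"id": "ibm-granite/granite-4.0-micro", "object": "model"}]}
print(model_available(sample, "ibm-granite/granite-4.0-micro"))  # True when served
```

A pytest fixture could call `served_models()` once per session and `pytest.skip` (or fail fast with a clear message) when `ibm-granite/granite-4.0-micro` is missing, which would distinguish "server up but wrong model" from genuine test failures.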
