
Bug: interactive mode can appear hung on first prompt, then exit after cleanup with misleading Goodbye #54

@zhaojava42-dot

Description


Summary

In standalone interactive mode (openspace without --query), the first user request can appear to hang for a long time with no visible output. In some runs, the process then exits after cleanup and prints Goodbye!, which looks like an unexpected shutdown rather than a handled task failure.

There is no single root cause; two issues are layered together:

  1. The first-request path performs LLM-based skill selection before normal task execution, and this step can be cancelled or take too long in some environments.
  2. When that early phase fails, tool_layer.execute() can throw a secondary exception during cleanup/final result assembly, which hides the original failure.

Environment

  • Project: OpenSpace
  • Mode: standalone CLI, interactive mode
  • Launch command:
openspace
  • User then enters a first prompt such as:
如何使用 opensapce ("How to use opensapce" in English; the misspelling of "openspace" is part of the original prompt)
  • Local model config in openspace/.env used an OpenAI-compatible Zhipu endpoint:
OPENSPACE_MODEL=openai/glm-5.1
OPENSPACE_LLM_API_BASE=https://open.bigmodel.cn/api/coding/paas/v4
OPENSPACE_LLM_API_KEY=...
OPENSPACE_WORKSPACE=/Users/javazhao/OpenSpace
OPENSPACE_BACKEND_SCOPE=shell,mcp,system

Expected behavior

  • The first prompt should either:
    • respond normally, or
    • fail fast with a clear error message
  • The CLI should not appear stuck with no feedback for an extended period
  • Cleanup should not mask the original exception

Actual behavior

  • After entering the first prompt, the CLI can sit in a running state for a long time with no visible answer
  • In failure cases, the original error is obscured by a secondary exception in cleanup/finalization
  • The final Goodbye! message makes the shutdown look intentional, even though the task path failed

Relevant logs

Example failure sequence from interactive mode:

  • skill selection starts before task execution
  • LLM request stalls/cancels
  • cleanup path throws secondary exception

Relevant stack trace:

asyncio.exceptions.CancelledError
...
File "openspace/skill_engine/registry.py", line 466, in select_skills_with_llm
    resp = await llm_client.complete(prompt, **llm_kwargs)
...
File "openspace/tool_layer.py", line 584, in execute
    "execution_time": execution_time,
UnboundLocalError: cannot access local variable 'execution_time' where it is not associated with a value

Relevant log behavior:

UI monitoring started
Task: 如何使用 opensapce...
...
Error: cannot access local variable 'execution_time' where it is not associated with a value

Root cause analysis

1. First-request skill selection is on the critical path

OpenSpace.execute() runs _select_and_inject_skills() before the main task execution path when a skill registry is present.

That means the very first user prompt blocks on LLM-based skill selection before the user sees a real answer.

Relevant code:

  • openspace/tool_layer.py
  • openspace/skill_engine/registry.py

2. Cancellation/timeout in skill selection is not handled explicitly

select_skills_with_llm() catches Exception, but the failure observed here is asyncio.CancelledError.

Since Python 3.8, asyncio.CancelledError subclasses BaseException rather than Exception, so a plain `except Exception` handler never catches it. The cancellation therefore escapes the normal "selection failed, proceed without skills" path.

Relevant code:

resp = await llm_client.complete(prompt, **llm_kwargs)
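The escape can be demonstrated in isolation. In this sketch (hypothetical names; `select_skills` stands in for `select_skills_with_llm`, and `asyncio.sleep` stands in for the stalled `llm_client.complete` call), the `except Exception` fallback never runs:

```python
import asyncio

async def select_skills():
    # Stand-in for select_skills_with_llm(): the await is cancelled
    # while pending, as in the reported stall.
    try:
        await asyncio.sleep(60)  # stands in for llm_client.complete(...)
    except Exception:
        return []  # the "selection failed, proceed without skills" path
    return ["some_skill"]

async def main():
    task = asyncio.create_task(select_skills())
    await asyncio.sleep(0)  # let the task start and reach its await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        # On Python 3.8+ CancelledError subclasses BaseException, so the
        # `except Exception` above never fires and cancellation escapes.
        return "escaped"
    return "handled"

print(asyncio.run(main()))  # → escaped
```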

3. Cleanup/final result path can hide the original error

In tool_layer.execute(), execution_time and result are used later in the function. If an exception is raised early enough, the cleanup/final result path can raise a second exception, which hides the original cause from the user.

Observed secondary exception:

UnboundLocalError: cannot access local variable 'execution_time'
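The masking pattern can be reduced to a few lines (a hypothetical simplification of `tool_layer.execute()`, not the actual code): `execution_time` is only bound on the success path, but finalization reads it unconditionally, so the secondary exception replaces the original one in the traceback shown to the user:

```python
import time

def execute(fail: bool):
    # Hypothetical simplification of tool_layer.execute().
    try:
        start = time.monotonic()
        if fail:
            raise RuntimeError("original failure")  # e.g. cancelled skill selection
        execution_time = time.monotonic() - start
    finally:
        # When fail=True this raises UnboundLocalError, which replaces
        # the RuntimeError above in the user-visible traceback.
        result = {"execution_time": execution_time}
    return result
```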

Why this is confusing for users

  • The terminal appears hung because the first visible work is a hidden skill-selection LLM call
  • The user gets little or no progress feedback in interactive mode
  • The final Goodbye! output suggests a normal exit, even though the real issue happened earlier

Suggested fixes

Fix 1: make skill selection fail fast

Use a dedicated short-timeout LLM client for skill selection instead of reusing the main task LLM client.

This keeps the first prompt responsive.
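One possible shape for this, assuming the `llm_client.complete()` call seen in the trace (the wrapper name and default timeout value are illustrative, not existing OpenSpace API):

```python
import asyncio

async def select_skills_with_timeout(llm_client, prompt, timeout=10.0, **llm_kwargs):
    """Run LLM-based skill selection under a hard deadline so the first
    interactive prompt cannot block indefinitely on a hidden call.
    Returns None on timeout, meaning 'proceed without skills'."""
    try:
        return await asyncio.wait_for(
            llm_client.complete(prompt, **llm_kwargs),
            timeout=timeout,
        )
    except asyncio.TimeoutError:
        # asyncio.wait_for cancels the inner call on timeout, so the
        # stalled request does not linger in the background.
        return None
```

A dedicated client instance with its own shorter connect/read timeouts would serve the same goal; the key point is that the selection deadline is much shorter than the main task's.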

Fix 2: explicitly handle asyncio.CancelledError

Treat cancellation/timeout in skill selection as:

  • log warning
  • skip skills
  • continue with tool-only execution
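A sketch of that policy (the wrapper name is hypothetical; note that in production code you may want to re-raise if the enclosing task itself is being cancelled, since swallowing an externally delivered CancelledError keeps a cancelled task alive):

```python
import asyncio
import logging

logger = logging.getLogger("openspace.skill_engine")

async def select_skills_safely(select_coro):
    """Await a skill-selection coroutine, degrading cancellation and
    errors to 'no skills' instead of letting them escape into cleanup."""
    try:
        return await select_coro
    except asyncio.CancelledError:
        logger.warning("Skill selection cancelled; continuing without skills.")
        return []
    except Exception as exc:
        logger.warning("Skill selection failed (%s); continuing without skills.", exc)
        return []
```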

Fix 3: stabilize cleanup/final result assembly

Initialize result and execution_time before the main try block so cleanup never throws a secondary exception.
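A sketch of the stabilized shape (simplified; variable names follow the trace). With defensive initialization, the finalization path can never throw, so the original exception surfaces:

```python
import time

def execute(fail: bool):
    # result and execution_time exist before any code that can raise,
    # so the finally block can never hit UnboundLocalError.
    result = {"status": "error"}
    execution_time = 0.0
    start = time.monotonic()
    try:
        if fail:
            raise RuntimeError("original failure")  # stands in for the real task
        result = {"status": "ok"}
    finally:
        execution_time = time.monotonic() - start
        result["execution_time"] = execution_time
    return result
```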

Fix 4: improve interactive-mode feedback

Before skill selection starts, print a short status such as:

Selecting relevant skills...

If selection fails:

Skill selection failed, continuing without skills.

Minimal reproduction

  1. Configure standalone OpenSpace with a real LLM endpoint in openspace/.env
  2. Launch:
openspace
  3. Enter:
如何使用 opensapce
  4. Observe the long no-output period on the first prompt
  5. In failure runs, observe the cleanup exit and secondary exception instead of a clear task failure

Additional note

This issue is distinct from provider/model routing issues.

For example, a separate configuration issue exists when using a bare model name like glm-5.1 with LiteLLM against an OpenAI-compatible endpoint. That problem is about provider-qualified model naming. This issue is about interactive-mode robustness and error handling after startup succeeds.

Affected files

  • openspace/tool_layer.py
  • openspace/skill_engine/registry.py