Skip to content

Demonstrate Continual Learning using OpenClaw-RL #91

Description

@droot

Continual Learning using OpenClaw-RL and OpenRL

Objective

Demonstrate continuous reinforcement learning from user interaction. The goal is to show that the model can learn and adapt to feedback in real-time while being used, turning natural conversation into training signals.

Self-Judge Model Selection and Limitations

The self-judge approach is chosen to enable reinforcement learning within resource-constrained environments, but it introduces specific limitations for this OpenRL implementation:

  1. LoRA Restriction: The system relies on LoRA (Low-Rank Adaptation) to share base model weights between the policy and judge roles. This means full-parameter fine-tuning is not supported, which may limit the depth of learning for highly complex tasks.
  2. Self-Evaluation Bias: Using the same base model family for both generation and judgment can lead to sycophancy. The model may be less effective at identifying its own subtle errors compared to using a larger, independent judge model.
  3. Infrastructure Dependency: While it avoids loading full duplicate models, it relies on the self-hosted Tinker setup to efficiently manage multiple client sessions and LoRA weights, which can still be a bottleneck on limited hardware.

Prerequisites

  1. Self-hosted Tinker infrastructure running and accessible.
  2. Tinker API Key configured.
  3. Client application capable of communicating with an OpenAI-compatible API.

Tinker Host Configuration

Since you are using a self-hosted implementation of Tinker, you will need to configure the client to point to your specific host. This is typically done by setting an environment variable expected by the Tinker SDK (such as TINKER_HOST or TINKER_BASE_URL), in addition to the TINKER_API_KEY. Please consult your self-hosted Tinker documentation for the exact variable name required.

Execution Steps

Assumption: The openclaw-rl repository is checked out and you are in the root directory.

  1. Navigate to the Tinker adapter directory:
cd openclaw-tinker
  1. Set the Tinker API Key:
export TINKER_API_KEY="your-tinker-api-key"
  1. Launch the server:
    Run the script using the combined method. By omitting the --teacher-model-name flag in this command, the system automatically defaults to using the same model family for both roles.
python run.py --method combine --model-name Qwen/Qwen3-8B --batch-size 16

This starts a local OpenAI-compatible proxy server at http://localhost:30000/v1.

  1. Configure the Client:
    Point the client frontend to the server URL: http://localhost:30000/v1.

  2. Interaction:
    Initiate interaction with the model. The system will collect feedback and perform training in the background on the self-hosted Tinker setup.

Ideas to Try

Here are some ways to extend this implementation:

  1. Leverage Built-in Context (OPD): The On-Policy Distillation method automatically generates "hindsight hints" from the outcome of an action and uses them as additional context for evaluation.
  2. Custom Prompt Context: To add a custom rubric or guidelines for evaluation, you can modify openclaw-tinker/scorers.py to inject custom instructions into the prompt template before it calls tokenizer.apply_chat_template.
  3. Custom Reward Models: The core framework supports a --custom-rm-path flag (when run via the local slime directory) allowing you to plug in custom reward logic with arbitrary external context.
  4. External Reward Model via API: To completely offload the reward computation and avoid local VRAM constraints, you can modify openclaw-tinker/scorers.py to call an external model API (like OpenAI or Fireworks) for judging and scoring instead of relying on the Tinker sampling client.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions