Demonstrate Continual Learning using OpenClaw-RL

# Continual Learning using OpenClaw-RL and OpenRL

## Objective
Demonstrate continuous reinforcement learning from user interaction. The goal is to show that the model can learn and adapt to feedback in real-time while being used, turning natural conversation into training signals.

## Self-Judge Model Selection and Limitations
The self-judge approach is chosen to enable reinforcement learning within resource-constrained environments, but it introduces specific limitations for this OpenRL implementation:

1. **LoRA Restriction:** The system relies on LoRA (Low-Rank Adaptation) to share base model weights between the policy and judge roles. This means full-parameter fine-tuning is not supported, which may limit the depth of learning for highly complex tasks.
2. **Self-Evaluation Bias:** Using the same base model family for both generation and judgment can lead to sycophancy. The model may be less effective at identifying its own subtle errors compared to using a larger, independent judge model.
3. **Infrastructure Dependency:** While it avoids loading full duplicate models, it relies on the self-hosted Tinker setup to efficiently manage multiple client sessions and LoRA weights, which can still be a bottleneck on limited hardware.

## Prerequisites
1. Self-hosted Tinker infrastructure running and accessible.
2. Tinker API Key configured.
3. Client application capable of communicating with an OpenAI-compatible API.

## Tinker Host Configuration
Since you are using a self-hosted implementation of Tinker, you will need to configure the client to point to your specific host. This is typically done by setting an environment variable expected by the Tinker SDK (such as `TINKER_HOST` or `TINKER_BASE_URL`), in addition to the `TINKER_API_KEY`. Please consult your self-hosted Tinker documentation for the exact variable name required.

## Execution Steps
*Assumption: The `openclaw-rl` repository is checked out and you are in the root directory.*

1. Navigate to the Tinker adapter directory:
```bash
cd openclaw-tinker
```

2. Set the Tinker API Key:
```bash
export TINKER_API_KEY="your-tinker-api-key"
```

3. Launch the server:
Run the script using the combined method. By omitting the --teacher-model-name flag in this command, the system automatically defaults to using the same model family for both roles.
```bash
python run.py --method combine --model-name Qwen/Qwen3-8B --batch-size 16
```
This starts a local OpenAI-compatible proxy server at http://localhost:30000/v1.

4. Configure the Client:
Point the client frontend to the server URL: http://localhost:30000/v1.

5. Interaction:
Initiate interaction with the model. The system will collect feedback and perform training in the background on the self-hosted Tinker setup.

## Ideas to Try
Here are some ways to extend this implementation:

1. **Leverage Built-in Context (OPD):** The On-Policy Distillation method automatically generates "hindsight hints" from the outcome of an action and uses them as additional context for evaluation.
2. **Custom Prompt Context:** To add a custom rubric or guidelines for evaluation, you can modify `openclaw-tinker/scorers.py` to inject custom instructions into the prompt template before it calls `tokenizer.apply_chat_template`.
3. **Custom Reward Models:** The core framework supports a `--custom-rm-path` flag (when run via the local `slime` directory) allowing you to plug in custom reward logic with arbitrary external context.
4. **External Reward Model via API:** To completely offload the reward computation and avoid local VRAM constraints, you can modify `openclaw-tinker/scorers.py` to call an external model API (like OpenAI or Fireworks) for judging and scoring instead of relying on the Tinker sampling client.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Demonstrate Continual Learning using OpenClaw-RL #91

Continual Learning using OpenClaw-RL and OpenRL

Objective

Self-Judge Model Selection and Limitations

Prerequisites

Tinker Host Configuration

Execution Steps

Ideas to Try

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Demonstrate Continual Learning using OpenClaw-RL #91

Description

Continual Learning using OpenClaw-RL and OpenRL

Objective

Self-Judge Model Selection and Limitations

Prerequisites

Tinker Host Configuration

Execution Steps

Ideas to Try

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions