Continual Learning using OpenClaw-RL and OpenRL
Objective
Demonstrate continuous reinforcement learning from user interaction. The goal is to show that the model can learn and adapt to feedback in real-time while being used, turning natural conversation into training signals.
Self-Judge Model Selection and Limitations
The self-judge approach is chosen to enable reinforcement learning within resource-constrained environments, but it introduces specific limitations for this OpenRL implementation:
- LoRA Restriction: The system relies on LoRA (Low-Rank Adaptation) to share base model weights between the policy and judge roles. This means full-parameter fine-tuning is not supported, which may limit the depth of learning for highly complex tasks.
- Self-Evaluation Bias: Using the same base model family for both generation and judgment can lead to sycophancy. The model may be less effective at identifying its own subtle errors compared to using a larger, independent judge model.
- Infrastructure Dependency: While it avoids loading full duplicate models, it relies on the self-hosted Tinker setup to efficiently manage multiple client sessions and LoRA weights, which can still be a bottleneck on limited hardware.
Prerequisites
- Self-hosted Tinker infrastructure running and accessible.
- Tinker API Key configured.
- Client application capable of communicating with an OpenAI-compatible API.
Tinker Host Configuration
Since you are using a self-hosted implementation of Tinker, you will need to configure the client to point to your specific host. This is typically done by setting an environment variable expected by the Tinker SDK (such as TINKER_HOST or TINKER_BASE_URL), in addition to the TINKER_API_KEY. Please consult your self-hosted Tinker documentation for the exact variable name required.
Execution Steps
Assumption: The openclaw-rl repository is checked out and you are in the root directory.
- Navigate to the Tinker adapter directory:
- Set the Tinker API Key:
export TINKER_API_KEY="your-tinker-api-key"
- Launch the server:
Run the script using the combined method. By omitting the --teacher-model-name flag in this command, the system automatically defaults to using the same model family for both roles.
python run.py --method combine --model-name Qwen/Qwen3-8B --batch-size 16
This starts a local OpenAI-compatible proxy server at http://localhost:30000/v1.
-
Configure the Client:
Point the client frontend to the server URL: http://localhost:30000/v1.
-
Interaction:
Initiate interaction with the model. The system will collect feedback and perform training in the background on the self-hosted Tinker setup.
Ideas to Try
Here are some ways to extend this implementation:
- Leverage Built-in Context (OPD): The On-Policy Distillation method automatically generates "hindsight hints" from the outcome of an action and uses them as additional context for evaluation.
- Custom Prompt Context: To add a custom rubric or guidelines for evaluation, you can modify
openclaw-tinker/scorers.py to inject custom instructions into the prompt template before it calls tokenizer.apply_chat_template.
- Custom Reward Models: The core framework supports a
--custom-rm-path flag (when run via the local slime directory) allowing you to plug in custom reward logic with arbitrary external context.
- External Reward Model via API: To completely offload the reward computation and avoid local VRAM constraints, you can modify
openclaw-tinker/scorers.py to call an external model API (like OpenAI or Fireworks) for judging and scoring instead of relying on the Tinker sampling client.
Continual Learning using OpenClaw-RL and OpenRL
Objective
Demonstrate continuous reinforcement learning from user interaction. The goal is to show that the model can learn and adapt to feedback in real-time while being used, turning natural conversation into training signals.
Self-Judge Model Selection and Limitations
The self-judge approach is chosen to enable reinforcement learning within resource-constrained environments, but it introduces specific limitations for this OpenRL implementation:
Prerequisites
Tinker Host Configuration
Since you are using a self-hosted implementation of Tinker, you will need to configure the client to point to your specific host. This is typically done by setting an environment variable expected by the Tinker SDK (such as
TINKER_HOSTorTINKER_BASE_URL), in addition to theTINKER_API_KEY. Please consult your self-hosted Tinker documentation for the exact variable name required.Execution Steps
Assumption: The
openclaw-rlrepository is checked out and you are in the root directory.cd openclaw-tinkerRun the script using the combined method. By omitting the --teacher-model-name flag in this command, the system automatically defaults to using the same model family for both roles.
This starts a local OpenAI-compatible proxy server at http://localhost:30000/v1.
Configure the Client:
Point the client frontend to the server URL: http://localhost:30000/v1.
Interaction:
Initiate interaction with the model. The system will collect feedback and perform training in the background on the self-hosted Tinker setup.
Ideas to Try
Here are some ways to extend this implementation:
openclaw-tinker/scorers.pyto inject custom instructions into the prompt template before it callstokenizer.apply_chat_template.--custom-rm-pathflag (when run via the localslimedirectory) allowing you to plug in custom reward logic with arbitrary external context.openclaw-tinker/scorers.pyto call an external model API (like OpenAI or Fireworks) for judging and scoring instead of relying on the Tinker sampling client.