This project is a customized version of verl (Volcano Engine Reinforcement Learning) for training large language models on IoT device skill tasks. The main modification is a custom reward function designed specifically for evaluating IoT device operation outputs.
This project focuses on training LLMs to understand and generate IoT device control specifications. The model learns to convert natural language queries into structured device control instructions, including device identification, action specification, and parameter configuration.
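For illustration, a single natural language query might map to a structured action like the following. The `device_id` / `spec_id` / `value` field names are the ones this README describes; the concrete identifiers and values here are hypothetical, not taken from the actual training data:

```python
import json

# Hypothetical example: "Set the living room AC to 26 degrees".
# The exact schema is defined by the training data; this sketch only
# illustrates the device_id / spec_id / value structure described above.
instruction = [
    {
        "actions": [
            {
                "device_id": "ac_livingroom_01",  # which device to control
                "spec_id": "target-temperature",  # which capability to set
                "value": 26,                      # the parameter value
            }
        ]
    }
]

print(json.dumps(instruction, ensure_ascii=False))
```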
The primary modification is the custom reward function located at `verl/utils/reward_score/iot_skill_reward.py`. This reward function evaluates model outputs based on:

1. **Format Validation**
   - Think Mode: Validates the presence and correctness of `<think>` and `<instruction>` tags
   - Non-Think Mode: Ensures the output does not contain thinking tags
   - Checks for proper JSON structure and bracket symmetry

2. **Content Correctness**
   - Validates the JSON format of the instruction content
   - Compares device IDs, spec IDs, and values between predicted and ground truth outputs
   - Provides granular scoring for partial matches (e.g., correct device_id but wrong spec_id)

3. **Consistency Checking**
   - Verifies consistency between the thinking process and the action output (device ID matching)
   - Ensures the reasoning process aligns with the final instruction
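A minimal sketch of the format-validation step for think mode might look like this. The tag names come from this README, but the helper names and the exact checks are assumptions; the real implementation in `iot_skill_reward.py` may differ:

```python
import re

def check_think_format(output: str) -> bool:
    """Hypothetical sketch: require a <think>...</think> block
    followed by an <instruction>...</instruction> block."""
    match = re.fullmatch(
        r"\s*<think>(.*?)</think>\s*<instruction>(.*?)</instruction>\s*",
        output,
        flags=re.DOTALL,
    )
    return match is not None

def brackets_symmetric(s: str) -> bool:
    """Check that [] and {} open and close in matching order (JSON-style)."""
    pairs = {"]": "[", "}": "{"}
    stack = []
    for ch in s:
        if ch in "[{":
            stack.append(ch)
        elif ch in "]}":
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack

print(check_think_format("<think>reason</think><instruction>[{}]</instruction>"))  # True
print(brackets_symmetric("[{\"actions\": []}]"))  # True
```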
The reward function supports two modes:

1. **Think Mode** (`enhanced_think_rewards`):
   - Format reward: up to 2.0 points for a complete format with both thinking and instruction
   - Correctness reward: up to 1.5 points for an exact match, 1.0 for a correct device_id, 0.3 for a correct spec_id/value
   - Consistency reward: +0.5 points for think-action consistency

2. **Non-Think Mode** (`enhanced_non_think_rewards`):
   - Format reward: 3.0 points for correct format (no thinking tags)
   - Bracket symmetry check: +0.5/-0.5 points
   - Correctness reward: same as Think Mode
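Using the think-mode point values listed above, the reward could be composed roughly as follows. This is a sketch under the stated assumptions: the actual `enhanced_think_rewards` logic in `iot_skill_reward.py` may weight and combine these terms differently:

```python
def think_mode_reward(format_ok: bool,
                      exact_match: bool,
                      device_id_ok: bool,
                      spec_or_value_ok: bool,
                      think_consistent: bool) -> float:
    """Hypothetical composition of the think-mode point values above."""
    reward = 0.0
    if format_ok:
        reward += 2.0          # complete <think> + <instruction> format
    if exact_match:
        reward += 1.5          # actions fully match the ground truth
    elif device_id_ok:
        reward += 1.0          # right device, wrong spec/value
    elif spec_or_value_ok:
        reward += 0.3          # partial credit for spec_id/value
    if think_consistent:
        reward += 0.5          # reasoning aligns with the final action
    return reward

print(think_mode_reward(True, True, True, True, True))   # 4.0
```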
The project includes training data for various IoT device categories:
- Air Conditioner (空调)
- Air Purifier (空气净化器)
- Air Quality Monitor (空气检测仪)
- Camera (摄像头)
- Dehumidifier (除湿机)
- Dishwasher (洗碗机)
- Fan (风扇)
- Humidifier (加湿器)
- Light (灯)
- Projector (投影仪)
- Robot Vacuum (扫地机器人)
- Smart Mirror (智能镜)
- Switch (开关)
- TV (电视)
- Washing Machine (洗衣机)
- And many more...
Data files are located in the data/ directory, organized by device category.
This project is based on verl. Please refer to the verl documentation for base installation instructions.
Install the required dependencies:
```bash
pip install -r requirements.txt
```

For CUDA support:

```bash
pip install -r requirements-cuda.txt
```

For NPU support:

```bash
pip install -r requirements-npu.txt
```

The reward function is automatically integrated into verl's training pipeline. When `data_source="skill"` is set in your configuration, the custom reward function will be used.
```yaml
data:
  reward_fn_key: "data_source"
  data_source: "skill"

reward:
  custom_reward_function:
    path: "verl/utils/reward_score/iot_skill_reward.py"
    name: "compute_score"
```

The reward function can also be called directly:

```python
from verl.utils.reward_score.iot_skill_reward import compute_score

# Think mode
reward = compute_score(
    solution_str="<think>...</think><instruction>...</instruction>",
    ground_truth="<instruction>...</instruction>",
    extra_info={"think": "think"}
)

# Non-think mode
reward = compute_score(
    solution_str="[{\"actions\": [...]}]",
    ground_truth="[{\"actions\": [...]}]",
    extra_info={}
)
```

The repository is organized as follows:

```
iot_spec_llm/
├── data/                 # Training data for various IoT devices
├── verl/                 # Modified verl framework
│   └── utils/
│       └── reward_score/
│           └── iot_skill_reward.py   # Custom reward function
├── recipe/               # Training recipes (from verl)
└── requirements*.txt     # Dependencies
```
Compared to the original verl project, this repository includes:

1. **Custom Reward Function** (`verl/utils/reward_score/iot_skill_reward.py`):
   - Format validation for IoT device control outputs
   - Structured content comparison (device_id, spec_id, value)
   - Think-action consistency checking
   - Detailed error logging and reporting

2. **IoT Device Data**: Training datasets for various smart home devices
All other components remain largely unchanged from the original verl framework.
This project inherits the Apache 2.0 license from verl. See the original verl repository for license details.
This project is based on verl by the ByteDance Seed MLSys team. We thank the verl team for their excellent work on reinforcement learning for LLMs.
If you use this project in your research, please cite our paper:
```bibtex
@article{micu,
  title={MiCU: End-to-End Smart Home Command Understanding with Large Language Model},
  author={Han, Haowei and Hu, Kexin and Cai, Weiwei and Zhang, Debiao and Qin, Bin and Wang, Yuxiang and Jiang, Jiawei and Yan, Xiao and Du, Bo},
  year={2026}
}
```