Added the README and script files for training sql_agent on NPU #272
# agent-lightning x Ascend

We have added support for **Huawei Ascend NPUs** in **agent-lightning**, and provide an example of training a SQL agent on the **Spider dataset**.
## Hardware Support

- Atlas 200T A2 Box16
- Atlas 900 A2 PODc
- Atlas 800T A3

At least **a single 40GB NPU** is required to run the Qwen2.5-Coder-1.5B-Instruct model.
## Environment Setup

### Basic Environment

- Python: 3.11.13
- CANN: 8.2.RC1
- torch: 2.7.1+cpu
- torch_npu: 2.7.1.dev20250724

> For basic environment preparation, please refer to this [document](https://gitcode.com/Ascend/pytorch).
### Configure Mirror Sources

Before installing dependencies, it is recommended to configure pip mirrors:

```
pip config set global.index-url http://repo.huaweicloud.com/repository/pypi/simple
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"
# Mirrors:
# http://repo.huaweicloud.com/repository/pypi/simple
# https://download.pytorch.org/whl/cpu/
# https://mirrors.huaweicloud.com/ascend/repos/pypi
```
### Install vLLM & vLLM-Ascend

```
pip install vllm==0.10.0 --trusted-host repo.huaweicloud.com
pip install vllm-ascend==0.10.0rc1 --trusted-host repo.huaweicloud.com
```
### Install VERL

```
pip install verl==0.5.0
```

> ⚠️ Note: To ensure the VERL framework runs correctly on NPU, add the following two lines to the file `verl/utils/vllm_utils.py`:

```
from vllm_ascend.patch import platform
from vllm_ascend.patch import worker
```
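Those two imports are sufficient because `vllm_ascend.patch` applies its overrides as an import side effect. A toy illustration of that pattern, with a stand-in class and method names (not the real vllm_ascend code):

```python
# Stand-in for a platform class that defaults to a CUDA backend.
class Platform:
    def device_name(self) -> str:
        return "cuda"


def _npu_device_name(self) -> str:
    return "npu"


# A patch module would run an assignment like this at import time,
# which is why merely importing it changes behavior.
Platform.device_name = _npu_device_name

print(Platform().device_name())  # -> npu
```

This is why the fix is just two `import` lines rather than a call into VERL's own code.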
### Install agent-lightning

```
pip install agentlightning==0.2.1
```
### Install Other Dependencies

```
pip install autogen-agentchat autogen-ext mcp
pip install langgraph "langchain[openai]" langchain-community langchain-text-splitters
pip install sqlparse nltk
```
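The last two packages (`sqlparse`, `nltk`) handle SQL and natural-language text. As a rough, stdlib-only sketch of the kind of normalization involved when comparing a generated query against a reference (the helper is illustrative, not the repo's actual code):

```python
import re


def normalize_sql(query: str) -> str:
    """Lowercase a query and collapse whitespace so superficially
    different spellings of the same SQL compare equal."""
    return re.sub(r"\s+", " ", query.strip()).lower()


# Two spellings of the same query normalize to the same string.
print(normalize_sql("SELECT name\nFROM   singer") == normalize_sql("select name from singer"))  # -> True
```

Real evaluation typically goes further (token-level parsing, execution accuracy), which is what `sqlparse` is for.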
## Model

We use the [Qwen2.5-Coder-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B) model to train the SQL agent. Running requires at least **one 40GB NPU**.
## Dataset

We use the Spider 1.0 dataset, which contains about 8,000 samples, including natural language questions, database schemas, and corresponding standard SQL queries.

Training requires the following three Parquet files:

- `train_spider.parquet`
- `test_dev_500.parquet`
- `test_dev.parquet`
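A small sanity check before launching training can save a failed run. This hypothetical helper (the filenames come from the list above; the function itself is not part of the repo) reports which required files are absent:

```python
from pathlib import Path

# The three Parquet files the training script expects, per the list above.
REQUIRED_FILES = ["train_spider.parquet", "test_dev_500.parquet", "test_dev.parquet"]


def missing_dataset_files(data_dir: str) -> list[str]:
    """Return the required Spider Parquet files absent from data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).exists()]
```

For example, `missing_dataset_files("data")` returns an empty list when the directory is fully populated.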
## Training Workflow

1. **Prepare the dataset**: Convert the Spider dataset into Parquet format and place it in the `data/` directory.

2. **Configure the environment**: Ensure vLLM-Ascend, VERL, and agent-lightning are correctly installed.

3. **Start training**: Run the following command to begin training the SQL agent:

```
python train_sql_agent_npu.py npu
```
For the LLaMA profile, export an `HF_TOKEN` before running so VERL can download the model.
```bash
env RAY_DEBUG=legacy HYDRA_FULL_ERROR=1 VLLM_USE_V1=1 ray start --head --dashboard-host=0.0.0.0
```
### Launch Training with NPUs

We have added support for **Huawei Ascend NPUs** in **agent-lightning**, and added a `config_train_npu` function to the training script.
#### Hardware Support

- **Atlas 200T A2 Box16**
- **Atlas 900 A2 PODc**
- **Atlas 800T A3**

At least **a single 40GB NPU** is required to run the **Qwen2.5-Coder-1.5B-Instruct** model.
#### Environment Setup

##### Basic Environment

- **Python:** 3.11.13
- **CANN:** 8.2.RC1
- **torch:** 2.7.1+cpu
- **torch_npu:** 2.7.1.dev20250724

> For basic environment preparation, please refer to this [document](https://gitcode.com/Ascend/pytorch).
##### Configure Mirror Sources

Before installing dependencies, configure the following pip mirrors:

```
pip config set global.index-url http://repo.huaweicloud.com/repository/pypi/simple
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"
```
##### Install vLLM & vLLM-Ascend

```
pip install vllm==0.10.0 --trusted-host repo.huaweicloud.com
pip install vllm-ascend==0.10.0rc1 --trusted-host repo.huaweicloud.com
```
##### Install VERL

```
pip install verl==0.5.0
```

> ⚠️ **Note:** To ensure the VERL framework runs correctly on NPU, add the following lines to
> `verl/utils/vllm_utils.py`:

```
from vllm_ascend.patch import platform
from vllm_ascend.patch import worker
```
##### Install Agent-Lightning

> **Contributor:** There are many duplicates with existing install guides. Please only mention the differences here to make the documentation clear and easy to read.

```
pip install agentlightning==0.2.1
```
##### Install Other Dependencies

```
pip install autogen-agentchat autogen-ext mcp
pip install langgraph "langchain[openai]" langchain-community langchain-text-splitters
pip install sqlparse nltk
```
#### Model

We use the [**Qwen2.5-Coder-1.5B-Instruct**](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B) model to train the SQL agent.
#### Dataset

Refer to the method above for obtaining the dataset.
#### Training Workflow

1. **Prepare the dataset**: Convert the Spider dataset into Parquet format and place it in the `data/` directory.

2. **Configure the environment**: Ensure vLLM-Ascend, VERL, and agent-lightning are correctly installed.

3. **Start training**: Run the following command to begin training the SQL agent:

```
python train_sql_agent_npu.py npu
```
### Debugging the Agent without VERL
```python
def config_train_qwen() -> Dict[str, Any]:
    """A configuration for training with Qwen-2.5B."""
    config = deepcopy(RL_TRAINING_CONFIG)
    return config
```

> **Contributor:** please install pre-commit.
```python
def config_train_npu() -> Dict[str, Any]:
    """A configuration for training with NPU."""
    config = deepcopy(RL_TRAINING_CONFIG)
    del config["actor_rollout_ref"]["rollout"]["engine_kwargs"]["vllm"]["enable_auto_tool_choice"]
    del config["actor_rollout_ref"]["rollout"]["engine_kwargs"]["vllm"]["tool_call_parser"]
    del config["trainer"]["logger"][1]
    config["actor_rollout_ref"]["actor"]["use_torch_compile"] = False
    config["trainer"]["val_before_train"] = False
    config["trainer"]["save_freq"] = 256
    config["trainer"]["device"] = "npu"
    return config
```
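`config_train_npu` follows a derive-by-deepcopy pattern: copy the shared base config, strip the vLLM options that are presumably unsupported in this setup, and retarget the trainer. A self-contained sketch with a toy stand-in for `RL_TRAINING_CONFIG` (the base keys mirror the ones the function touches; the values are illustrative assumptions):

```python
from copy import deepcopy

# Toy stand-in for VERL's RL_TRAINING_CONFIG; values are assumptions.
BASE_CONFIG = {
    "actor_rollout_ref": {
        "rollout": {"engine_kwargs": {"vllm": {"enable_auto_tool_choice": True, "tool_call_parser": "hermes"}}},
        "actor": {"use_torch_compile": True},
    },
    "trainer": {"logger": ["console", "wandb"], "val_before_train": True, "save_freq": 64, "device": "cuda"},
}


def derive_npu_config() -> dict:
    """Deep-copy the base config, then adapt it for NPU training."""
    config = deepcopy(BASE_CONFIG)
    vllm = config["actor_rollout_ref"]["rollout"]["engine_kwargs"]["vllm"]
    del vllm["enable_auto_tool_choice"]
    del vllm["tool_call_parser"]
    del config["trainer"]["logger"][1]  # drop the second logger backend
    config["actor_rollout_ref"]["actor"]["use_torch_compile"] = False
    config["trainer"]["val_before_train"] = False
    config["trainer"]["save_freq"] = 256
    config["trainer"]["device"] = "npu"
    return config
```

The `deepcopy` matters: a shallow copy would let the `del` statements mutate the shared base config for every other profile.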
```python
def config_train_llama() -> Dict[str, Any]:
    """A configuration for training with LLaMA-3.2-1B-Instruct.
    ...
    """
```
Within `main()`, the argument parser and the dispatch table gain an `npu` entry:

```python
    parser.add_argument(
        "config",
        choices=["fast", "qwen", "llama", "npu"],
        help="Training configuration: 'fast' (CI testing), 'qwen' (Qwen-2.5-Coder-1.5B), 'llama' (LLaMA-3.2-3B), 'npu' (train with NPU)",
    )

    parser.add_argument(
        "--active-agent", type=str, help="Override the active agent name (default: auto-generated based on config)"
    )

    args = parser.parse_args()

    # Get the appropriate configuration
    config_functions = {
        "fast": config_train_fast,
        "qwen": config_train_qwen,
        "llama": config_train_llama,
        "npu": config_train_npu,
    }
    config = config_functions[args.config]()

    # Set active agent - use provided value or default based on config choice
    active_agent = args.active_agent
```
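The choice-to-function dispatch can be exercised in isolation. This sketch uses stub config functions in place of the real ones (names and return values are illustrative):

```python
import argparse


def config_fast() -> dict:  # stub standing in for config_train_fast
    return {"device": "cuda"}


def config_npu() -> dict:  # stub standing in for config_train_npu
    return {"device": "npu"}


CONFIGS = {"fast": config_fast, "npu": config_npu}

parser = argparse.ArgumentParser()
# choices= makes argparse reject any profile not in the dispatch table.
parser.add_argument("config", choices=sorted(CONFIGS))
args = parser.parse_args(["npu"])  # as if invoked: python train_sql_agent_npu.py npu
config = CONFIGS[args.config]()
print(config["device"])  # -> npu
```

Keeping `choices` and the dispatch dict derived from the same mapping avoids the two drifting apart when a new profile is added.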
```python
if __name__ == "__main__":
    main()
```

> **Contributor:** Please remove this file. We can post a separate blog if you want. We don't want to maintain a dedicated doc page for features unmaintained on CI.