RWKV-PEFT

[ English | 中文 ]

RWKV-PEFT is the official implementation of parameter-efficient fine-tuning (PEFT) for RWKV models. It supports a range of fine-tuning methods across multiple hardware platforms.

Recent updates

Support huggingface/PEFT

To use a method from PEFT, look up its usage example in the PEFT documentation, then pass the method name to --peft and its configuration as a JSON string to --peft_config:

LoRA:

--peft lora --peft_config '{"r":8,"lora_alpha":32,"lora_dropout":0.05}'

MiSS:

--peft miss --peft_config '{"r":8}'
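Because --peft_config takes a JSON string, it can be convenient to build the argument programmatically rather than quoting it by hand. The following is a minimal sketch using only the Python standard library. The helper name `peft_args` is hypothetical and not part of RWKV-PEFT; it only formats the two flags shown above.

```python
import json
import shlex


def peft_args(method: str, config: dict) -> list[str]:
    """Return the CLI tokens for --peft and --peft_config (hypothetical helper)."""
    # Compact separators reproduce the space-free JSON used in the examples above.
    payload = json.dumps(config, separators=(",", ":"))
    return ["--peft", method, "--peft_config", payload]


lora = peft_args("lora", {"r": 8, "lora_alpha": 32, "lora_dropout": 0.05})
miss = peft_args("miss", {"r": 8})

# shlex.join adds the shell quoting needed around the JSON payload.
print(shlex.join(lora))
print(shlex.join(miss))
```

The printed lines can be pasted after the training command. Passing the token list directly to `subprocess.run` would avoid shell quoting altogether.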

Important

State tuning:

--peft state --op fla

MiSS: Revisiting the Trade-off in LoRA with an Efficient Shard-Sharing Structure Paper

The method Bone/DiSHA has been officially renamed to MiSS. You can easily use it within PEFT (you’ll still see “Bone” for now, but it will be removed in future versions, so please use MiSS instead).

Installation

Important

Installation is mandatory.

git clone https://github.com/JL-er/RWKV-PEFT.git
cd RWKV-PEFT
uv sync   # or: pip install .

Hardware Requirements

RWKV-7 Models

The table below shows VRAM usage when fine-tuning RWKV-7 models, measured on an RTX 4090 (24 GB VRAM) with 64 GB RAM under the following configuration:

Training precision: BF16
--strategy deepspeed_stage_1
--ctx_len 1024
--micro_bsz 1
--lora_r 64 or disha_config='{"mode":"bone","r":32}'

Model Parameters	State Tuning	LoRA	DiSHA	PiSSA
RWKV7-0.1B	2.6 GB	2.7 GB	2.7 GB	2.6 GB
RWKV7-0.4B	3.1 GB	3.4 GB	3.1 GB	3.4 GB
RWKV7-1.5B	5.3 GB	5.6 GB	5.6 GB	5.6 GB
RWKV7-3B	8.2 GB	8.8 GB	8.8 GB	8.8 GB

VRAM requirements for quantized training of RWKV-7 models

INT8 VRAM Requirements

Model Parameters	State Tuning	LoRA	DiSHA	PiSSA
RWKV7-0.1B	2.4 GB	2.5 GB	2.5 GB	2.5 GB
RWKV7-0.4B	2.9 GB	2.9 GB	2.9 GB	3.0 GB
RWKV7-1.5B	4.1 GB	4.6 GB	4.5 GB	4.6 GB
RWKV7-3B	5.7 GB	6.7 GB	6.7 GB	6.7 GB

NF4 VRAM Requirements

Model Parameters	State Tuning	LoRA	DiSHA	PiSSA
RWKV7-0.1B	2.5 GB	2.4 GB	2.4 GB	2.4 GB
RWKV7-0.4B	2.8 GB	2.7 GB	2.7 GB	2.7 GB
RWKV7-1.5B	3.7 GB	3.9 GB	3.9 GB	3.9 GB
RWKV7-3B	4.7 GB	5.7 GB	5.7 GB	5.7 GB
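To make the effect of quantization easier to compare across model sizes, the LoRA columns of the three RWKV-7 tables above can be turned into relative savings. This is a small sketch over the published numbers only; the function name `saving_pct` is an illustrative choice.

```python
# VRAM in GB for LoRA fine-tuning, copied from the BF16, INT8, and NF4 tables above.
lora_vram_gb = {
    "RWKV7-0.1B": {"bf16": 2.7, "int8": 2.5, "nf4": 2.4},
    "RWKV7-0.4B": {"bf16": 3.4, "int8": 2.9, "nf4": 2.7},
    "RWKV7-1.5B": {"bf16": 5.6, "int8": 4.6, "nf4": 3.9},
    "RWKV7-3B": {"bf16": 8.8, "int8": 6.7, "nf4": 5.7},
}


def saving_pct(baseline: float, quantized: float) -> float:
    """Relative VRAM reduction versus the BF16 baseline, in percent."""
    return 100.0 * (baseline - quantized) / baseline


for model, row in lora_vram_gb.items():
    int8 = saving_pct(row["bf16"], row["int8"])
    nf4 = saving_pct(row["bf16"], row["nf4"])
    print(f"{model}: INT8 saves {int8:.0f}%, NF4 saves {nf4:.0f}%")
```

On these numbers the relative saving grows with model size: INT8 saves about 7% for the 0.1B model and about 24% for the 3B model.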

VRAM requirements of RWKV-6 models

The following shows memory usage when using an RTX 4090 (24GB VRAM) + 64GB RAM (with parameters: --strategy deepspeed_stage_1 --ctx_len 1024 --micro_bsz 1 --lora_r 64):

Model Size	Full Fine-tuning	LoRA/PiSSA	QLoRA/QPiSSA	State Tuning
RWKV6-1.6B	OOM	7.4 GB	5.6 GB	6.4 GB
RWKV6-3B	OOM	12.1 GB	8.2 GB	9.4 GB
RWKV6-7B	OOM	23.7 GB*	14.9 GB**	18.1 GB

Note:

* OOM when batch size is 8
** Requires 19.5 GB VRAM when batch size is 8

Quick Start

Install dependencies:

pip install -r requirements.txt

Run example script:

sh scripts/run_lora.sh

Note: For detailed data preparation steps, please refer to the official RWKV tutorial.

Main Features

Multiple Fine-tuning Methods: Supports LoRA, PiSSA, MiSS (formerly Bone/DiSHA), State Tuning, and others
Quantized Training: Supports INT8/NF4 quantization for significant VRAM reduction
Flexible Data Loading: Supports various data sampling strategies
Memory Optimization: Multiple DeepSpeed strategies available
Loss Masking: Supports loss masking for QA dialogue and padding
Infinite Context Training: Supports infctx training mode, utilizing RWKV's constant memory usage advantage to train with "infinite" context under limited resources
Multi-Hardware Support: RWKV-PEFT officially supports NVIDIA, AMD, Moore Threads, Musa, Iluvatar CoreX, and other hardware platforms. An Ascend NPU implementation will follow later. Note: issue support is currently limited to NVIDIA hardware
RWKV-FLA Efficient Training: rwkv-fla is a Triton-based linear attention operator that can run efficiently on hardware without CUDA support
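To illustrate the loss-masking feature listed above: the general idea is that only selected tokens (for example, the answer part of a QA pair) contribute to the training loss, while prompt and padding tokens are ignored. The pure-Python sketch below shows that idea only. The function name and the example numbers are hypothetical, and this is not the project's actual implementation.

```python
def masked_mean_loss(token_losses: list[float], mask: list[int]) -> float:
    """Average per-token losses over the positions where mask == 1."""
    if len(token_losses) != len(mask):
        raise ValueError("token_losses and mask must have the same length")
    kept = sum(mask)
    if kept == 0:
        return 0.0  # nothing to train on in this sample
    return sum(loss * m for loss, m in zip(token_losses, mask)) / kept


# Layout: 3 question tokens, 2 answer tokens, 1 padding token.
losses = [2.0, 1.5, 1.0, 0.8, 0.4, 9.9]
mask = [0, 0, 0, 1, 1, 0]  # train only on the answer tokens

print(masked_mean_loss(losses, mask))
```

Here the large loss on the padding position has no effect on the result, because its mask entry is 0.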

Detailed Configuration

PEFT Method Selection

--peft lora --peft_config '{"r":8,"lora_alpha":32,"lora_dropout":0.05}'

Supported values for --peft: state, lora, miss
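As a rough guide to what `r` and `lora_alpha` control: in the standard LoRA formulation, a frozen weight of shape d_out × d_in gets two trainable low-rank factors of shapes d_out × r and r × d_in, and the low-rank update is scaled by lora_alpha / r by default. The sketch below works through that arithmetic. The layer width 2048 is an illustrative assumption, not a value taken from this README or from any RWKV model.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Default LoRA scaling factor applied to the low-rank update."""
    return lora_alpha / r


d = 2048  # hypothetical layer width, chosen only for illustration
full = d * d
adapter = lora_trainable_params(d, d, r=8)

print(f"full weight: {full:,} params")
print(f"LoRA r=8:    {adapter:,} params ({100 * adapter / full:.2f}% of full)")
print(f"scaling:     {lora_scaling(32, 8)}")
```

Doubling `r` doubles the adapter size and, with `lora_alpha` unchanged, halves the scaling factor.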

Quantized Training

--quant int8   # or: --quant nf4
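For intuition about where the savings come from, the storage needed for the weights alone can be estimated from the bits per weight of each format. This is a back-of-the-envelope sketch, not a measurement: the parameter count is a nominal assumption, and the estimate ignores quantization constants, activations, adapter parameters, optimizer state, and any layers left unquantized, so it will not match the measured totals in the Hardware Requirements section.

```python
BITS_PER_WEIGHT = {"bf16": 16, "int8": 8, "nf4": 4}


def weight_gib(num_params: float, fmt: str) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


n = 3e9  # nominal parameter count for a "3B" model (assumption)
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: {weight_gib(n, fmt):.2f} GiB")
```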

Infinite Length Training (infctx)

--train_type infctx --chunk_ctx 512 --ctx_len 2048

ctx_len: Target training length
chunk_ctx: Slice length, must be smaller than ctx_len
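One way to picture the relationship between the two parameters is to list the token spans that a sequence of length ctx_len would be cut into. The sketch below does only that bookkeeping, under the assumption that chunks are consecutive and non-overlapping; the actual slicing inside RWKV-PEFT may differ, and `chunk_spans` is a hypothetical helper name.

```python
import math


def chunk_spans(ctx_len: int, chunk_ctx: int) -> list[tuple[int, int]]:
    """Split [0, ctx_len) into consecutive spans of at most chunk_ctx tokens."""
    if not 0 < chunk_ctx < ctx_len:
        raise ValueError("chunk_ctx must be positive and smaller than ctx_len")
    n_chunks = math.ceil(ctx_len / chunk_ctx)
    return [(i * chunk_ctx, min((i + 1) * chunk_ctx, ctx_len)) for i in range(n_chunks)]


spans = chunk_spans(ctx_len=2048, chunk_ctx=512)
print(len(spans), spans)
```

With the values from the example command, the 2048-token sequence is covered by 4 chunks of 512 tokens each, so only one 512-token slice needs to be processed at a time.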

DeepSpeed Strategy

--strategy deepspeed_stage_1

Available strategies:

deepspeed_stage_1: Preferred option
deepspeed_stage_2/3: For large models or full fine-tuning
deepspeed_stage_2_offload
deepspeed_stage_3_offload

FLA Operator

By default, RWKV-PEFT uses custom CUDA kernels for wkv computation. However, you can use --op fla to enable the Triton kernel:

--op fla

GPU Support

NVIDIA: CUDA
Intel, Moore Threads, Musa, Iluvatar CoreX: FLA (pass --op fla)
Ascend: CANN (soon)
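The list above can be expressed as a small lookup when launching jobs on mixed hardware. This is a sketch only: the vendor keys and the helper `op_flags` are hypothetical, and the mapping simply restates the list, using the --op fla flag documented in the FLA Operator section and no extra flag for the default CUDA kernels.

```python
# Mapping restated from the list above; None marks a backend that is not available yet.
OP_FLAGS = {
    "nvidia": [],  # default custom CUDA kernels, no extra flag
    "intel": ["--op", "fla"],
    "moore threads": ["--op", "fla"],
    "musa": ["--op", "fla"],
    "iluvatar corex": ["--op", "fla"],
    "ascend": None,  # CANN support is announced but not yet available
}


def op_flags(vendor: str) -> list[str]:
    """Return the extra CLI tokens for a vendor, or raise if none is supported."""
    flags = OP_FLAGS.get(vendor.strip().lower())
    if flags is None:
        raise ValueError(f"no supported operator backend for {vendor!r}")
    return list(flags)


print(op_flags("NVIDIA"))
print(op_flags("Intel"))
```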

Citation

If you find this project helpful, please cite our work:

@misc{kang2025missrevisitingtradeofflora,
      title={MiSS: Revisiting the Trade-off in LoRA with an Efficient Shard-Sharing Structure}, 
      author={Jiale Kang and Qingyu Yin},
      year={2025},
      eprint={2409.15371},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.15371}, 
}
