
zlH518 (Contributor) commented Nov 8, 2025

After running simple_example.py, nvidia-smi still showed processes holding GPU memory. I suspected that the processes created by the script were not being cleaned up, so I added code to clean them up, which solved the problem.
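For illustration, the kind of cleanup described above can be sketched as follows. This is a minimal sketch, not the code from this PR: it assumes the workers are `multiprocessing.Process` objects, and `shutdown_workers` and its `timeout` parameter are hypothetical names.

```python
import multiprocessing as mp
from typing import Iterable


def shutdown_workers(workers: Iterable[mp.Process], timeout: float = 5.0) -> None:
    """Stop and reap worker processes so the OS can release their resources.

    Hypothetical helper: the real project may track its workers differently.
    """
    workers = list(workers)

    # First pass: politely ask every live worker to stop.
    for proc in workers:
        if proc.is_alive():
            proc.terminate()

    # Second pass: wait for each worker, escalating if it ignores the request.
    for proc in workers:
        proc.join(timeout)
        if proc.is_alive():
            proc.kill()
            proc.join()
```

Calling such a helper in a `try`/`finally` block around the main work (or registering it with `atexit`) would make the cleanup run even when the example exits early.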

zlH518 (Contributor, Author) commented Nov 8, 2025

I also found that setting n_stage > 1 causes device errors, so I fixed the device transfer logic in modeling_qwen3.py. Key improvements:

- Dynamic device lookup: instead of hardcoding torch.device(0) or torch.device(i+1), the target device is determined by querying the device of the actual model parameters.
- Device transfer after embedding: after the embedding step, hidden_states is immediately moved to the device of the first layer to avoid a device mismatch.
- Inter-layer device transfer: when crossing from one device group to the next, the actual device of the next layer is queried instead of assuming a device index.
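The three points above can be sketched roughly as below. This is an illustrative sketch, not the code in modeling_qwen3.py: `module_device` and `run_layers` are hypothetical names, and the functions are written against the `parameters()` and `.to()` interface that `torch.nn.Module` and `torch.Tensor` expose, so `torch` itself is not imported here.

```python
from typing import Any, Sequence


def module_device(module: Any) -> Any:
    """Return the device of the module's first parameter.

    Assumes the module has at least one parameter; otherwise next() raises
    StopIteration.
    """
    return next(module.parameters()).device


def run_layers(embed: Any, layers: Sequence[Any], input_ids: Any) -> Any:
    """Run embedding and layers, moving activations to wherever each layer lives."""
    hidden_states = embed(input_ids)

    for layer in layers:
        # Query the layer's real device instead of assuming an index.
        target = module_device(layer)
        if hidden_states.device != target:
            hidden_states = hidden_states.to(target)
        hidden_states = layer(hidden_states)

    return hidden_states
```

Because the first loop iteration already compares against the first layer's device, the transfer after embedding and the transfers between device groups are handled by the same check.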

zheyishine (Collaborator) commented

Appreciate it!
It would be better to apply your second PR to the scaffold.py file, since most of our modeling code is auto-generated by scaffold.py rather than copied and revised by hand.
