[NeoML] DistributedTraining uses IsDnnInferenced #1110

Draft · wants to merge 1 commit into master from golikovDistributedTrainRunOnce
Conversation

@favorart (Contributor) commented Sep 5, 2024

Previously, performing a RunOnce (even on random data) before RunAndBackward meant the run no longer counted as the first run, and batches could then be distributed arbitrarily; as a result, some dnn could go without ever being trained. This workaround is no longer possible.

All dnns must have their paramBlobs initialized before solver->Train() can run for all of them; at minimum, RunOnce must have completed for every dnn for this to happen.

solver->Train() must run for all dnns because every dnn must hold the same paramBlobs in each epoch.
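
For context, here is a minimal C++ sketch of the invariant this change enforces: every per-thread dnn must have been inferenced (i.e. had its paramBlobs initialized by a completed RunOnce) before the solver may train all of them. The types and names below are illustrative stand-ins, not the actual NeoML sources; only `IsDnnInferenced` mirrors the method named in the PR title.

```cpp
#include <vector>

// Hypothetical stand-in for a per-thread network replica (for illustration only).
struct DnnReplica {
    bool paramBlobsInitialized = false;

    // A completed forward pass initializes the parameter blobs.
    void RunOnce() { paramBlobsInitialized = true; }

    // Mirrors the IsDnnInferenced check from the PR title.
    bool IsDnnInferenced() const { return paramBlobsInitialized; }
};

// solver->Train() may only run once *every* replica has been inferenced,
// so that all replicas enter the epoch with the same initialized paramBlobs.
bool canTrainAll(const std::vector<DnnReplica>& dnns) {
    for (const DnnReplica& dnn : dnns) {
        if (!dnn.IsDnnInferenced()) {
            return false; // this replica has no paramBlobs yet
        }
    }
    return true;
}
```

With a single shared first-run flag, one RunOnce on arbitrary data was enough to unlock arbitrary batch distribution for all threads; a per-dnn check like the one sketched above rules that workaround out.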

@favorart added the bug label Sep 5, 2024
@favorart force-pushed the golikovDistributedTrainRunOnce branch from 5ae5642 to ae3256d on September 5, 2024 10:59
@favorart marked this pull request as draft September 9, 2024 18:25