[NeoML] DistributedTraining uses IsDnnInferenced #1110

Draft · wants to merge 1 commit into master from golikovDistributedTrainRunOnce
Conversation

@favorart (Contributor) commented Sep 5, 2024

Previously, performing a RunOnce (even on random data) before RunAndBackward meant the run no longer counted as the first run, and batches could then be distributed arbitrarily; as a result, some dnn could go without ever being trained. This workaround is no longer possible.

All dnns must have their paramBlobs initialized before solver->Train() can run for all of them; at minimum, RunOnce must have completed for every dnn for this to happen.

solver->Train() must run for all dnns because every dnn must hold the same paramBlobs in each epoch.
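
For context, here is a minimal C++ sketch of the invariant this change enforces: every per-thread dnn must have been inferenced (i.e. had its paramBlobs initialized by a completed RunOnce) before the solver may train all of them. The types and names below are illustrative stand-ins, not the actual NeoML sources; only `IsDnnInferenced` mirrors the method named in the PR title.

```cpp
#include <vector>

// Hypothetical stand-in for a per-thread network replica (for illustration only).
struct DnnReplica {
    bool paramBlobsInitialized = false;

    // A completed forward pass initializes the parameter blobs.
    void RunOnce() { paramBlobsInitialized = true; }

    // Mirrors the IsDnnInferenced check from the PR title.
    bool IsDnnInferenced() const { return paramBlobsInitialized; }
};

// solver->Train() may only run once *every* replica has been inferenced,
// so that all replicas enter the epoch with the same initialized paramBlobs.
bool canTrainAll(const std::vector<DnnReplica>& dnns) {
    for (const DnnReplica& dnn : dnns) {
        if (!dnn.IsDnnInferenced()) {
            return false; // this replica has no paramBlobs yet
        }
    }
    return true;
}
```

With a single shared first-run flag, one RunOnce on arbitrary data was enough to unlock arbitrary batch distribution for all threads; a per-dnn check like the one sketched above rules that workaround out.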

@favorart added the bug label Sep 5, 2024
@favorart force-pushed the golikovDistributedTrainRunOnce branch from 5ae5642 to ae3256d on September 5, 2024 10:59
@favorart marked this pull request as draft September 9, 2024 18:25