📚 Documentation
I believe the optimizer in this example should be declared after the `parallelize_module` call, as in the sequence-parallelism example. Without this, on the latest torch the example does not appear to update the weights and thus never truly trains (sketch of the ordering below). Please let me know if I'm missing anything, and thanks so much for all your work!
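For context, a minimal sketch of what I mean (the model, plan, and hyperparameters here are illustrative placeholders, not the exact ones from the example; run under `torchrun` with one process per GPU). The issue is that `parallelize_module` replaces the module's parameters with DTensor shards, so an optimizer constructed beforehand keeps references to the old plain-tensor parameters and never updates the ones the model actually uses:

```python
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class ToyMLP(nn.Module):  # stand-in for the tutorial's model
    def __init__(self, dim: int = 128):
        super().__init__()
        self.w1 = nn.Linear(dim, 4 * dim)
        self.w2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))


tp_mesh = init_device_mesh("cuda", (torch.cuda.device_count(),))
model = ToyMLP().cuda()

# Current ordering (problematic): the optimizer captures the original
# nn.Parameter objects, which parallelize_module then swaps out for
# DTensor shards, so optimizer.step() updates tensors the model no
# longer uses.
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# parallelize_module(model, tp_mesh,
#                    {"w1": ColwiseParallel(), "w2": RowwiseParallel()})

# Proposed ordering (as in the sequence-parallel example): parallelize
# first, then build the optimizer over the now-sharded parameters.
parallelize_module(
    model,
    tp_mesh,
    {"w1": ColwiseParallel(), "w2": RowwiseParallel()},  # illustrative plan
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```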
Tiny fix PR below:
#1324