I've been trying this out for the last few weeks. I'm not deep in this space, but I do have some older data science skills from the pre-LLM era, when we built neural networks and used XGBoost and the like.
As I went through it, I wished there was more clarity on the hardware requirements. I spent a good chunk of time trying out different EC2 instances to figure out which had the right GPU and a compatible CUDA setup, and in the end I still needed to suppress some errors to get training running on the NanoGPT Template.
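For anyone hitting the same wall, a quick sanity check along these lines would have saved me a lot of instance-hopping. This is just a minimal sketch assuming the template runs on PyTorch; the capability threshold is illustrative, not the template's actual requirement:

```python
# Quick check I wish I'd run on each EC2 instance before kicking off training.
# Minimal sketch assuming PyTorch; the 8.0 (Ampere) threshold is illustrative.
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")

if torch.cuda.is_available():
    # CUDA runtime version this PyTorch build was compiled against
    print(f"CUDA (torch build): {torch.version.cuda}")
    for i in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(i)
        major, minor = torch.cuda.get_device_capability(i)
        mem_gb = torch.cuda.get_device_properties(i).total_memory / 1e9
        print(f"GPU {i}: {name}, compute capability {major}.{minor}, {mem_gb:.1f} GB")
        # bfloat16 / fused attention kernels generally want Ampere (8.0) or newer
        if (major, minor) < (8, 0):
            print("  warning: pre-Ampere GPU, some kernels may fall back or fail")
else:
    print("No CUDA device visible - check the instance type and NVIDIA driver")
```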