If the fMRI data is not normalized, it can indeed hurt how well the model fits. I fed my unnormalized data into the model for a three-class classification task, and it learned nothing: the accuracy stayed around 0.33 and the cross-entropy loss around 1.1, which is chance level for three classes (ln 3 ≈ 1.10). To check that the model itself was not the problem, I trained it on randomly generated samples, and it fit them well, with the loss approaching 0 and the accuracy close to 1. Since I am using a spatio-temporal transformer architecture, I suspect the lack of normalization is a major factor.
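For anyone who wants to try the same fix, here is a minimal sketch of one common normalization, z-scoring each time series along the time axis. It is not my exact pipeline: the array layout `(n_samples, n_timepoints, n_rois)`, the function name `zscore_time`, and the `eps` value are all assumptions made for illustration, and the input here is synthetic rather than real fMRI data.

```python
import numpy as np


def zscore_time(x, eps=1e-8):
    """Z-score every time series independently along the time axis.

    Assumed layout (hypothetical): x has shape
    (n_samples, n_timepoints, n_rois). Adjust `axis` if yours differs.
    """
    x = np.asarray(x, dtype=np.float64)
    mean = x.mean(axis=1, keepdims=True)  # per-sample, per-ROI mean over time
    std = x.std(axis=1, keepdims=True)    # per-sample, per-ROI std over time
    # eps keeps constant (zero-variance) series from dividing by zero
    return (x - mean) / (std + eps)


# Synthetic stand-in for raw signal with a large offset and scale
rng = np.random.default_rng(0)
raw = rng.normal(loc=1000.0, scale=50.0, size=(4, 120, 10))

normed = zscore_time(raw)
print("raw mean/std:   ", raw.mean(), raw.std())
print("normed mean/std:", normed.mean(), normed.std())
```

After this step every series has roughly zero mean and unit variance, so the inputs reaching the transformer's embedding and attention layers are on a comparable scale. Whether per-series, per-subject, or dataset-wide statistics work best depends on the data, so treat the choice of axis as something to experiment with.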