Hi, I tried to reproduce the Transformer Hawkes Process on StackOverflow fold1. However, the accuracy and RMSE I obtained are shown below.

I think I am missing something. Having compared this with the released code of the Self-Attentive Hawkes Process, I do not think the gap is due to the scaling factor. What causes the difference between the results in the paper and those from this repository?