See: https://database.lichess.org/#standard_games
Maybe use every nth game from the year 2013, before Lichess grew in size, so the dataset covers a more or less equal number of games per month while still covering a large time span, and to reduce the number of games that need to be processed.
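A rough sketch of what I mean by "every nth game", in Python. It assumes each game in the PGN export starts with an `[Event ...]` tag line, and the names `iter_games`, `every_nth` and `stride_for` are made up for illustration. As far as I know the monthly dumps are compressed, so they would have to be decompressed into a text stream first (not shown here).

```python
import io
from itertools import islice
from typing import Iterable, Iterator


def iter_games(lines: Iterable[str]) -> Iterator[str]:
    """Yield one PGN game at a time (assumes every game starts with an [Event ...] tag)."""
    current = []
    for line in lines:
        # A new [Event ...] tag closes the game collected so far.
        if line.startswith("[Event ") and current:
            yield "".join(current)
            current = []
        # Skip blank lines that appear before the first game.
        if current or line.strip():
            current.append(line)
    if current:
        yield "".join(current)


def every_nth(games: Iterable[str], n: int) -> Iterator[str]:
    """Keep games 0, n, 2n, ... from the stream without loading it all into memory."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return islice(games, 0, None, n)


def stride_for(total_games: int, target_games: int) -> int:
    """Hypothetical helper: choose n so that roughly target_games remain for a month."""
    if target_games < 1:
        raise ValueError("target_games must be >= 1")
    return max(1, total_games // target_games)


# Tiny in-memory stand-in for one monthly file with five games.
sample = "".join(f'[Event "g{i}"]\n\n1. e4 e5 1-0\n\n' for i in range(5))
kept = list(every_nth(iter_games(io.StringIO(sample)), 2))
print(len(kept))  # 3 (games g0, g2 and g4)
```

Picking a separate n per month with something like `stride_for` would keep the months roughly balanced even as the monthly game counts grow.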
PS: I'm happy to provide some compute for this project with my Google Colab Pro+ subscription :)