Each of the .gz files represents one hour on a certain date, and each line in the file gives the number of pageviews for a page in that hour. So the more pageview data you have, the less noisy the numbers will be. I would recommend having at least a week's worth of pageviews to cover any weekly fluctuations, but a month is probably ideal.
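As a rough illustration of combining the hourly files, here is a minimal sketch that sums the counts per page title across every file in a directory. It assumes the files keep their `pageviews-YYYYMMDD-HHMMSS.gz` names and that each line has four space-separated fields (domain code, page title, view count, response size). The function name `aggregate_pageviews` and the `domain_code` default are hypothetical and not part of this repository.

```python
import gzip
from collections import Counter
from pathlib import Path


def aggregate_pageviews(dump_dir, domain_code="en"):
    """Sum hourly view counts per page title for one wiki (hypothetical helper)."""
    totals = Counter()
    for path in sorted(Path(dump_dir).glob("pageviews-*.gz")):
        # Each file covers one hour; read it as text without unpacking to disk.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.rstrip("\n").split(" ")
                # Assumed layout: domain_code page_title count_views response_size
                if len(parts) != 4 or parts[0] != domain_code:
                    continue
                try:
                    totals[parts[1]] += int(parts[2])
                except ValueError:
                    continue  # skip lines whose count is not an integer
    return totals
```

Calling `aggregate_pageviews("path/to/downloaded/files")` returns a `Counter` keyed by page title. Whether this project expects the merged counts in this shape is not confirmed here, so treat it only as a way to inspect the data.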
Picking the same month as the other dumps would be best. The page names are most likely to match that way.
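To fetch a whole month, one option is to generate the list of hourly file URLs and hand it to whatever downloader you prefer. The sketch below assumes the directory layout and file naming seen under the pageviews dump link in this thread (`/YYYY/YYYY-MM/pageviews-YYYYMMDD-HH0000.gz`). The helper name `hourly_dump_urls` is hypothetical.

```python
from datetime import datetime, timedelta

# Base location of the hourly pageview dumps (taken from the link in this thread).
BASE_URL = "https://dumps.wikimedia.org/other/pageviews"


def hourly_dump_urls(year, month):
    """Build one URL per hour of the given month (hypothetical helper)."""
    current = datetime(year, month, 1)
    urls = []
    while current.month == month:
        urls.append(
            f"{BASE_URL}/{current:%Y}/{current:%Y-%m}/"
            f"pageviews-{current:%Y%m%d-%H}0000.gz"
        )
        current += timedelta(hours=1)
    return urls
```

For December 2020 this yields 744 URLs (31 days times 24 hours). An occasional hour may be missing on the server, so a download loop should tolerate failed requests.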
The date 20201201 that is hardcoded in define_es is the last time we recreated the index, but you don't necessarily have to use that specific month.
While having dates in the index names was fine for us, most users of the open-source release are going to index only once, so the dates add unnecessary complexity. I'll consider removing the dates from the index names in the next release.
It is not clear to me how I should download the dumps for pageviews and Wikidata. Here is the link to the pageview dumps, but it's not clear which file is the right one:
https://dumps.wikimedia.org/other/pageviews/2020/2020-12/
Should I merge them for the specific date 20201201 that is hardcoded in define_es.py?