Downloading the dumps to start the elasticsearch indexing #31

Open
roholazandie opened this issue Jun 21, 2021 · 2 comments

Comments

@roholazandie

It is not clear to me how I should download the dumps for pageviews and Wikidata. Here is the link to the pageview dumps, but it's not clear which file is the right one:

https://dumps.wikimedia.org/other/pageviews/2020/2020-12/

Should I merge the files for the specific date 20201201 that is hardcoded in define_es.py?

@AshwinParanjape
Contributor

Each of the .gz files represents one hour on a certain date, and each line in a file gives the number of pageviews for a page in that hour. So the more pageview data you have, the less noisy the numbers will be. I would recommend having at least a week's worth of pageviews to cover any weekly fluctuations, but a month is probably ideal.
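
To make the aggregation concrete, here is a minimal sketch of summing per-page counts across several hourly files. It is not the code used in this repository. It assumes the publicly documented pageview line layout of space-separated fields (`domain_code page_title count_views total_response_size`), and the function name `aggregate_pageviews` and the `domain_code` default of `"en"` are made up for illustration.

```python
import gzip
from collections import Counter


def aggregate_pageviews(paths, domain_code="en"):
    """Sum per-page view counts across hourly pageview dump files.

    Assumes each line looks like:
        domain_code page_title count_views total_response_size
    Lines for other domain codes, and malformed lines, are skipped.
    """
    totals = Counter()
    for path in paths:
        # Hourly dumps are gzip-compressed text; read them line by line.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                fields = line.rstrip("\n").split(" ")
                if len(fields) < 3 or fields[0] != domain_code:
                    continue
                try:
                    totals[fields[1]] += int(fields[2])
                except ValueError:
                    # Count field was not an integer; ignore the line.
                    continue
    return totals


if __name__ == "__main__":
    import glob

    # Hypothetical local directory holding the downloaded hourly files.
    files = sorted(glob.glob("pageviews/pageviews-202012*.gz"))
    for title, views in aggregate_pageviews(files).most_common(10):
        print(title, views)
```

Passing a week or a month of files simply means passing a longer list of paths; the counts accumulate in the same `Counter`.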

Picking the same month as the other dumps would be best. The page names are most likely to match that way.

The date 20201201 that is hardcoded in define_es is when we last recreated the index, but you don't necessarily have to use that specific month.
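
If it helps with picking a month, here is a rough sketch that lists the hourly file URLs for a whole month so they can be fed to a downloader. The file name pattern `pageviews-YYYYMMDD-HH0000.gz` is an assumption based on the directory listing linked above, so check it against the listing for the month you choose. The helper name `hourly_pageview_urls` is hypothetical and not part of this repository.

```python
from datetime import datetime, timedelta

BASE_URL = "https://dumps.wikimedia.org/other/pageviews"


def hourly_pageview_urls(year, month):
    """Return one URL per hour of the given month.

    Assumes files are named pageviews-YYYYMMDD-HH0000.gz and stored
    under <BASE_URL>/YYYY/YYYY-MM/.
    """
    current = datetime(year, month, 1)
    urls = []
    # Step one hour at a time until the calendar month changes.
    while current.month == month:
        urls.append(
            f"{BASE_URL}/{current:%Y}/{current:%Y-%m}/"
            f"pageviews-{current:%Y%m%d-%H}0000.gz"
        )
        current += timedelta(hours=1)
    return urls


if __name__ == "__main__":
    for url in hourly_pageview_urls(2020, 12)[:3]:
        print(url)
```

The resulting list can be written to a text file and passed to a tool such as `wget -i`, or trimmed to a single week if a full month is too much to download.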

@AshwinParanjape
Contributor

If you would like to change the hardcoded paths in wiki, you would need to do it in the following places

While having dates in the index names was fine for us, most users of the open-source release are going to index only once, so the dates add unnecessary complexity. I'll consider removing the dates from the index names in the next release.


2 participants