
wiki-redirects.txt file and tutorial for preprocessing mGENRE data #70

Open
Denescor opened this issue Feb 16, 2022 · 1 comment

Comments

@Denescor

Hello

I'm trying to preprocess a Wikipedia dump for a custom mGENRE training run, but I don't have access to the {}wiki-redirects.txt file (where {} is the language code of the dump).

This file is read by preprocess_wikidata when the step option is set to "redirects", to generate a pkl dictionary that is later used in process_anchor.
The script expects to find it in the wikipedia_redirect/target folder.
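For reference, here is my understanding of that step as a minimal sketch, based on how preprocess_wikidata reads the file; the exact key/value layout of the dictionary and the output filename are my guesses:

    import csv
    import pickle
    from urllib.parse import unquote

    lang = "en"  # language code of the dump

    # Build a {redirect source: target title} dictionary from the
    # tab-separated redirects file (the column semantics are my assumption).
    redirects = {}
    with open("wikipedia_redirect/target/{}wiki-redirects.txt".format(lang)) as f:
        for row in csv.reader(f, delimiter="\t"):
            source = unquote(row[0]).replace("_", " ")
            target = unquote(row[1]).split("#")[0].replace("_", " ")
            redirects[source] = target

    # Save as a pkl dictionary for process_anchor to load later
    # ("{}_redirects.pkl" is a hypothetical output name).
    with open("{}_redirects.pkl".format(lang), "wb") as f:
        pickle.dump(redirects, f)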

I couldn't find any script to generate this redirect file from a Wikipedia dump, nor any explanation of the file's format, so I couldn't recreate the script myself.

Similarly, I haven't found a tutorial explaining how to chain the different mGENRE preprocessing scripts in order to create the datasets and start training. I think I understand the role of each script and the order in which to run them, but I would appreciate an explanation from the start.

Thank you for your answers.

@TommasoPetrolito


Sorry for this very late reply. I don't know if this is still useful, but I noticed that the path used here (scripts_mgenre/preprocess_wikidata.py):

    with open("wikipedia_redirect/target/{}wiki-redirects.txt".format(lang)) as f:

is quite similar to the output of this log:

    Extracted 537441 redirects.
    Saved output: /home/hideki/edu.cmu.lti.wikipedia_redirect/target/jawiki-redirect.txt
    Done in 49 sec.

Make sure the extracted redirects are stored in a tab-separated .txt file.

    $ ls target -lh
    -rw-r--r-- 1 hideki users 250M 2013-01-25 16:48 enwiki-redirect.txt
    -rw-r--r-- 1 hideki users  25M 2013-01-25 16:25 jawiki-redirect.txt
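For illustration, a row in that file should look something like this (tab-separated; the second column would be the URL-encoded target title, and these example entries are made up):

    NYC	New_York_City
    Big%20Apple	New_York_City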

That tool can be found here: https://code.google.com/archive/p/wikipedia-redirect/

The mGENRE code then opens and treats it as a tab-separated file as well, so I suspect this might be the tool that was used to generate it:

            with open(
                "wikipedia_redirect/target/{}wiki-redirects.txt".format(lang)
            ) as f:
                for row in tqdm(csv.reader(f, delimiter="\t"), desc=lang):
                    title = unquote(row[1]).split("#")[0].replace("_", " ")

This is just an assumption anyway; I'm not 100% sure about it.
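If it helps, here is a small self-contained sketch of how the two could fit together; the copy/rename step (redirect to redirects, note the plural) and the paths are my assumptions based on the filenames above:

    import csv
    import shutil
    from urllib.parse import unquote

    lang = "ja"  # language code of the processed dump

    # The wikipedia-redirect tool writes {lang}wiki-redirect.txt, while
    # preprocess_wikidata looks for {lang}wiki-redirects.txt, so the file
    # presumably needs to be copied or renamed first.
    shutil.copy(
        "target/{}wiki-redirect.txt".format(lang),
        "wikipedia_redirect/target/{}wiki-redirects.txt".format(lang),
    )

    # Sanity check: parse the file the same way preprocess_wikidata does.
    with open("wikipedia_redirect/target/{}wiki-redirects.txt".format(lang)) as f:
        for row in csv.reader(f, delimiter="\t"):
            title = unquote(row[1]).split("#")[0].replace("_", " ")
            print(row[0], "->", title)
            break  # just print the first redirect as a check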
