Hello,

I'm trying to preprocess a Wikipedia dump for custom mGENRE training, but I don't have access to the {}wiki-redirects.txt file (where {} is the language of the dump).
This file is read by preprocess_wikidata with the step option set to "redirects" to build a pickled dictionary that is then used in process_anchor.
It is looked for in the wikipedia_redirect/target folder.
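For reference, this is roughly how I picture the pickled output being consumed downstream; the structure and path in this sketch are my own guesses from reading the code, not taken from the repo:

```python
import pickle

# My guess at the structure (unverified): one mapping per language from
# redirect title to canonical page title.
with open("redirects.pkl", "rb") as f:  # path is a placeholder
    redirects = pickle.load(f)

canonical = redirects["en"].get("NYC")  # e.g. -> "New York City" (assumed)
```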
I couldn't find any script that generates this redirect file from a Wikipedia dump, nor any description of the file's format, so I couldn't recreate it myself.
Similarly, I haven't found a tutorial explaining how to chain the different mGENRE preprocessing scripts to build the datasets and start training. I think I understand the role of each script and the order in which to run them, but an end-to-end explanation would still be welcome.
Thank you for your answers.
Sorry for the very late reply; I don't know whether this is still useful, but I noticed that the path used in scripts_mgenre/preprocess_wikidata.py, namely wikipedia_redirect/target/{}wiki-redirects.txt, closely matches the output of the Wikipedia Redirect extractor (edu.cmu.lti.wikipedia_redirect):
```
Extracted 537441 redirects.
Saved output: /home/hideki/edu.cmu.lti.wikipedia_redirect/target/jawiki-redirect.txt
Done in 49 sec.
```

Make sure the extracted redirects are stored in a tab-separated .txt file:

```
$ ls target -lh
-rw-r--r-- 1 hideki users 250M 2013-01-25 16:48 enwiki-redirect.txt
-rw-r--r-- 1 hideki users  25M 2013-01-25 16:25 jawiki-redirect.txt
```
The mGENRE code then opens and treats the file as tab-separated as well, so I suspect this is the tool that was used (note the names differ slightly: the extractor writes {}wiki-redirect.txt while mGENRE expects {}wiki-redirects.txt, so the files may need renaming):
```python
import csv
from urllib.parse import unquote
from tqdm import tqdm

with open(
    "wikipedia_redirect/target/{}wiki-redirects.txt".format(lang)
) as f:
    for row in tqdm(csv.reader(f, delimiter="\t"), desc=lang):
        # row[1] is the redirect target title (URL-decoded, fragment dropped)
        title = unquote(row[1]).split("#")[0].replace("_", " ")
```
This is just an assumption, though; I'm not 100% sure about it.
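If that Java tool is hard to run, here is a minimal, unverified sketch of how one might regenerate such a tab-separated redirects file directly from a MediaWiki XML dump in Python. The column order (source title, then target title) is an assumption inferred from the snippet above treating row[1] as the target; all paths are placeholders, and no namespace filtering is done:

```python
import bz2
import xml.etree.ElementTree as ET

def local(tag):
    # Strip the MediaWiki export namespace, e.g.
    # "{http://www.mediawiki.org/xml/export-0.10/}page" -> "page"
    return tag.rsplit("}", 1)[-1]

def extract_redirects(dump_path, out_path):
    # Stream the dump; for every page carrying a <redirect title="..."/>
    # element, write "source_title<TAB>target_title".
    with bz2.open(dump_path, "rb") as f, \
         open(out_path, "w", encoding="utf-8") as out:
        title = None
        for _, elem in ET.iterparse(f, events=("end",)):
            tag = local(elem.tag)
            if tag == "title":
                title = elem.text
            elif tag == "redirect" and title is not None:
                # The "title" attribute holds the redirect's target page
                out.write("{}\t{}\n".format(title, elem.get("title", "")))
            elif tag == "page":
                title = None
                elem.clear()  # free most memory on large dumps

# Example usage (paths are placeholders):
extract_redirects("jawiki-latest-pages-articles.xml.bz2",
                  "wikipedia_redirect/target/jawiki-redirects.txt")
```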