ngeiswei/techtc-builder
INTRODUCTION
------------

techtc-builder is software to build a dataset collection similar to
the techtc300 http://techtc.cs.technion.ac.il/techtc300/.

REQUIREMENTS
------------

* For build-techtc.py
  - Python 2.7 or above
  - python-lxml
  - the dmoz RDF dump files structure.rdf.u8 and content.rdf.u8,
    which can be downloaded at http://www.dmoz.org/rdf.html
  - one of the following text-based web browsers: w3m, lynx, elinks,
    links, links2
  - wget

* For techtc2CSV_all.sh
  - GNU parallel
  - PyStemmer

* For CSV_MI_filter.sh
  - OpenCog (more specifically feature-selection)

INSTALLATION
------------

** This is not required if you run the scripts from the project root **

Run the following script (most likely as super user)

# ./install.sh

You can specify the prefix where to install with the option --prefix,
such as

# ./install.sh --prefix /usr

USAGE
-----

1) Download structure.rdf.u8 and content.rdf.u8 from
http://www.dmoz.org/rdf.html

2) Strip the dmoz files of useless content:

$ strip_dmoz_rdf.py structure.rdf.u8 -o structure_stripped.rdf.u8
$ strip_dmoz_rdf.py content.rdf.u8 -o content_stripped.rdf.u8

This takes a few minutes but greatly speeds up the next step and
lowers memory usage (a sketch of this kind of streaming filter is
given under EXAMPLES below).

3) Create a techtc300

$ ./build-techtc.py -c content_stripped.rdf.u8 -s structure_stripped.rdf.u8 -S 300

This creates a directory techtc300 with the dataset collection
inside. In fact these options are the defaults, so the following

$ ./build-techtc.py

does the same thing. There are many more options; you can list them
with --help.

The default (and recommended) tool used to convert HTML to text is
w3m; you can change that with the option -H, such as

$ build-techtc.py -H html2text

Here it will use html2text (http://www.mbayer.de/html2text/files.shtml)
instead of w3m (see EXAMPLES below for what such a conversion boils
down to).

4) Remove ill-formed directories (where the positive or negative text
files are missing). Here the collection directory is techtc300;
replace as appropriate

$ ./strip_techtc techtc300

5) -- optional -- Convert the text files into feature vectors in CSV
format. Each row corresponds to a document and each column to a word
(0 if the word does not appear in the document, 1 otherwise). The
first row contains the words themselves (a sample is shown under
EXAMPLES below). Running

$ ./techtc2CSV.py techtc300

creates a data.csv file under each Exp_XXXX_XXXX directory. Then, if
you no longer need the intermediary text files, you can run

$ ./strip_csv.sh techtc300

Beware that this permanently removes everything except the CSV files.

Since the web is full of mess, you may end up with CSV files
containing thousands of words plus lots of gibberish. To filter that
out you can use a filter that keeps only the features whose mutual
information with the target is above a given threshold (the criterion
is sketched under EXAMPLES below); run

$ ./CSV_MI_filter.sh techtc300 0.05

The script copies the collection under techtc300_MI_0.05, where 0.05
is the threshold, then rewrites all the data.csv files with only the
filtered features. This step uses feature-selection, a tool provided
with the OpenCog project http://wiki.opencog.org/w/Building_OpenCog

AUTHOR
------

Nil Geisweiller
email: [email protected]
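
EXAMPLES
--------

The snippets in this section are illustrative sketches, not code
shipped with the project.

Step 2's stripping is the kind of job usually done with a streaming
XML parser, so the multi-gigabyte dumps never have to fit in memory.
Here is a minimal sketch using python-lxml; the element name, the
bare <RDF> wrapper, and the call at the bottom are simplifications
(the real dumps carry RDF namespaces, and strip_dmoz_rdf.py may work
differently):

    # Sketch only -- not the actual strip_dmoz_rdf.py.
    from lxml import etree

    def strip_rdf(in_path, out_path, wanted_tags):
        with open(out_path, 'wb') as out:
            out.write(b'<RDF>\n')  # real dumps use a namespaced RDF root
            # iterparse yields each element once its closing tag is read
            for event, elem in etree.iterparse(in_path, events=('end',)):
                # QName(...).localname drops the namespace from the tag
                if etree.QName(elem).localname in wanted_tags:
                    out.write(etree.tostring(elem) + b'\n')
                    elem.clear()  # free the record's subtree
                    # drop references the root keeps to finished siblings
                    while elem.getprevious() is not None:
                        del elem.getparent()[0]
            out.write(b'</RDF>\n')

    # Hypothetical call: keep only Topic records from the structure dump
    strip_rdf('structure.rdf.u8', 'structure_stripped.rdf.u8', {'Topic'})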
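
In step 3 the chosen text browser is what turns each fetched page
into plain text. With the default w3m the conversion amounts to
something like (an illustration of the idea, not a line taken from
build-techtc.py):

$ w3m -dump page.html > page.txt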
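
To make step 5's format concrete, here is a hypothetical data.csv for
two documents and three words; the words and values are made up, and
details of the real layout (e.g. how the positive/negative label is
encoded) may differ:

    learning,theory,pizza
    1,1,0
    0,1,1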
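
The filter of the last step keeps a word w whenever its mutual
information with the target, I(w; target), is above the threshold;
for binary variables I(X;Y) = sum over x,y of
p(x,y) * log2(p(x,y) / (p(x) p(y))). A self-contained sketch of that
criterion in plain Python, assuming 0/1 feature and label lists (the
project itself delegates this computation to OpenCog's
feature-selection):

    # Sketch of the mutual-information criterion, in bits.
    from __future__ import division
    from collections import Counter
    from math import log

    def mutual_information(xs, ys):
        """I(X;Y) for two equal-length sequences of 0/1 values."""
        n = len(xs)
        pxy = Counter(zip(xs, ys))  # joint counts of (x, y) pairs
        px = Counter(xs)            # marginal counts of x
        py = Counter(ys)            # marginal counts of y
        mi = 0.0
        for (x, y), c in pxy.items():
            # p(x,y) * log2(p(x,y) / (p(x) * p(y))); the 1/n factors
            # in the ratio cancel down to c * n / (px[x] * py[y])
            mi += (c / n) * log(c * n / (px[x] * py[y]), 2)
        return mi

    # A word column would be kept when its MI clears the threshold:
    # keep = mutual_information(word_column, labels) > 0.05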
About
Software to build a dataset collection similar to the techtc300 http://techtc.cs.technion.ac.il/techtc300/
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published