Commit
While searching for a way to build on Arunmozhi's code base, I found the following project on Launchpad by Benjamin Thyreau, an application to easily read Wikipedia's downloaded dump files: https://launchpad.net/wikipediadumpreader

As you can see from the above link, it is open source and dual licensed under the Simplified BSD Licence and GNU GPL v2. By combining features from Benjamin's code and Arunmozhi's code, I have built an application that seems to handle the basic functions OK. Please see the enclosed screenshot.

Benjamin's code base had two things that we are looking for: a PyQt4 user interface and more usable (though not complete) parsing of the wiki markup. On top of that he had built the ability to follow links: a user can click on hyperlinked words in the results to look those words up further.

However, on additional testing I discovered that his code base had one big limitation for our use. It can only be used as an English to Tamil dictionary; Tamil words cannot be looked up. This is because his indexing was not in Unicode, as he himself noted in comments in his code.

Arunmozhi uses the Python Whoosh module for indexing and searching. Whoosh is natively built to handle Unicode. He also split the larger Wiktionary dump file into smaller chunks for faster look-ups. And he went a step further and built a Windows exe as well.

"Wouldn't it be great if we could combine Benjamin's PyQt4 user interface and wiki parsing with Arunmozhi's indexing/searching and then build a Windows exe following Arunmozhi's steps", I thought. However, it turned out to be harder than it initially appeared (isn't it always ;-) ), especially because I am new to Python! But, long story short, I have the modified code as well as the Windows exe now.

The following are the only files needed for the exe to run:
1) Karthika.exe
2) Index folder "indexdir"
3) Wiktionary dump broken up in the "chunks" folder.
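To make the "chunks plus index" idea concrete, here is a minimal Python 3 sketch. It is not the Whoosh-based code in this commit: a plain dictionary keyed by Unicode titles stands in for the index, and the names `split_into_chunks` and `look_up` are hypothetical. It only illustrates why a Unicode-aware title-to-chunk map lets Tamil headwords be looked up as easily as English ones.

```python
# Hypothetical stand-in for the real index: a dict instead of Whoosh.

def split_into_chunks(entries, chunk_size):
    """Split (title, text) pairs into chunks and map each title to its chunk."""
    chunks = []
    title_to_chunk = {}
    for start in range(0, len(entries), chunk_size):
        chunk = entries[start:start + chunk_size]
        for title, _text in chunk:
            # Python 3 str is Unicode, so Tamil titles are ordinary keys.
            title_to_chunk[title] = len(chunks)
        chunks.append(chunk)
    return chunks, title_to_chunk


def look_up(title, chunks, title_to_chunk):
    """Return the text for a title, scanning only the chunk that holds it."""
    chunk_no = title_to_chunk.get(title)
    if chunk_no is None:
        return None
    for candidate, text in chunks[chunk_no]:
        if candidate == title:
            return text
    return None
```

Only one small chunk is scanned per look-up, which is the point of breaking the dump up in the first place.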
Showing 99 changed files with 1,819 additions and 0 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
import PyQt4.QtGui | ||
# Overload just the setSource member of QTextBrowser | ||
# Should only be necessary with Qt < 4.3 (missing Qt4.3's setOpenLinks(False)) | ||
class QTextBrowser2(PyQt4.QtGui.QTextBrowser): | ||
    def setSource(*args): | ||
        pass |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,90 @@ | ||
A Wikipedia-Dump Reader. | ||
|
||
This Reader displays the text-only archives of wikipedia, which can be | ||
downloaded from : | ||
http://download.wikimedia.org/backup-index.html | ||
and are usually named like : | ||
pages-articles.xml.bz2 | ||
|
||
It requires Python, Qt and PyQt. Although only Qt4/PyQt4 is supported now, the | ||
old Qt3/PyQt3 code is still included and should still work. | ||
It also assumes you have basic tools like gzip, zcat and zgrep, tail, head... | ||
|
||
(Optional) You will need the command line applications "texvc" and "latex" in | ||
order to render math expressions. (texvc is provided with this application) | ||
|
||
This reader is not yet complete, although fairly usable in its current form. | ||
|
||
Usage | ||
----- | ||
1. on the commandline, run: | ||
python dumpReader.py | ||
or just click on it from your favorite file manager | ||
|
||
2. Browse and select the archive (some file probably named *.xml.bz2) | ||
|
||
3. If it's the first time, an index is created, which can take a lot of time. | ||
The English dumps currently need more than an hour. Note that if you | ||
abort during the index creation, it will be usable, although obviously | ||
incomplete. (Useful for users who want to quick-test the program ;) | ||
Currently, the program needs write permission on the same directory. | ||
|
||
4. The main window contains the article title area (top), main text area | ||
(left) and article history (right). You can go to an article by typing | ||
its name then clicking the "Go" button, or by clicking a link from the main | ||
text area. By default, clicking a link loads the article in the background. | ||
The search-box area allows keyword searches among the articles' titles. | ||
You can also go to a random article by clicking "Go" with an empty entry. | ||
|
||
* You will need the command line application "Texvc" in order to | ||
render math expressions. This tool requires "Latex". Note that it | ||
will use a directory (usually /tmp/wikipediaDumpReader_texvm/) to | ||
render the images, which is cleared at the restart of the application. | ||
|
||
FAQ | ||
--- | ||
Q. Can I get my dump quickly up-to-date while I'm online ? | ||
A. No. As far as I know, there is no way to "update" your currently downloaded | ||
xml.bz2 dump to sync it. The only way to get up-to-date is to delete the old | ||
dump (and also the generated index files) and to fully re-download a new one. | ||
|
||
Q. I don't like the background-loading behaviour. Can I change it ? | ||
A. If you want to immediately see the content of clicked links, you have to | ||
manually modify the program : Edit the "dumpReader.py" file, go to the line | ||
which says "self.loadTabInBackground = True" and change "True" to "False". | ||
|
||
Q. Can I disable the graphical rendering of the maths ? ("latex rendering") | ||
A. Yes, but you will have to manually modify the program : Edit the | ||
"dumpReader.py" file, go to the line which says "self.latexRendering = True" | ||
and change "True" to "False" | ||
|
||
Q. Can I change the text size ? | ||
A. Font size can now be changed, although you will have to manually modify | ||
the program : Edit the "dumpReader.py" file, go to the line which says | ||
"fontSize = 9" and change "9" to whatever point size fits you best. | ||
This will only change the font size of the text area. | ||
|
||
Q. Can I edit the User Interface to change more settings ? | ||
A. If you have the Qt4 "designer" program, shipped with Qt-tools, you | ||
can edit "form3.ui" to fit your needs | ||
|
||
Q. What is the "debug" button ? | ||
A. This is needed only for developers. When toggled on, each newly-loaded | ||
article is also copied on the upper area. When pressing "apply regex", | ||
it's filtered to the lower area. | ||
|
||
Q. The program says : RuntimeWarning: Python C API version mismatch for | ||
module bz2: This Python has API version 1013, module bz2 has version 1012. | ||
A. This can be safely ignored. This occurs because I provide a precompiled | ||
binary bz2.so module. You are welcome to recompile your own if you want | ||
from the src/ directory. Warning : this is NOT the standard bz2.so python | ||
module, it's a static copy with some changes. | ||
|
||
Q. How can I delete entries from the dump-selection initial dialog box ? | ||
A. There is no other way than editing the file ".wikipediadumpreaderrc" from | ||
your home directory and removing the lines you don't want. You may need | ||
to check "display hidden files" on your file manager to find this file. | ||
|
||
-- | ||
Benjamin Thyreau - 7/2009 | ||
[email protected] |
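For readers who want a feel for what the first indexing pass has to do, here is a minimal Python 3 sketch that streams article titles out of a bz2-compressed XML dump. It is not the reader's own indexer (which ships a modified bz2 module); it assumes the standard library `bz2` module and that each `<title>` element sits on a single line. The name `iter_titles` is hypothetical.

```python
import bz2
import re

# Assumption: <title>...</title> is on one line and is not nested.
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def iter_titles(dump_path):
    """Yield article titles from a bz2-compressed XML dump, line by line."""
    # "rt" decompresses on the fly and decodes to text, so the whole
    # archive never has to fit in memory.
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            match = TITLE_RE.search(line)
            if match:
                yield match.group(1)
```

Titles are yielded exactly as they appear in the file, so XML entities such as `&amp;` are left unescaped in this sketch.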
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,78 @@ | ||
# Convert unsorted gzipped dump-entries list (ie. first-pass index) to sorted and | ||
# seekable gzipped entries list. Use the third-party 'zran' program almost unmodified | ||
|
||
import os | ||
from os.path import join as J | ||
import re, pickle | ||
import bisect | ||
|
||
global zranbin | ||
|
||
def assert_zran_runtime(): | ||
    global zranbin | ||
    zranbin = J(os.path.dirname(__file__), './zran_wdr') | ||
    assert os.path.exists(zranbin), "can't find 'zran_wdr' binary at '%s'" % zranbin | ||
    assert 'usage' in os.popen(zranbin + ' 2>&1').read(), "unexpected error calling 'zran_wdr'" | ||
|
||
def build_sorted_entrylist(zindexfilename): | ||
    # assert everything is ok before starting | ||
    assert_zran_runtime() | ||
    assert zindexfilename.endswith('.idx.gz'), "wrongly named .idx.gz filename" | ||
    zindexfilename_s = zindexfilename[:-3] + '_s.gz' | ||
    assert not os.path.exists(zindexfilename_s), "a file named %s already exists" % zindexfilename_s | ||
    assert 'sorted' in os.popen("LANG=C sort --help").read(), "unexpected error calling 'sort'" | ||
    assert 'counts' in os.popen("LANG=C wc --help").read(), "unexpected error calling 'wc'" | ||
    filesize = int(os.stat(zindexfilename)[6]) // 1024 | ||
    tmp_freespace = int(os.popen('/bin/df -P /tmp').readlines()[1].split()[3]) | ||
    assert filesize < tmp_freespace, "not enough space left on /tmp (report %dK, need %sK)" % (tmp_freespace, filesize) | ||
|
||
    # Do actual sorting - blocking + slow + i don't think i can monitor progress | ||
|
||
    tmpname = os.tmpnam() | ||
    # it looks that utf8-encoded strings won't work on shell commands after an ">", thus tmpname | ||
    print "zcat input | LANG=C sort | gzip -c > %s" % tmpname # this print was crashing the app with utf8 args when run from the gnome-panel (?!) | ||
    os.popen(("zcat %s | LANG=C sort | gzip -c > %s" % (zindexfilename.encode('utf-8'), tmpname))) | ||
    print "checking" | ||
    nblines_old = int(os.popen(("zcat %s | wc -l" % zindexfilename).encode('utf-8')).read().strip()) | ||
    nblines_new = int(os.popen("zcat %s | wc -l" % tmpname).read().strip()) | ||
    assert nblines_new == nblines_old, "number of entries don't match" | ||
    os.popen("/bin/mv -f %s %s" % (tmpname, zindexfilename.encode('utf-8'))) | ||
    print "indexing entrylist" | ||
    #filesize = int(os.stat(zindexfilename)[6]) / 100 | ||
    bufsize = "409600" | ||
    cmd = os.popen( zranbin + " %s -i %s -S %s -c 2>&1 | grep zran_index_save_point" % (zindexfilename.encode('utf-8'), zindexfilename_s.encode('utf-8'), bufsize)) | ||
    L = [('', '0', '0')] | ||
    for l in cmd: | ||
        r = re.findall('(.*)zran_index_save_point out=(\d+), in=(\d+)_(.*)', l)[0] | ||
        L.append((r[0]+r[3], r[1], r[2])) | ||
        #print int(r[2]) // filesize # progress bar | ||
|
||
    Ltxt = pickle.dumps(L, protocol=2) # almost __repr__ | ||
|
||
    # Cat the entrylist tab and its file-offset at the end of the _s file. | ||
    f = open(zindexfilename_s, 'a') | ||
    f.seek(0, 2) | ||
    l=f.tell() | ||
    length = '0x%08X' % l | ||
    f.write(Ltxt) | ||
    f.write(length) | ||
    f.close() | ||
    print "Finished" | ||
|
||
def load_entrylist_table(zindexfilename_s): | ||
    try: | ||
        assert_zran_runtime() | ||
    except AssertionError: | ||
        return None | ||
    f = open(zindexfilename_s) | ||
    f.seek(-10, 2) | ||
    # the last 10 bytes hold the table offset as '0x%08X'; parse it as hex instead of eval() | ||
    f.seek(int(f.read(10), 16)) | ||
    idx_s = pickle.loads(f.read()[:-10]) | ||
    return idx_s | ||
|
||
def load_entry_addr(entry, idx_s, zindexfilename): # fixme entry & filename must be already utf8-decoded | ||
    global zranbin | ||
    zindexfilename_s = zindexfilename[:-3] + '_s.gz' | ||
    i=bisect.bisect(idx_s, (entry,)) | ||
    return i != len(idx_s) and os.popen(zranbin + ' %s -i %s %s -s %d | grep "^%s\t"' % (zindexfilename, zindexfilename_s, idx_s[i-1][1], int(idx_s[i][1]) - int(idx_s[i-1][1]) + 255, entry) ).read() | ||
|
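The `_s.gz` file ends with a small trailer: the pickled save-point table followed by a 10-character hexadecimal offset that says where the table starts. Here is a minimal Python 3 round-trip sketch of that layout, using binary file modes and parsing the offset with `int(text, 16)`. The names `append_table` and `load_table` are hypothetical, and the sketch assumes the file is smaller than 4 GiB so the offset fits in eight hex digits.

```python
import pickle

TRAILER_LEN = 10  # length of '0x%08X' % n for any n below 2**32


def append_table(path, table):
    """Append the pickled table and a fixed-width hex offset to the file."""
    payload = pickle.dumps(table, protocol=2)
    with open(path, "ab") as f:
        f.seek(0, 2)
        offset = f.tell()  # the table starts at the current end of file
        f.write(payload)
        f.write(("0x%08X" % offset).encode("ascii"))


def load_table(path):
    """Read the trailer, jump to the table, and unpickle it."""
    with open(path, "rb") as f:
        f.seek(-TRAILER_LEN, 2)
        offset = int(f.read(TRAILER_LEN).decode("ascii"), 16)
        f.seek(offset)
        return pickle.loads(f.read()[:-TRAILER_LEN])
```

Because the offset has a fixed width, a reader can always find it by seeking 10 bytes back from the end, whatever the size of the data that precedes the table.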