PyQT4 folder added
While searching for a way to build on Arunmozhi's code base, I found the
following project on Launchpad by Benjamin Thyreau - An application to
easily read Wikipedia's downloaded dump files:
https://launchpad.net/wikipediadumpreader

As you can see from the above link, this is open source and dual
licensed under Simplified BSD Licence and GNU GPL v2.

By combining features from Benjamin's code and Arunmozhi's code, I have
built an application that seems to do the basic functions OK. Please see
the enclosed screenshot.

Benjamin's code base had two things that we are looking for: PyQT4 user
interface and more usable (though not complete) parsing of the wiki
markup. On top of that he had built the ability to follow links. A user
can click on hyperlinked words in the results to look-up those words
further. However, on additional testing I discovered his code base had
one big limitation for our use: it works only as an English-to-Tamil
dictionary, and Tamil words cannot be looked up. This is because
his indexing is not Unicode-aware, as he himself noted in comments in his
code.

Arunmozhi uses the Python Whoosh module for indexing and searching. Whoosh
is natively built to handle Unicode. He also split the larger Wiktionary
dump file into smaller chunks for faster look-ups. And he went a step
further and built a Windows exe as well.
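
To illustrate the chunking idea described above, here is a minimal sketch in
Python 3. It is not Arunmozhi's code and does not use Whoosh: a plain dict
stands in for the search index, the function names split_dump and lookup are
made up for this example, and the regex-based parsing of <page> and <title>
elements is a simplification that assumes well-formed, un-nested pages.

```python
import os
import re
import tempfile

# Simplified patterns for pages and titles in a MediaWiki XML dump.
PAGE_RE = re.compile(r"<page>.*?</page>", re.DOTALL)
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def split_dump(dump_text, chunk_dir, pages_per_chunk=2):
    """Write pages into numbered chunk files; return {title: chunk file name}."""
    os.makedirs(chunk_dir, exist_ok=True)
    index = {}
    pages = PAGE_RE.findall(dump_text)
    for start in range(0, len(pages), pages_per_chunk):
        name = "chunk-%04d.xml" % (start // pages_per_chunk)
        group = pages[start:start + pages_per_chunk]
        # Writing as UTF-8 keeps Tamil titles and text intact.
        with open(os.path.join(chunk_dir, name), "w", encoding="utf-8") as out:
            out.write("\n".join(group))
        for page in group:
            match = TITLE_RE.search(page)
            if match:
                index[match.group(1)] = name
    return index


def lookup(title, index, chunk_dir):
    """Open only the chunk that holds the title and return its page, or None."""
    name = index.get(title)
    if name is None:
        return None
    with open(os.path.join(chunk_dir, name), encoding="utf-8") as chunk:
        for page in PAGE_RE.findall(chunk.read()):
            match = TITLE_RE.search(page)
            if match and match.group(1) == title:
                return page
    return None


# Tiny stand-in for a Wiktionary dump (hypothetical sample data).
SAMPLE_DUMP = """<mediawiki>
<page><title>book</title><text>புத்தகம்</text></page>
<page><title>புத்தகம்</title><text>book</text></page>
<page><title>water</title><text>நீர்</text></page>
</mediawiki>"""

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as chunk_dir:
        index = split_dump(SAMPLE_DUMP, chunk_dir)
        print(lookup("புத்தகம்", index, chunk_dir))
```

Because the titles are kept as Unicode strings, a Tamil headword is looked up
exactly like an English one, and each look-up reads one small chunk rather
than the whole dump.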

"Wouldn't it be great if we can combine Benjamin's PyQT4 user interface
and wiki parsing with Arunmozhi's indexing/searching and then build a
Windows exe following Arunmozhi's steps", I thought. However, it turned
out to be harder than it initially appeared (isn't it always? ;-)),
especially because I am new to Python! But, long story short, I have the
modified code as well as the Windows exe now.

The following are the only files needed for the exe to run:
1) Karthika.exe
2) Index folder "indexdir"
3) Wiktionary dump broken up in the "chunks" folder.
AshokR committed Jun 10, 2012
1 parent f700b21 commit d55343d
Showing 99 changed files with 1,819 additions and 0 deletions.
Binary file added PyQT4/Karthika.exe
Binary file not shown.
637 changes: 637 additions & 0 deletions PyQT4/Karthika.py

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions PyQT4/QTextBrowser2.py
@@ -0,0 +1,6 @@
import PyQt4.QtGui
# Overload just the setSource member of QTextBrowser
# Should only be necessary with Qt < 4.3 (missing Qt4.3's setOpenLinks(False))
class QTextBrowser2(PyQt4.QtGui.QTextBrowser):
    def setSource(*args):
        pass
90 changes: 90 additions & 0 deletions PyQT4/README
@@ -0,0 +1,90 @@
A Wikipedia-Dump Reader.

This Reader displays the text-only archives of wikipedia, which can be
downloaded from :
http://download.wikimedia.org/backup-index.html
and are usually named like :
pages-articles.xml.bz2

It requires Python, Qt and PyQt. Although only Qt4/PyQt4 is supported now, the
old Qt3/PyQt3 code is still included and should still work.
It also assumes you have basic tools like gzip, zcat, zgrep, tail, head...

(Optional) You will need the command line applications "texvc" and "latex" in
order to render math expressions. (texvc is provided with this application)

This reader is not yet complete, although fairly usable in its current form.

Usage
-----
1. on the commandline, run:
python dumpReader.py
or just click on it from your favorite file manager

2. Browse and select the archive (some file probably named *.xml.bz2)

3. If it's the first time, an index is created, which can take a lot of time.
The English dumps currently need more than an hour. Note that if you
abort during the index creation, it will be usable, although obviously
incomplete. (Useful for users who want to quick-test the program ;)
Currently, the program needs write permission on the same directory.

4. The main window contains the article title area (top), main text area
(left) and article history (right). You can go to an article by typing
its name and then clicking the "Go" button, or by clicking a link in the main
text area. By default, clicking a link loads the article in the background.
The search-box area allows keyword searches among the article titles.
You can also go to a random article by clicking "Go" with an empty entry.

* You will need the command line application "Texvc" in order to
render math expressions. This tool requires "Latex". Note that it
will use a directory (usually /tmp/wikipediaDumpReader_texvm/) to
render the images, which is cleared when the application restarts.

FAQ
---
Q. Can i get my dump quickly up-to-date while i'm online ?
A. No. As far as i know, there is no way to "update" your currently downloaded
xml.bz2 dump to sync it. The only way to get up-to-date is to delete the old
dump (and also generated indexes files) and to fully re-download a new one.

Q. I don't like the background-loading behaviour. Can i change it ?
A. If you want to immediately see the content of clicked links, you have to
manually modify the program : Edit the "dumpReader.py" file, go to the line
which says "self.loadTabInBackground = True" and change "True" to "False".

Q. Can i disable the graphical rendering of the maths ? ("latex rendering")
A. Yes, but you will have to manually modify the program : Edit the
"dumpReader.py" file, go to the line which says "self.latexRendering = True"
and change "True" to "False"

Q. Can i change the text size ?
A. Font size can now be changed, although you will have to manually modify
the program : Edit the "dumpReader.py" file, go to the line which says
"fontSize = 9" and change "9" to whatever point size fits you best.
This will only change the font size of the text area.

Q. Can i edit the User Interface to change more settings ?
A. If you have the Qt4 "designer" program, shipped with Qt-tools, you
can edit "form3.ui" to fit your needs

Q. What is the "debug" button ?
A. This is needed only for developers. When toggled on, each newly-loaded
article is also copied to the upper area. When pressing "apply regex",
it's filtered to the lower area.

Q. The program says : RuntimeWarning: Python C API version mismatch for
module bz2: This Python has API version 1013, module bz2 has version 1012.
A. This can be safely ignored. This occurs because i provide a precompiled
binary bz2.so module. You are welcome to recompile your own if you want
from the src/ directory. Warning : this is NOT the standard bz2.so python
module, it's a static copy with some changes.

Q. How can I delete entries from the dump-selection initial dialog box ?
A. There is no other way than editing the file ".wikipediadumpreaderrc" from
your home directory and removing the lines you don't want. You may need
to check "display hidden files" on your file manager to find this file.

--
Benjamin Thyreau - 7/2009
[email protected]
78 changes: 78 additions & 0 deletions PyQT4/convert_idx_s.py
@@ -0,0 +1,78 @@
# Convert unsorted gzipped dump-entries list (ie. first-pass index) to sorted and
# seekable gzipped entries list. Use the third-party 'zran' program almost unmodified

import os
from os.path import join as J
import re, pickle
import bisect, os

global zranbin

def assert_zran_runtime():
    global zranbin
    zranbin = J(os.path.dirname(__file__), './zran_wdr')
    assert os.path.exists(zranbin), "can't find 'zran_wdr' binary at '%s'" % zranbin
    assert 'usage' in os.popen(zranbin + ' 2>&1').read(), "unexpected error calling 'zran_wdr'"

def build_sorted_entrylist(zindexfilename):
    # assert everything is ok before starting
    assert_zran_runtime()
    assert zindexfilename.endswith('.idx.gz'), "wrongly named .idx.gz filename"
    zindexfilename_s = zindexfilename[:-3] + '_s.gz'
    assert not os.path.exists(zindexfilename_s), "a file named %s already exists" % zindexfilename_s
    assert 'sorted' in os.popen("LANG=C sort --help").read(), "unexpected error calling 'sort'"
    assert 'counts' in os.popen("LANG=C wc --help").read(), "unexpected error calling 'wc'"
    filesize = int(os.stat(zindexfilename)[6]) // 1024
    tmp_freespace = int(os.popen('/bin/df -P /tmp').readlines()[1].split()[3])
    assert filesize < tmp_freespace, "not enough space left on /tmp (report %dK, need %sK)" % (tmp_freespace, filesize)

    # Do actual sorting - blocking + slow + i don't think i can monitor progress

    tmpname = os.tmpnam()
    # it looks that utf8-encoded strings won't work on shell commands after an ">", thus tmpname
    print "zcat input | LANG=C sort | gzip -c > %s" % tmpname # this print was crashing the app with utf8 args when run from the gnome-panel (?!)
    os.popen(("zcat %s | LANG=C sort | gzip -c > %s" % (zindexfilename.encode('utf-8'), tmpname)))
    print "checking"
    nblines_old = int(os.popen(("zcat %s | wc -l" % zindexfilename).encode('utf-8')).read().strip())
    nblines_new = int(os.popen("zcat %s | wc -l" % tmpname).read().strip())
    assert nblines_new == nblines_old, "number of entries don't match"
    os.popen("/bin/mv -f %s %s" % (tmpname, zindexfilename.encode('utf-8')))
    print "indexing entrylist"
    #filesize = int(os.stat(zindexfilename)[6]) / 100
    bufsize = "409600"
    cmd = os.popen( zranbin + " %s -i %s -S %s -c 2>&1 | grep zran_index_save_point" % (zindexfilename.encode('utf-8'), zindexfilename_s.encode('utf-8'), bufsize))
    L = [('', '0', '0')]
    for l in cmd:
        r = re.findall('(.*)zran_index_save_point out=(\d+), in=(\d+)_(.*)', l)[0]
        L.append((r[0]+r[3], r[1], r[2]))
        #print int(r[2]) // filesize # progress bar

    Ltxt = pickle.dumps(L, protocol=2) # almost __repr__

    # Cat the entrylist tab and its file-offset at the end of the _s file.
    f = open(zindexfilename_s, 'a')
    f.seek(0, 2)
    l=f.tell()
    length = '0x%08X' % l
    f.write(Ltxt)
    f.write(length)
    f.close()
    print "Finished"

def load_entrylist_table(zindexfilename_s):
    try:
        assert_zran_runtime()
    except AssertionError:
        return None
    f = open(zindexfilename_s)
    f.seek(-10, 2)
    f.seek(eval(f.read(10)))
    idx_s = pickle.loads(f.read()[:-10])
    return idx_s

def load_entry_addr(entry, idx_s, zindexfilename): # fixme entry & filename must be already utf8-decoded
    global zranbin
    zindexfilename_s = zindexfilename[:-3] + '_s.gz'
    i=bisect.bisect(idx_s, (entry,))
    return i != len(idx_s) and os.popen(zranbin + ' %s -i %s %s -s %d | grep "^%s\t"' % (zindexfilename, zindexfilename_s, idx_s[i-1][1], int(idx_s[i][1]) - int(idx_s[i-1][1]) + 255, entry) ).read()
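
The two mechanisms in convert_idx_s.py above (a pickled table appended to the
end of a file together with a 10-character hex offset, and a bisect over the
sorted table to pick the region to scan) can be sketched on their own in
Python 3. This is an illustration under assumptions, not a drop-in
replacement: the names append_table, load_table and find_block are made up
here, the zran_wdr call is left out entirely, files are opened in binary mode
(which Python 3 needs for pickles), and int(..., 16) is used to parse the
offset where the original calls eval.

```python
import bisect
import os
import pickle
import tempfile

TRAILER_LEN = 10  # length of '0x%08X' % n for offsets below 2**32


def append_table(path, table):
    """Append the pickled table, then the offset at which the pickle starts."""
    with open(path, "ab") as f:
        f.seek(0, 2)
        offset = f.tell()
        f.write(pickle.dumps(table, protocol=2))
        f.write(("0x%08X" % offset).encode("ascii"))


def load_table(path):
    """Read the trailing offset, seek to it, and unpickle the table."""
    with open(path, "rb") as f:
        f.seek(-TRAILER_LEN, 2)
        offset = int(f.read(TRAILER_LEN).decode("ascii"), 16)
        f.seek(offset)
        return pickle.loads(f.read()[:-TRAILER_LEN])


def find_block(table, entry, slack=255):
    """Return (start, length) of the region to scan for entry, or None.

    table is a sorted list of (label, out_offset, in_offset) tuples whose
    first element is ('', '0', '0'), as built in build_sorted_entrylist.
    The slack mirrors the "+ 255" in load_entry_addr.
    """
    i = bisect.bisect(table, (entry,))
    if i == len(table):
        return None  # past the last save point, as in the original
    start = int(table[i - 1][1])
    return start, int(table[i][1]) - start + slack


if __name__ == "__main__":
    # Hypothetical save points, for illustration only.
    table = [("", "0", "0"), ("cat", "100", "40"), ("moon", "250", "90")]
    fd, path = tempfile.mkstemp()
    os.close(fd)
    append_table(path, table)
    print(find_block(load_table(path), "dog"))
    os.remove(path)
```

Parsing the trailer with int rather than eval avoids executing whatever bytes
happen to sit at the end of the file, which matters if the index file can
come from an untrusted source.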
