Kanripo Web Interface

Search

fulltext search with KWIC

upon search, the following should happen:

  • a list of the first n items is returned, with links to the texts (or, as the default, all items are returned)
  • this list can be browsed and paged through
  • at the same time, a breakdown of the most frequent sections (部類) and texts is returned to other parts of the page; this can be used to filter further
    • in the future, additional aspects of the texts, such as dynasty or author, should also be available
  • this is realized by putting the whole result into redis:
    • when doing a search, first check whether the result is already in redis
      • redis results will have an expiry, and all of them are deleted when a new index is built
    • if not, do the search and put the results in redis, including the analysis
    • if yes, just fetch the stored result
import app.main.views
from collections import Counter

key=u"原文"
ox = app.main.views.doftsearch(key)
ux = ox.decode('utf8')
s=ux.split('\n')
# text number of every match (second tab-separated field, up to the colon)
b = [a.split('\t')[1].split(':')[0] for a in s if "\t" in a]
# count b; most_common() gives the most frequent first
c=Counter(b)
# section of every match (first four characters of the same field)
b1 = [a.split('\t')[1][0:4] for a in s if "\t" in a]
# count the sections the same way
c1=Counter(b1)
for t, k in c.most_common(10):
  print t, k
for t, k in c1.most_common(10):
  print t, k
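
The check-redis-first flow from the list above can be sketched as follows. This is a minimal sketch under stated assumptions, not the app's code: `cached_search`, `FakeStore`, the `ttl` default and the sample result lines are all hypothetical. `FakeStore` only mimics the few redis-py list calls used here (`exists`, `rpush`, `lrange`, `expire`) so the example runs without a server; a real redis client would be passed in instead.

```python
class FakeStore(object):
    """In-memory stand-in for the redis-py calls used below (hypothetical)."""

    def __init__(self):
        self.data = {}
        self.ttl = {}

    def exists(self, key):
        return key in self.data

    def rpush(self, key, *values):
        self.data.setdefault(key, []).extend(values)
        return len(self.data[key])

    def lrange(self, key, start, stop):
        items = self.data.get(key, [])
        # Like redis, stop is inclusive; only -1 ("to the end") is handled here.
        return items[start:] if stop == -1 else items[start:stop + 1]

    def expire(self, key, seconds):
        self.ttl[key] = seconds
        return True


def cached_search(store, key, search, ttl=3600):
    """Return sorted result lines for key, searching only on a cache miss."""
    if store.exists(key):
        return store.lrange(key, 0, -1)
    # Keep only real hit lines (they contain a tab) and sort them.
    lines = sorted(line for line in search(key).split("\n") if "\t" in line)
    if lines:
        store.rpush(key, *lines)
        store.expire(key, ttl)
    return lines


calls = []


def fake_search(key):
    """Hypothetical search function returning invented sample output."""
    calls.append(key)
    return "b\tZB5a0228:1\nno-tab-line\na\tZB5a0227:3\n"


store = FakeStore()
first = cached_search(store, u"原文", fake_search)
second = cached_search(store, u"原文", fake_search)  # served from the store
```

With a real redis-py client, `lrange` returns bytes unless the client was created with `decode_responses=True`, so the cached branch may need a decode step.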

redis implementation

  • the results are stored in a list, using the search key as the redis key (e.g. u"原文")
    • this makes it possible to access them by index
    • Q: what to do for different sort criteria?
      • store the sorted result under another key with the sort criterion appended, e.g. u"原文-prev"
    • need a typology of possible search keys
  • also store u"原文-texts" and u"原文-sections", each holding a hash of the results for that part:
    • u"原文-texts" will hold the result of b from above
    • u"原文-sections" will hold the result of b1 from above
  • all keys will have a suitable expiry attached
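
As a rough illustration of the derived keys, the per-text and per-section counts for one result set could be prepared like this. It is only a sketch: `breakdown_keys` is a hypothetical helper, the sample lines are invented, and it assumes each hit line has the form `<context>\t<text id>:<location>`, as the splitting code in these notes implies.

```python
from collections import Counter


def breakdown_keys(query, lines):
    """Map derived key names to {id: hit count} dicts for one result set."""
    # The second tab-separated field is assumed to be "<text id>:<location>".
    refs = [line.split("\t")[1] for line in lines if "\t" in line]
    texts = Counter(ref.split(":")[0] for ref in refs)
    sections = Counter(ref[:4] for ref in refs)
    return {
        query + u"-texts": dict(texts),
        query + u"-sections": dict(sections),
    }


# Invented sample hit lines, for illustration only.
sample = [
    u"x\tZB5a0227:1",
    u"y\tZB5a0227:7",
    u"z\tZB6q0001:2",
]
plan = breakdown_keys(u"原文", sample)
```

Each mapping in `plan` could then be written to its key as a redis hash (recent redis-py versions accept `hset(name, mapping=...)`), followed by `expire` on the same key.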
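import app.main.views
r=app.main.views.redis_store
key=u"原文"
ox = app.main.views.doftsearch(key)
ux = ox.decode('utf8')
s=ux.split('\n')
# we want to sort the results, since they come in unsorted!
s.sort()
subset = s[0:10]
# push each line as its own list element (rpush takes the values as separate arguments)
r.rpush(key, *subset)
# LRANGE bounds are inclusive, so this returns up to 21 elements
r.lrange(key, 0, 20)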
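
Because the results sit in a redis list, paging comes down to computing the right `LRANGE` bounds. The sketch below shows one way to do that; `page_bounds` and `get_page` are hypothetical helpers, and `store` stands for any object with a redis-style `lrange(key, start, stop)` method.

```python
def page_bounds(page, per_page=10):
    """Return inclusive (start, stop) list indices for a 1-based page number."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    start = (page - 1) * per_page
    # LRANGE includes the element at stop, hence the -1.
    return start, start + per_page - 1


def get_page(store, key, page, per_page=10):
    """Fetch one page of stored result lines for key."""
    start, stop = page_bounds(page, per_page)
    return store.lrange(key, start, stop)
```

A last page that is only partly filled needs no special handling, since redis simply returns the elements that exist within the requested range.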

=> Upgraded to 2.8.2, also needed to update conf.

metadata

wrote a script to import the ~/$mandoku/meta/ catalog files into redis: /Users/chris/krp/mandoku/python/mandoku/mdcatalog2redis.py

gitlab api

I am now using the GitLab API with a private token to pull content in and send it to the user. This ensures that I have the latest version.
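
For illustration, a request for one raw file might be built as below. This is a sketch written for Python 3 against GitLab's current v4 REST API (the raw-file endpoint with a `PRIVATE-TOKEN` header); these notes predate v4, so the endpoint actually used may differ. The host, project id, file path and token are placeholders, and `raw_file_request` is a hypothetical helper. The request is only constructed here, not sent.

```python
from urllib.parse import quote
from urllib.request import Request


def raw_file_request(base_url, project_id, file_path, token, ref="master"):
    """Build (but do not send) a request for one raw file from a GitLab repo."""
    url = "{0}/api/v4/projects/{1}/repository/files/{2}/raw?ref={3}".format(
        base_url.rstrip("/"),
        quote(str(project_id), safe=""),  # numeric id or URL-encoded path
        quote(file_path, safe=""),        # slashes in the path must be encoded
        quote(ref, safe=""),              # branch, tag or commit
    )
    return Request(url, headers={"PRIVATE-TOKEN": token})


# Placeholder values, for illustration only.
req = raw_file_request(
    "https://gitlab.example.org", 42, "ZB5a0227/Readme.org", "secret-token"
)
```

Passing the result to `urllib.request.urlopen(req)` would perform the download; choosing a different `ref` is one way to expose other branches.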

TODO:

  • make a proper flask extension with configuration
  • expose different branches, users' forks etc.

templates

  • ported the template structure etc. over from krp_『道藏輯要』; this needed:
    • the flask-Babel extension and its initialization code
    • pybabel translation catalogs etc.
      • TODO: the ja message catalog is still fuzzy for some reason: investigate and correct.
    • for the moment using the bootstrap files etc. from the old version, not the flask extension; might need to update this later. [2014-09-05T20:30:39+0900]
    • now ready to put it together!

catalog

count ids etc:

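from collections import Counter
import itertools
from app import redis_store as r
# all catalog hashes
c = r.keys('zb:meta*')
# the field names present in each hash
a = [r.hgetall(b).keys() for b in c]
# how many records carry each field
counter = Counter(itertools.chain(*a))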

RESULT (field, count, sample value where given):

LINKS     9271
TITLE     9271
RESP      7330
DYNASTY   7362
page      1528  040314a
WITLIST   9271
DZ        1528  DZ0226
HFL       1528  'HFL\xe7\x8f\xa0\xe4\xb8\x8a022' = HFL珠上022
CBETA_ID  4275  T50n2046
XWDZ      1528  XWDZ06p0828
ZHDZ      1527  ZHDZ19p0143
ID        9271  ZB5a0227

dynasties:

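# collect the DYNASTY value of every catalog hash that has one
a = [[r.hgetall(b)[k] for k in r.hgetall(b).keys() if k == 'DYNASTY'] for b in c]
c2 = Counter(itertools.chain(*a))
# most_common() returns (value, count) pairs, most frequent first
for dynasty, n in c2.most_common(40):
   print dynasty, n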

宋 1958
明 1064
唐 1052
清 1016
元 446
隋 153
西晉 141
失譯 137
民國 101
後漢 79
劉宋 78
漢 75
吳 63
東晉 52
梁 52
姚秦 43
元魏 43
日本 42
晉 33
新羅 31
□ 25
後秦 20
陳 20
金 20
後魏 16
天親菩薩造 15
龍樹菩薩造 14
高麗 12
方廣錩整理 12
北涼 12
魏 11
世親菩薩造 11
北周 10
南唐 9
無著菩薩造 9
達照整理 8
張總整理 8
闕譯 7
遼 7
方廣錩 6