
Commit 950d34f

committed
Initialized github commit
0 parents  commit 950d34f

12 files changed: +1414 -0 lines changed

INSTALL.txt

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
Each script can be used on its own, so you can install the dependencies SEPARATELY. This is recommended, notably because some dependencies can be heavy, especially if you don't use all the scripts.


General Dependencies:
====================

python3, python3-pip


Dependencies for import.py:
===========================
pip3 install pysolr

Dependencies for topics.py:
===========================
pip3 install gensim nltk

To get the appropriate tokenizer, you need to download the punkt tokenizer models (http://www.nltk.org/_modules/nltk/tokenize/punkt.html):
In a shell:
$python3
>>>import nltk
>>>nltk.download()
Downloader> d
Identifier> punkt
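
Alternatively, the same download can be done non-interactively (a one-liner using the standard nltk API):
$python3 -c "import nltk; nltk.download('punkt')"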

Dependencies for graph.py:
==========================
pip3 install pandas matplotlib
apt-get install python3-tk


Dependencies for topics_visu.py:
================================
pip3 install wordcloud

LICENSE.md

Lines changed: 651 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
# toolpic: Topic Modelling


toolpic is a toolkit for topic modelling. Written in Python, it is based on unsupervised learning, namely latent Dirichlet allocation (LDA). The algorithm is implemented in gensim (https://radimrehurek.com/gensim/index.html). Four open source scripts manage the full process. The first one (import.py) extracts and imports text from the OpenEdition database (according to publication years and given languages). The second one (topics.py) cleans/stems the text in order to fit the training parameters or to train the topic model directly. The third one (graph.py) plots the parameter dependency. The last one (topics_visu.py) allows you to visualize each topic (two formats available).


## Contributors

Mathieu Orban.

## Installation

See INSTALL.txt

## Usage

See README.txt

## Licence

toolpic is released under the terms of the GNU AFFERO GENERAL PUBLIC LICENSE

## Documentation

README.txt

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
USAGE:
=====
For any script, you can get more detailed options with:
$python3 <script.py> -h



*******************
import.py
*******************

Summary:
-------
This script imports the full text from Solr by platform (required option), by selected years and by language.
Each text is saved in its own file, under a directory named after its year.


Settings:
---------
You first need to specify your Solr URL in settings.py.

Example:
-------
To import French text from revues.org published in 2005 and 2007:
$python3 import.py -p RO -i 2005 2007 -l fr -d mydirectory
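
With these options, the resulting layout would look like this (file names illustrative; they derive from the Solr document ids):
mydirectory/2005/revues.org_document_id.txt
mydirectory/2007/revues.org_other_document_id.txt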

Note:
----
The -q/--query option is not yet implemented. ['fr', 'en', 'es', ...]: OpenEdition abbreviations for specifying a language.


******************
topics.py
******************
Summary:
-------
This script:
- Gets a corpus directory.
- Cleans and/or stems each document of the corpus and converts the corpus to the gensim format.
- Builds a bag-of-words model.
- Applies a TF-IDF transformation.
- Runs the LDA model (in multiprocess mode), according to the chosen option, either to:
  - fit the alpha parameter and the number of topics (a CSV file is generated to evaluate the log perplexity), or
  - generate a topic model.


Files generated:
--------------
GENSIM FORMAT TEXT: '/tmp/gensim_docs.txt' is generated. A big text file where each line is one document (format required by gensim).
LOG: './lda_model.log' (by default) is always saved in the current directory (Python logging library).
FIT PARAMETERS (option -f=True): './fit_result.csv' is saved in the current directory. This file can be used for visualisation with graph.py.
TRAIN MODEL (option -f=False; default): './lda.model.expElogbeta.npy', './lda.model.id2word', './lda.model.state' and './lda.model' are saved in the current directory. ONLY 'lda.model' can be used for data visualisation with topics_visu.py (a reload sketch is shown at the end of this section).

Dictionary:
---------
By default, a stop_tartarus.txt (http://snowball.tartarus.org/algorithms/french/stemmer.html) and a stop_calenda.txt are provided. These are lists of stop words.

Example:
-------
To fit the best parameters, with gensim info logs, for French text (documents are under the directory train):
$./topics.py -l french -d train -f True -vvv
To train the model on a French corpus:
$./topics.py -l french -d train


Note:
----
Take care of memory overload in Python multiprocessing. See option -m for details. ['french', 'english', 'spanish', ...]: stemmer names for specifying a language.

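The trained model can be reloaded later through the standard gensim API (a minimal sketch; 'lda.model' as produced above):
$python3
>>>from gensim import models
>>>lda = models.LdaModel.load('lda.model')
>>>lda.show_topics(num_topics=5, num_words=10)
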
*******************
graph.py
*******************
Summary:
------
This script plots two PDF files to help choose the right alpha parameter and the right number of topics for the corpus.

Required:
--------
The './fit_result.csv' file must be present in the current directory to run graph.py. See above (topics.py) for how to generate it. A sample of the expected CSV is shown after this section.

Files generated:
--------------
Num_topics_correlation.pdf and Alpha_correlation.pdf in the current directory.
To get a better idea of log perplexity, see http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf.

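For reference, 'fit_result.csv' carries the header written by topics.py; the values below are purely hypothetical:
topics,parameter,perplexity,per_word-perplexity,time
20,0.10,-1523847.21,-8.31,42.7
20,0.30,-1498210.05,-8.12,41.9
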
*******************
topics_visu.py
*******************
Summary:
------
This script visualizes each topic of a trained LDA model, both as a flat word list and as word clouds.

Required:
--------
The './lda.model' file must be present in the current directory to run topics_visu.py. See above (topics.py) for how to generate it.

Files generated:
--------------
'./output_file.txt'. This text file represents a flat list of words per topic.
i files './WordCloud{i}.pdf', where i is the number of topics. Each represents the word cloud of one topic; the size of a word is proportional to the probability of the word in the topic.

Notes:
-----
Be careful: you generate as many PDF files as there are topics.


*******************
Full pipeline
*******************
Summary:
-----
A full pipeline could be (a concrete command sequence is sketched below):
- Import data for a specific platform and a specific language (import.py).
- Fit the parameters (very long running time) with option -f (topics.py).
- Analyse the results to select the best parameters (graph.py).
- Train the model (long running time) with the parameters chosen above (topics.py).
- Visualize and analyse the topics and the words per topic (topics_visu.py).

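A possible end-to-end run (options are illustrative; check each script with -h):
$python3 import.py -p RO -i 2005 2007 -l fr -d train
$./topics.py -l french -d train -f True
$python3 graph.py
$./topics.py -l french -d train
$python3 topics_visu.py
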

graph.py

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
#!/usr/bin/env python3

import pandas
import matplotlib.pyplot as plt

df = pandas.read_csv('fit_result.csv')

# One curve per alpha value: per-word perplexity against the number of topics.
for i in df['parameter'].unique():
    x = df[df['parameter']==i]['topics']
    y = df[df['parameter']==i]['per_word-perplexity']
    plt.title('Parameter dependency')
    plt.xlabel('Number of Topics')
    #plt.ylim([-1, 20])
    plt.ylabel('Perplexity by word')
    plt.plot(x, y, linestyle='--', marker='o', label='alpha=%s' % i)
    plt.tick_params(axis='y', which='both', labelleft='off', labelright='on')
plt.legend(loc='best')
plt.savefig('Alpha_correlation.pdf', format='pdf')
plt.close()


# One curve per number of topics: per-word perplexity against alpha.
for i in df['topics'].unique():
    x = df[df['topics']==i]['parameter']
    y = df[df['topics']==i]['per_word-perplexity']
    plt.title('Topic dependency')
    plt.xlabel('parameter')
    #plt.ylim([-1, 20])
    plt.ylabel('Perplexity by word')
    plt.plot(x, y, linestyle='--', marker='o', label='topic=%s' % i)
    plt.tick_params(axis='y', which='both', labelleft='off', labelright='on')
plt.legend(loc='best')
plt.savefig('Num_topics_correlation.pdf', format='pdf')

import.py

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
#!/usr/bin/env python3

__author__ = 'morban'

import pysolr
import os
import sys
import html
import settings as s

import argparse

# argparse
parser = argparse.ArgumentParser(description='Import data from solr. Written by Mathieu Orban.')
parser.add_argument('-i', '--index_list', nargs='+', help='Pass a list of different year indexes', default=None)
parser.add_argument('-b', '--begin', type=int, help='First year to process')
parser.add_argument('-e', '--end', type=int, help='Last year to process')
parser.add_argument('-l', '--lang', type=str, help='Language specified')
parser.add_argument('-q', '--query', type=str, help='Specify your own query (not yet implemented)')
parser.add_argument('-p', '--platform', type=str, help='Type of platform selected', required=True)
parser.add_argument('-d', '--directory', type=str, help='Directory where text will be saved', default='./train/')
args = parser.parse_args()


def findNumFound(solr, request, fq=None):
    # Query with rows=0: we only need the total hit count.
    params = {'rows': 0, 'fq': fq}
    results = solr.search(request, **params)
    return results.hits

def importCalenda(year):
    solr = pysolr.Solr(s.url_solr, timeout=20)
    request = 'platformID:{} AND yearFacet:{}'.format(args.platform, year)
    filter_queries = ['texte_{}:[* TO *]'.format(args.lang)]
    numFound = findNumFound(solr, request, fq=filter_queries)
    print(numFound)
    stop = numFound
    step = 100
    # Get results by data bundle
    for i in range(0, stop, step):
        print(i)
        params = {'rows': step, 'start': i, 'sort': 'id DESC', 'fq': filter_queries}
        results = solr.search(request, **params)
        print(len(results))
        saveInputFiles(results, year)

def saveInputFiles(results, year):
    year_directory = '{0}/{1}'.format(args.directory, year)
    if not os.path.exists(year_directory):
        os.makedirs(year_directory)
    for result in results:
        name_id = ''.join((result['id'].replace('http://', '').replace('/', '_'), '.txt'))
        write_path = '{}/{}'.format(year_directory, name_id)
        # Create the file if absent, overwrite it if it already exists.
        if not os.path.exists('./{}'.format(write_path)):
            mode = 'a'
        else:
            mode = 'w'
        print('\t Processing %s' % write_path)
        # Unescape HTML entities from the language-specific text field
        # (was hard-coded to 'texte_fr'; use the language passed with -l).
        naked_texte = html.unescape(result['texte_{}'.format(args.lang)])
        with open(write_path, mode) as f:
            f.write(naked_texte)

if __name__ == '__main__':
    # -i takes precedence; otherwise process every year from -b to -e, inclusive.
    year_list = args.index_list if args.index_list else list(range(args.begin, args.end + 1))
    for i in year_list:
        year = str(i)
        print('Processing year : {}'.format(year))
        importCalenda(year)

searcher.py

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
#!/usr/bin/env python3
from gensim.matutils import Sparse2Corpus
import time
import random
from gensim import models
import numpy as np
import csv

## @brief Class designed to fit parameters for LDA
class Searcher(object):

    # @brief Instantiate a Searcher object
    # @param ntopics: list. A list of different numbers of topics
    # @param params: list. A list of alpha parameters
    # @param corpus: list. A list representing the vectorized documents
    # @param id2word: dict. A mapping between ids and words
    def __init__(self, ntopics, params, corpus, id2word):
        self.ntopics = ntopics
        self.params = params
        self.corpus = corpus
        self.id2word = id2word

    # @brief Split the corpus into train and test sets
    # @return Tuple. A tuple of 2 corpora
    def shuffleCorpus(self):
        cp = list(self.corpus)
        random.shuffle(cp)
        # split into 95% training and 5% test sets
        p = int(len(cp) * .95)
        return (cp[0:p], cp[p:])

    # @brief Write the log perplexity for each given parameter to a CSV file
    def search(self):
        (cp_train, cp_test) = self.shuffleCorpus()
        number_of_words = sum(cnt for document in cp_test for _, cnt in document)
        grid = list()
        print('\t\tNumber of words in test corpus: %f\n' % number_of_words)
        for num_topic in self.ntopics:
            for param in self.params:
                print("\t\t**********************************\n \tStarting pass for %d topics and parameter_value = %.2f\n" % (num_topic, param))
                start_time = time.time()
                lda = models.LdaMulticore(corpus=cp_train, id2word=self.id2word, num_topics=num_topic, chunksize=2000, passes=1, alpha=param, eta=param)
                train_time = time.time() - start_time
                print('\tTraining time: %s\n' % train_time)

                start_time = time.time()
                perplex = lda.bound(cp_test)
                print('\tPerplexity: %4f\n' % perplex)

                #per_word_perplex = np.exp2(-perplex / number_of_words)
                per_word_perplex = lda.log_perplexity(cp_test)
                print('\tPer-word Perplexity: %s\n' % per_word_perplex)

                elapsed = time.time() - start_time
                print('\tPerplexity time: %s\n' % elapsed)

                result = [num_topic, param, perplex, per_word_perplex, elapsed]
                grid.append(result)

        with open('fit_result.csv', 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(('topics', 'parameter', 'perplexity', 'per_word-perplexity', 'time'))
            writer.writerows(grid)
        print('\tResult is saved in fit_result.csv\n')
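
searcher.py only defines the class; a hypothetical driver (toy corpus, illustrative names) could look like:

from gensim import corpora
from searcher import Searcher

# A toy tokenized corpus; in practice topics.py prepares the real one.
texts = [['topic', 'model', 'corpus'], ['alpha', 'parameter', 'perplexity']]
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

searcher = Searcher(ntopics=[10, 20], params=[0.1, 0.3], corpus=corpus, id2word=id2word)
searcher.search()  # writes fit_result.csv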

settings.py

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
# path to solr source
url_solr = '<your solr url>'  # placeholder: set your own Solr endpoint, e.g. 'http://localhost:8983/solr' (hypothetical)

stop_calenda.txt

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
université
colloques
colloque
séminaires
séminaire
l'université
pause
déjeuner
professeur
session
journée
discussion
discussions
table
salle
séance
paris
