
Commit 950d34f

committed
Initialized github commit
0 parents  commit 950d34f

12 files changed: +1414 -0 lines changed

INSTALL.txt

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
Each script can be used on its own, so you can install the dependencies SEPARATELY. This is recommended, notably because some dependencies can be heavy, especially if you don't use all the scripts.


General Dependencies:
====================

python3, python3-pip


Dependencies for import.py:
===========================
pip3 install pysolr

Dependencies for topics.py:
===========================
pip3 install gensim nltk

To get the appropriate tokenizer, you need to download the punkt tokenizer models (http://www.nltk.org/_modules/nltk/tokenize/punkt.html):
In a shell:
$python3
>>>import nltk
>>>nltk.download()
Downloader> d
Identifier> punkt
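
Alternatively, the same download can be done non-interactively (a one-liner using the standard nltk API):
$python3 -c "import nltk; nltk.download('punkt')"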

Dependencies for graph.py:
==========================
pip3 install pandas matplotlib
apt-get install python3-tk


Dependencies for topics_visu.py:
================================
pip3 install wordcloud

LICENSE.md

Lines changed: 651 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
# toolpic: Topic Modelling


toolpic is a toolkit for topic modelling. Written in Python, it is based on unsupervised learning, namely latent Dirichlet allocation (LDA). The algorithm is implemented in gensim (https://radimrehurek.com/gensim/index.html). Four open source scripts manage the full process. The first one (import.py) extracts and imports text from the OpenEdition database (according to publication years and given languages). The second one (topics.py) cleans/stems the text in order to fit the training parameters or to train the topic model directly. The third one (graph.py) plots the parameter dependency. The last one (topics_visu.py) allows you to visualize each topic (two formats available).


## Contributors

Mathieu Orban.

## Installation

See INSTALL.txt

## Usage

See README.txt

## Licence

toolpic is released under the terms of the GNU AFFERO GENERAL PUBLIC LICENSE

## Documentation

README.txt

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
USAGE:
=====
For any script, you can get more detailed options with:
$python3 <script.py> -h



*******************
import.py
*******************

Summary:
-------
This script imports the full text from Solr by platform (required option), by selected years and by language.
Each text is saved in its own file, under a directory named after its year.


Settings:
---------
You first need to specify your Solr URL in settings.py.

Example:
-------
To import French text from revues.org published in 2005 and 2007:
$python3 import.py -p RO -i 2005 2007 -l fr -d mydirectory
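
With these options, the resulting layout would look like this (file names illustrative; they derive from the Solr document ids):
mydirectory/2005/revues.org_document_id.txt
mydirectory/2007/revues.org_other_document_id.txt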

Note:
----
The -q/--query option is not yet implemented. ['fr', 'en', 'es', ...]: OpenEdition abbreviations for specifying a language.


******************
topics.py
******************
Summary:
-------
This script:
- Gets a corpus directory.
- Cleans and/or stems each document of the corpus and converts the corpus to the gensim format.
- Builds a bag-of-words model.
- Applies a TF-IDF transformation.
- Runs the LDA model (in multiprocess mode), according to the chosen option, either to:
  - fit the alpha parameter and the number of topics (a CSV file is generated to evaluate the log perplexity), or
  - generate a topic model.


Files generated:
--------------
GENSIM FORMAT TEXT: '/tmp/gensim_docs.txt' is generated. A big text file where each line is one document (format required by gensim).
LOG: './lda_model.log' (by default) is always saved in the current directory (Python logging library).
FIT PARAMETERS (option -f=True): './fit_result.csv' is saved in the current directory. This file can be used for visualisation with graph.py.
TRAIN MODEL (option -f=False; default): './lda.model.expElogbeta.npy', './lda.model.id2word', './lda.model.state' and './lda.model' are saved in the current directory. ONLY 'lda.model' can be used for data visualisation with topics_visu.py (a reload sketch is shown at the end of this section).

Dictionary:
---------
By default, a stop_tartarus.txt (http://snowball.tartarus.org/algorithms/french/stemmer.html) and a stop_calenda.txt are provided. These are lists of stop words.

Example:
-------
To fit the best parameters, with gensim info logs, for French text (documents are under the directory train):
$./topics.py -l french -d train -f True -vvv
To train the model on a French corpus:
$./topics.py -l french -d train


Note:
----
Take care of memory overload in Python multiprocessing. See option -m for details. ['french', 'english', 'spanish', ...]: stemmer names for specifying a language.

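The trained model can be reloaded later through the standard gensim API (a minimal sketch; 'lda.model' as produced above):
$python3
>>>from gensim import models
>>>lda = models.LdaModel.load('lda.model')
>>>lda.show_topics(num_topics=5, num_words=10)
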
*******************
graph.py
*******************
Summary:
------
This script plots two PDF files to help choose the right alpha parameter and the right number of topics for the corpus.

Required:
--------
The './fit_result.csv' file must be present in the current directory to run graph.py. See above (topics.py) for how to generate it. A sample of the expected CSV is shown after this section.

Files generated:
--------------
Num_topics_correlation.pdf and Alpha_correlation.pdf in the current directory.
To get a better idea of log perplexity, see http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf.

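For reference, 'fit_result.csv' carries the header written by topics.py; the values below are purely hypothetical:
topics,parameter,perplexity,per_word-perplexity,time
20,0.10,-1523847.21,-8.31,42.7
20,0.30,-1498210.05,-8.12,41.9
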
*******************
topics_visu.py
*******************
Summary:
------
This script visualizes each topic of a trained LDA model, both as a flat word list and as word clouds.

Required:
--------
The './lda.model' file must be present in the current directory to run topics_visu.py. See above (topics.py) for how to generate it.

Files generated:
--------------
'./output_file.txt'. This text file represents a flat list of words per topic.
i files './WordCloud{i}.pdf', where i is the number of topics. Each represents the word cloud of one topic; the size of a word is proportional to the probability of the word in the topic.

Notes:
-----
Be careful: you generate as many PDF files as there are topics.


*******************
Full pipeline
*******************
Summary:
-----
A full pipeline could be (a concrete command sequence is sketched below):
- Import data for a specific platform and a specific language (import.py).
- Fit the parameters (very long running time) with option -f (topics.py).
- Analyse the results to select the best parameters (graph.py).
- Train the model (long running time) with the parameters chosen above (topics.py).
- Visualize and analyse the topics and the words per topic (topics_visu.py).

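A possible end-to-end run (options are illustrative; check each script with -h):
$python3 import.py -p RO -i 2005 2007 -l fr -d train
$./topics.py -l french -d train -f True
$python3 graph.py
$./topics.py -l french -d train
$python3 topics_visu.py
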

graph.py

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
#!/usr/bin/env python3

import pandas
import matplotlib.pyplot as plt

df = pandas.read_csv('fit_result.csv')

# One curve per alpha value: per-word perplexity against the number of topics.
for i in df['parameter'].unique():
    x = df[df['parameter']==i]['topics']
    y = df[df['parameter']==i]['per_word-perplexity']
    plt.title('Parameter dependency')
    plt.xlabel('Number of Topics')
    #plt.ylim([-1, 20])
    plt.ylabel('Perplexity by word')
    plt.plot(x, y, linestyle='--', marker='o', label='alpha=%s' % i)
    plt.tick_params(axis='y', which='both', labelleft='off', labelright='on')
plt.legend(loc='best')
plt.savefig('Alpha_correlation.pdf', format='pdf')
plt.close()


# One curve per number of topics: per-word perplexity against alpha.
for i in df['topics'].unique():
    x = df[df['topics']==i]['parameter']
    y = df[df['topics']==i]['per_word-perplexity']
    plt.title('Topic dependency')
    plt.xlabel('parameter')
    #plt.ylim([-1, 20])
    plt.ylabel('Perplexity by word')
    plt.plot(x, y, linestyle='--', marker='o', label='topic=%s' % i)
    plt.tick_params(axis='y', which='both', labelleft='off', labelright='on')
plt.legend(loc='best')
plt.savefig('Num_topics_correlation.pdf', format='pdf')

import.py

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
#!/usr/bin/env python3

__author__ = 'morban'

import pysolr
import os
import sys
import html
import settings as s

import argparse

# argparse
parser = argparse.ArgumentParser(description='Import data from solr. Written by Mathieu Orban.')
parser.add_argument('-i', '--index_list', nargs='+', help='Pass a list of different year indexes', default=None)
parser.add_argument('-b', '--begin', type=int, help='First year to process')
parser.add_argument('-e', '--end', type=int, help='Last year to process')
parser.add_argument('-l', '--lang', type=str, help='Language specified')
parser.add_argument('-q', '--query', type=str, help='Specify your own query (not yet implemented)')
parser.add_argument('-p', '--platform', type=str, help='Type of platform selected', required=True)
parser.add_argument('-d', '--directory', type=str, help='Directory where text will be saved', default='./train/')
args = parser.parse_args()


def findNumFound(solr, request, fq=None):
    # Query with rows=0: we only need the total hit count.
    params = {'rows': 0, 'fq': fq}
    results = solr.search(request, **params)
    return results.hits

def importCalenda(year):
    solr = pysolr.Solr(s.url_solr, timeout=20)
    request = 'platformID:{} AND yearFacet:{}'.format(args.platform, year)
    filter_queries = ['texte_{}:[* TO *]'.format(args.lang)]
    numFound = findNumFound(solr, request, fq=filter_queries)
    print(numFound)
    stop = numFound
    step = 100
    # Get results by data bundle
    for i in range(0, stop, step):
        print(i)
        params = {'rows': step, 'start': i, 'sort': 'id DESC', 'fq': filter_queries}
        results = solr.search(request, **params)
        print(len(results))
        saveInputFiles(results, year)

def saveInputFiles(results, year):
    year_directory = '{0}/{1}'.format(args.directory, year)
    if not os.path.exists(year_directory):
        os.makedirs(year_directory)
    for result in results:
        name_id = ''.join((result['id'].replace('http://', '').replace('/', '_'), '.txt'))
        write_path = '{}/{}'.format(year_directory, name_id)
        # Create the file if absent, overwrite it if it already exists.
        if not os.path.exists('./{}'.format(write_path)):
            mode = 'a'
        else:
            mode = 'w'
        print('\t Processing %s' % write_path)
        # Unescape HTML entities from the language-specific text field
        # (was hard-coded to 'texte_fr'; use the language passed with -l).
        naked_texte = html.unescape(result['texte_{}'.format(args.lang)])
        with open(write_path, mode) as f:
            f.write(naked_texte)

if __name__ == '__main__':
    # -i takes precedence; otherwise process every year from -b to -e, inclusive.
    year_list = args.index_list if args.index_list else list(range(args.begin, args.end + 1))
    for i in year_list:
        year = str(i)
        print('Processing year : {}'.format(year))
        importCalenda(year)

searcher.py

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
#!/usr/bin/env python3
from gensim.matutils import Sparse2Corpus
import time
import random
from gensim import models
import numpy as np
import csv

## @brief Class designed to fit parameters for LDA
class Searcher(object):

    # @brief Instantiate a Searcher object
    # @param ntopics: list. A list of different numbers of topics
    # @param params: list. A list of alpha parameters
    # @param corpus: list. A list representing the vectorized documents
    # @param id2word: dict. A mapping between ids and words
    def __init__(self, ntopics, params, corpus, id2word):
        self.ntopics = ntopics
        self.params = params
        self.corpus = corpus
        self.id2word = id2word

    # @brief Split the corpus into train and test sets
    # @return Tuple. A tuple of 2 corpora
    def shuffleCorpus(self):
        cp = list(self.corpus)
        random.shuffle(cp)
        # split into 95% training and 5% test sets
        p = int(len(cp) * .95)
        return (cp[0:p], cp[p:])

    # @brief Write the log perplexity for each given parameter to a CSV file
    def search(self):
        (cp_train, cp_test) = self.shuffleCorpus()
        number_of_words = sum(cnt for document in cp_test for _, cnt in document)
        grid = list()
        print('\t\tNumber of words in test corpus: %f\n' % number_of_words)
        for num_topic in self.ntopics:
            for param in self.params:
                print("\t\t**********************************\n \tStarting pass for %d topics and parameter_value = %.2f\n" % (num_topic, param))
                start_time = time.time()
                lda = models.LdaMulticore(corpus=cp_train, id2word=self.id2word, num_topics=num_topic, chunksize=2000, passes=1, alpha=param, eta=param)
                train_time = time.time() - start_time
                print('\tTraining time: %s\n' % train_time)

                start_time = time.time()
                perplex = lda.bound(cp_test)
                print('\tPerplexity: %4f\n' % perplex)

                #per_word_perplex = np.exp2(-perplex / number_of_words)
                per_word_perplex = lda.log_perplexity(cp_test)
                print('\tPer-word Perplexity: %s\n' % per_word_perplex)

                elapsed = time.time() - start_time
                print('\tPerplexity time: %s\n' % elapsed)

                result = [num_topic, param, perplex, per_word_perplex, elapsed]
                grid.append(result)

        with open('fit_result.csv', 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(('topics', 'parameter', 'perplexity', 'per_word-perplexity', 'time'))
            writer.writerows(grid)
        print('\tResult is saved in fit_result.csv\n')
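
searcher.py only defines the class; a hypothetical driver (toy corpus, illustrative names) could look like:

from gensim import corpora
from searcher import Searcher

# A toy tokenized corpus; in practice topics.py prepares the real one.
texts = [['topic', 'model', 'corpus'], ['alpha', 'parameter', 'perplexity']]
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

searcher = Searcher(ntopics=[10, 20], params=[0.1, 0.3], corpus=corpus, id2word=id2word)
searcher.search()  # writes fit_result.csv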

settings.py

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
# path to solr source
url_solr = '<your solr url>'  # placeholder: set your own Solr endpoint, e.g. 'http://localhost:8983/solr' (hypothetical)

stop_calenda.txt

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
université
colloques
colloque
séminaires
séminaire
l'université
pause
déjeuner
professeur
session
journée
discussion
discussions
table
salle
séance
paris
