Flask services for training and testing Watson, FastText, Gensen embeddings and hDBScan. You can set them up on one of your development servers and access them from all of your other servers. This saves environment setup time and gives every team member access to the best algorithm hyperparameters. It also supports MLflow for logging model information.
In the files, where required, replace `system_ip` with your system's IP address or localhost.
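Since every service is reached at `system_ip` on its own port, it can help to keep the address in one place. The helper below is only a sketch: `SERVICE_PORTS` and `endpoint_url` are hypothetical names that do not exist in this repo, and the ports are the ones that appear in the usage examples of this README.

```python
# Hypothetical helper, not part of this repo: builds service URLs in one place.
# The ports are those used in the usage examples of this README.
SERVICE_PORTS = {
    "gensen": 7654,
    "fasttext": 7655,
    "watson": 7656,
    "hdbscan": 7657,
}


def endpoint_url(service, path, system_ip="localhost"):
    """Return the full URL for `path` on the given service."""
    if service not in SERVICE_PORTS:
        raise ValueError(f"unknown service: {service!r}")
    # Strip a leading slash so the result never contains '//' after the port.
    return f"http://{system_ip}:{SERVICE_PORTS[service]}/{path.lstrip('/')}"


print(endpoint_url("watson", "watson/test"))  # http://localhost:7656/watson/test
```

The returned string can be passed directly to `requests.post` in place of the hard-coded URLs.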
The following models are currently available:
- Gensen
- Watson[1]
- Fast Text[1]
- hDBScan
[1] MLflow is integrated for logging. See Test.ipynb for usage.
conda create -n new_environment --file conda-req.txt
pip install -r pip-req.txt
Install MLflow, then start it from this repo's root directory with `mlflow server -p 3457 -h 0.0.0.0 --backend-store-uri watson_mlruns/` and `mlflow server -p 3456 -h 0.0.0.0 --backend-store-uri fasttext_mlruns/`.
Official Repo: https://github.com/Maluuba/gensen
Paper: https://arxiv.org/abs/1804.00079
- Usage:

      import json

      import numpy as np
      import pandas as pd
      import requests

      def f(sent):
          # Request the embedding of a single sentence and flatten it to 1-D.
          r = requests.post('http://{system_ip}:7654/get_embeddings/', json={'sentences_list': [sent]})
          arr = np.asarray(json.loads(r.text)['vectors']).flatten()
          return arr

      data = pd.read_pickle('data/path')
      data['emb'] = data['text'].apply(f)
      data[['id','emb']].to_pickle('path_to_save/{f_name}.pkl')
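The usage above sends one request per row. Because the endpoint takes a `sentences_list`, several sentences could in principle go into each request. The sketch below only builds the batched request bodies. `batched_payloads` and `batch_size` are hypothetical names, and the layout of `vectors` for multi-sentence requests is not documented here, so check the response shape before relying on this.

```python
def batched_payloads(sentences, batch_size=32):
    """Yield JSON bodies for get_embeddings/, each with up to batch_size sentences.

    Hypothetical helper; only the 'sentences_list' key comes from this README.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sentences), batch_size):
        yield {"sentences_list": list(sentences[start:start + batch_size])}


payloads = list(batched_payloads(["a", "b", "c", "d", "e"], batch_size=2))
print(len(payloads))  # 3
```

Each payload can then be posted with `requests.post('http://{system_ip}:7654/get_embeddings/', json=payload)`.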
Due to security issues, the `emailed:True` param does not email the results. Instead, it stores them on disk, where they can be retrieved from `watson/results`. See below for more.
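When a test runs in the background, the client has to come back for the results later. The loop below is one possible way to do that, offered as a sketch only. `wait_for_results` is a hypothetical helper, and `fetch` is any callable you supply (for example one that posts to `watson/results`). Treating a `None` return value as "not ready yet" is an assumption, since this README does not say how the server reports unfinished runs.

```python
import time


def wait_for_results(fetch, attempts=10, delay=30.0):
    """Call fetch() until it returns something other than None.

    Assumption: fetch() returns None while the results are not ready.
    Raises TimeoutError after `attempts` unsuccessful tries.
    """
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        # Sleep between tries, but not after the final one.
        if attempt < attempts - 1:
            time.sleep(delay)
    raise TimeoutError(f"no results after {attempts} attempts")


# Demo with a fake fetch that succeeds on the third call.
calls = iter([None, None, {"ready": True}])
print(wait_for_results(lambda: next(calls), attempts=5, delay=0))  # {'ready': True}
```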
The Watson endpoint currently offers 5 operations.
- `watson/`
  Trains a new Watson NLC and, once training finishes, tests it on the test data. Results are saved on disk and can be retrieved from `watson/results`.
  - Required Params:
    - train_data: must have 'text', 'label' and 'id' columns
    - test_data: must have 'text', 'label' and 'id' columns
    - api_key
  - Optional Params:
    - model_name: randomly generated for the user by default.
    - ml_flow_params: (dict) used to log the model in MLflow. It should have:
      - experiment_name: e.g. "Text Classifier"
      - run_name: e.g. "experiment v4"
      - description: e.g. "using latest train data"
  - Usage:

        params = {'train_data': data[['text','label','id']].to_dict(),
                  'test_data': data[['text','label','id']].to_dict(),
                  'model_name':'my-test_clf',
                  'api_key':''}
        r = requests.post('http://{system_ip}:7656/watson/', json=params)
        res = json.loads(r.text)
        print(res.keys())
- `watson/test`
  Returns predictions on the passed data using a trained NLC.
  - Required Params:
    - test_data: must have 'text', 'label' and 'id' columns
    - api_key
    - model_name: name of the trained NLC
  - Optional Params:
    - emailed: (default=False) if True, the test function runs in the background and saves the results on disk, where they can be retrieved from `watson/results`. If False, the client waits for the results.
  - Usage:

        params = {'test_data': data[['text','label','id']].to_dict(),
                  'model_name':'my-test_clf',
                  'api_key':''}
        r = requests.post('http://{system_ip}:7656/watson/test', json=params)
        res = json.loads(r.text)
        print(res.keys())
        df_res, acc, report, report_txt, conf_mat = res['df_res'], res['acc'], res['report'], res['report_txt'], res['conf_mat']
        df_res = pd.DataFrame(df_res)
        report = pd.DataFrame(report)
        conf_mat = pd.DataFrame(conf_mat)
        print(report_txt)
- `watson/results`
  Retrieves the test results stored on the server.
  - Required Params:
    - model_name: name of the trained NLC
  - Optional Params:
    - delete_files: (default=False) if True, deletes the results from the server after retrieving them. If False, the results are kept.
  - Usage:

        params = {'model_name':'my-test_clf'}
        r = requests.post('http://{system_ip}:7656/watson/results', json=params)
        res = json.loads(r.text)
        print(res.keys())
- `watson/delete`
  Deletes an NLC from the Watson instance.
  - Required Params:
    - model_name: name of the trained NLC
  - Usage:

        params = {'model_name':'my-test_clf', 'api_key':''}
        r = requests.post('http://{system_ip}:7656/watson/delete', json=params)
        res = json.loads(r.text)
        print(res.keys())
- `watson/mlflow`
  Returns the URL of the MLflow endpoint.
  - Usage:

        r = requests.post('http://{system_ip}:7656/watson/mlflow')
        res = json.loads(r.text)
        print(res.keys())
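The required and optional parameters of `watson/` listed above can be assembled and checked before any request is sent. The sketch below works on the column-oriented dicts that `DataFrame.to_dict()` produces. `REQUIRED_COLS`, `check_columns` and `watson_train_params` are hypothetical helpers rather than part of this repo, and the server may perform its own, different validation.

```python
REQUIRED_COLS = ("text", "label", "id")


def check_columns(table, name):
    """Raise if a column-oriented dict (as from DataFrame.to_dict()) lacks a required column."""
    missing = [col for col in REQUIRED_COLS if col not in table]
    if missing:
        raise ValueError(f"{name} is missing columns: {missing}")


def watson_train_params(train_data, test_data, api_key,
                        model_name=None, ml_flow_params=None):
    """Assemble the JSON body for watson/ (hypothetical helper)."""
    check_columns(train_data, "train_data")
    check_columns(test_data, "test_data")
    params = {"train_data": train_data, "test_data": test_data, "api_key": api_key}
    # Optional keys are only sent when set, so the server defaults apply otherwise.
    if model_name is not None:
        params["model_name"] = model_name
    if ml_flow_params is not None:
        params["ml_flow_params"] = ml_flow_params
    return params


table = {"text": {0: "hello"}, "label": {0: "greeting"}, "id": {0: 1}}
params = watson_train_params(
    table, table, api_key="",
    ml_flow_params={
        "experiment_name": "Text Classifier",
        "run_name": "experiment v4",
        "description": "using latest train data",
    },
)
print(sorted(params))  # ['api_key', 'ml_flow_params', 'test_data', 'train_data']
```

The resulting dict can be posted with `requests.post('http://{system_ip}:7656/watson/', json=params)`.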
The Fast Text endpoint offers 4 operations.
- `fasttext/`
  Trains a new Fast Text classifier and, once training finishes, tests it on the test data.
  - Required Params:
    - train_data: must have 'text', 'label' and 'id' columns
    - test_data: must have 'text', 'label' and 'id' columns
  - Optional Params:
    - valid_data: must have 'text', 'label' and 'id' columns
    - model_name: randomly generated for the user by default.
    - clf_params: user-defined classifier params
    - save_model: (default=False) whether to save the model on the server
    - ml_flow_params: (dict) used to log the model in MLflow. It should have:
      - experiment_name: e.g. "Text Classifier"
      - run_name: e.g. "experiment v4"
      - description: e.g. "using latest train data"
  - Usage:

        params = {'train_data': data[['text','label','id']].to_dict(),
                  'test_data': data[['text','label','id']].to_dict(),
                  'model_name':'my-test_clf'}
        r = requests.post('http://{system_ip}:7655/fasttext/', json=params)
        res = json.loads(r.text)
        print(res.keys())
- `fasttext/test`
  Returns predictions on the passed data, provided the model was saved on the server during training.
  - Required Params:
    - test_data: must have 'text', 'label' and 'id' columns
    - model_name: name of the trained model
  - Usage:

        params = {'test_data': data[['text','label','id']].to_dict(),
                  'model_name':'my-test_clf'}
        r = requests.post('http://{system_ip}:7655/fasttext/test', json=params)
        res = json.loads(r.text)
        print(res.keys())
- `fasttext/delete`
  Deletes a model from the server.
  - Required Params:
    - model_name: name of the trained model
  - Usage:

        params = {'model_name':'my-test_clf'}
        r = requests.post('http://{system_ip}:7655/fasttext/delete', json=params)
        res = json.loads(r.text)
        print(res.keys())
- `fasttext/mlflow`
  Returns the URL of the MLflow endpoint.
  - Usage:

        r = requests.post('http://{system_ip}:7655/fasttext/mlflow')
        res = json.loads(r.text)
        print(res.keys())
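The optional `clf_params` of `fasttext/` make it possible to try several hyperparameter settings against the same server. The sketch below expands a small grid into one `clf_params` dict per combination. `clf_param_grid` is a hypothetical helper. The keys `lr`, `epoch` and `wordNgrams` are fastText training arguments, but whether this server forwards them unchanged is an assumption that this README does not confirm.

```python
from itertools import product


def clf_param_grid(grid):
    """Expand {'name': [values, ...]} into a list of clf_params dicts (hypothetical helper)."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


# Assumed keys: lr, epoch and wordNgrams are fastText training arguments;
# whether this server accepts them as-is is not confirmed by this README.
grid = {"lr": [0.1, 0.5], "epoch": [10, 25], "wordNgrams": [1, 2]}
candidates = clf_param_grid(grid)
print(len(candidates))  # 8
print(candidates[0])    # {'epoch': 10, 'lr': 0.1, 'wordNgrams': 1}
```

Each candidate can be sent as `clf_params` in the `fasttext/` request body, with `ml_flow_params` set so that every run is logged for comparison.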
- `hdbscan/`
  Runs hDBScan on the given data.
  - Required Params:
    - train_data: must have 'emb' and 'id' columns. If 'emb' is not available, the data must have a 'text' column in its place.
  - Optional Params:
    - clf_params: user-defined params
  - Usage:

        # Use 'emb' when embeddings are available; otherwise send 'text' in its place.
        data['emb'] = data['emb'].apply(lambda x: x.tolist())  # arrays to JSON-serializable lists
        params = {'train_data': data[['emb','id']].to_dict()}
        r = requests.post('http://{system_ip}:7657/hdbscan/', json=params)
        res = json.loads(r.text)
        print(res.keys())
        df_res = res['df_res']
        df_res = pd.DataFrame(df_res)
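The rule that `train_data` carries 'emb' when available and 'text' otherwise can be made explicit in code. The sketch below picks the column from a column-oriented dict such as the one `DataFrame.to_dict()` returns. `hdbscan_params` is a hypothetical helper and not part of this repo.

```python
def hdbscan_params(table, clf_params=None):
    """Build the JSON body for hdbscan/ from a column-oriented dict (hypothetical helper).

    Uses 'emb' when present and falls back to 'text' otherwise.
    """
    if "id" not in table:
        raise ValueError("train_data needs an 'id' column")
    if "emb" in table:
        feature = "emb"
    elif "text" in table:
        feature = "text"
    else:
        raise ValueError("train_data needs an 'emb' or a 'text' column")
    params = {"train_data": {feature: table[feature], "id": table["id"]}}
    # clf_params is optional, so it is only included when provided.
    if clf_params is not None:
        params["clf_params"] = clf_params
    return params


table = {"text": {0: "first doc", 1: "second doc"}, "id": {0: 10, 1: 11}}
print(list(hdbscan_params(table)["train_data"]))  # ['text', 'id']
```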