Copyright 2019 Megagon Labs
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
OpineDB is a subjective database engine for extracting, aggregating, and querying subjective data. See our paper for more details.
All experimental code is written in Python 3. To install the required packages:
pip install -r requirements.txt
To download the datasets, run:
cd data/
make
The datasets consist of a hotel review dataset from booking.com and a restaurant review dataset from Yelp. More specifically, our experiments consider 4 subsets of the two datasets: hotels in Amsterdam, hotels in London under $300 per night, low-price restaurants in Toronto, and Japanese restaurants in Toronto.
The files within each dataset are listed in the configuration files data/$city/config.json for $city in amsterdam, london, and toronto. For example, in data/amsterdam/config.json:
{
"s3_link" : "https://s3.us-east-2.amazonaws.com/yelp-opine/new_extractor/simple_opine/amsterdam_hotels.zip",
"zip_fn" : "amsterdam_hotels.zip",
"entity_fn" : "raw_hotels.json",
"raw_review_fn" : "raw_reviews.csv",
"extraction_fn" : "amsterdam_reviews_with_extractions.json",
"all_reviews_fn" : "all_reviews.json",
"queries_fn" : "hotel_queries.txt",
"histogram_fn" : "entities_with_histograms.json",
"sentiment_output_fn" : "sentiment.json",
"word2vec_fn" : "word2vec.model",
"idf_fn": "idf.json",
"labels_fn" : "labels.json"
}
The fields are:
- s3_link is the download link of the dataset,
- entity_fn is the JSON file containing the entity info,
- raw_review_fn is a CSV file of the raw text reviews,
- extraction_fn is the file of reviews with the extracted opinions (see the extractor section below for how to run the extraction pipeline),
- all_reviews_fn is a (large enough) list of review text for training the word2vec model,
- queries_fn is a list of crowdsourced subjective query predicates,
- histogram_fn is a JSON file containing the marker aggregates computed using util/generate_markers.py,
- sentiment_output_fn contains the normalized average sentiment of each extracted phrase,
- word2vec_fn is the Word2Vec model trained from all_reviews_fn,
- idf_fn is the IDF (inverse document frequency) of each token in the Word2Vec model, and
- labels_fn is the JSON file containing all the (entity, predicate) labels for evaluation.
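As a rough illustration of how these fields might be consumed, the sketch below loads a config.json and maps each file field to a path. It is not code from this repository: the helper names load_config and resolve_dataset_files are hypothetical, and it assumes the listed files sit in the same directory as config.json, which is not stated above.

```python
import json
from pathlib import Path

# Field names copied from the config.json example above.
FILE_FIELDS = [
    "entity_fn", "raw_review_fn", "extraction_fn", "all_reviews_fn",
    "queries_fn", "histogram_fn", "sentiment_output_fn",
    "word2vec_fn", "idf_fn", "labels_fn",
]


def load_config(data_dir):
    """Read data_dir/config.json and return it as a dict (hypothetical helper)."""
    with open(Path(data_dir) / "config.json", encoding="utf-8") as f:
        return json.load(f)


def resolve_dataset_files(config, data_dir):
    """Map each known file field to a path under data_dir.

    Assumption: every file lives next to config.json.
    Fields such as s3_link and zip_fn are deliberately skipped.
    """
    base = Path(data_dir)
    return {field: base / config[field] for field in FILE_FIELDS if field in config}
```

With the datasets downloaded, resolve_dataset_files(load_config("data/amsterdam"), "data/amsterdam") would give one path per field, which makes it easy to check that the generated files are present before running an experiment.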
The make command calls the script util/generate_markers.py for generating the files histogram_fn, sentiment_output_fn, word2vec_fn, and idf_fn.
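For readers unfamiliar with IDF, the sketch below shows one common way to compute it from tokenized reviews. The exact formula used by util/generate_markers.py is not given here, so the log(N / df) definition, the function name compute_idf, and the toy reviews are all assumptions made for illustration.

```python
import math
from collections import Counter


def compute_idf(documents):
    """Return {token: log(N / df)} for a list of tokenized documents.

    N is the number of documents and df is the number of documents
    that contain the token at least once (assumed, unsmoothed variant).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each token once per document, not once per occurrence.
        doc_freq.update(set(tokens))
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


# Toy reviews, invented for this example only.
reviews = [["clean", "room"], ["clean", "staff"], ["quiet", "room"], ["clean"]]
idf = compute_idf(reviews)
```

Tokens that appear in many reviews (here "clean") receive a lower weight than rarer ones (here "staff").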
- To run one round of the query result quality experiment on one set of entities:
python eval/evaluate.py amsterdam
The keyword amsterdam can be replaced with london, toronto_lp, or toronto_jp.
We also provide a Python script, eval/run_all.py, to run all the experiments with 10 repetitions. Run:
python eval/run_all.py
python eval/read_results.py
The read_results.py script will print out the data for the two quality-related tables in the original paper.
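To make the scope of a full run concrete, the sketch below enumerates the single-round commands that such a run would correspond to: one eval/evaluate.py invocation per dataset keyword per repetition. This is not the contents of eval/run_all.py, whose implementation and result handling are not shown here; build_commands is a hypothetical helper.

```python
import sys

# Dataset keywords accepted by eval/evaluate.py, as listed above.
DATASETS = ["amsterdam", "london", "toronto_lp", "toronto_jp"]


def build_commands(repetitions=10):
    """Return one command (as an argument list) per dataset per repetition."""
    return [
        [sys.executable, "eval/evaluate.py", name]
        for _ in range(repetitions)
        for name in DATASETS
    ]


commands = build_commands()
```

Each entry can be passed to subprocess.run(cmd, check=True) from the repository root. With the default of 10 repetitions this yields 40 runs.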
- To run the query interpreter experiments, use the script eval/eval_interpreter.py:
python eval/eval_interpreter.py retrain hotel
where the keyword retrain can be replaced with read_result to read the experimental results, and the keyword hotel can be replaced with restaurant to produce the results on restaurants.
See instructions in extractor/run_extractor.ipynb. The pipeline was used for generating the *_reviews_with_extractions.json files in the downloaded datasets.
See instructions in Section 2 of extractor/run_extractor.ipynb.
Note: The current Docker configuration supports Ubuntu only.
$ sudo apt install docker docker-compose
$ cd sql
$ bash run.sh
>>> from opine import SimpleOpine
>>> opine = SimpleOpine()
>>> sql = """
... SELECT h.name
... FROM hotel_amsterdam AS h
... WHERE h.opine = 'very clean room'
... AND h.price <= 15
... AND h.opine = 'helpful staff'
... AND h.opine = 'romantic'
... """
>>> print(opine.opine_sql(sql))