Information-Retrieval-2

Building Weighted TF-IDF Model, Introducing Spell Error Correction and Improving Efficiency of Model.

1. ir_2.py - code file
2. documents.txt - contains the documents 
3. query.txt - queries provided for the documents
4. relevance_assessment.txt - relevant documents for each query from query.txt file to calculate precision and recall recall metrics

In ir_2.py (code file) -

Task 1 Weighted TF-IDF Model

(a) Implement a weighted TF-IDF model M f ull for I f ull . Rank the documents as per their relevance score. Display the top 20 results (decreasing order of score) in form of a table. Format:

Query ID | Doc IDs | Scores

123 | 1, 2, 5, 6 | 0.7, 0.66, 0.56, 0.4

(b) Improve the top-K precision of your model. Plot a grouped bar plot for top-K precision and recall of M P . X axis has query ID, Y axis has scores and legends represent the Precision and Recall. Plot three subplots for K = 2, 5, and 10. If number of results are lesser than 10 then plot the result for maximum value available.

Task 2 Spell Error Correction

(a) Induce minimum three spell errors randomly in your query words: i) W1: delete one character, ii) W2: insert and delete one character, and iii) W3: swap two random non-adjacent characters. Display the potential correct term based on the Levenshtein edit distance score for each query. You can define a threshold for selecting the term. Cost of insertion = deletion = 1, substitution = 2. It should not be too less or too large. Result format:

Query ID | Typo | Term | Distance

123 | Infotmarion | Information | 4

(b) Using the same threshold as Task 2a, improve the top K precision for your queries. K = 5 and 10. Refer this model to as M spell . Display results in a suitable format.

(c) Using the same set of queries from Task 2a, Display the potential correct term based on the QWERTY edit distance score for each query. Assume that the first character of every word in query is correct.

(d) Using the same threshold as Task 2a, and updated queries after addressing QWERTY typographical errors, improve the top K precision for your queries. K = 5 and 10. Refer this model to as M qwerty .

Task 3 Efficiency Model

(a) Display the time (in seconds) taken by M f ull , M P , M spell , and M qwerty for each respective query.

(b) Implement a heuristic to improve the efficiency of your models M f ull , M P , M spell , and M qwerty , i.e., reduce the time taken by each model.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Information-Retrieval-2

Building Weighted TF-IDF Model, Introducing Spell Error Correction and Improving Efficiency of Model.

Task 1 Weighted TF-IDF Model

Task 2 Spell Error Correction

Task 3 Efficiency Model

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
README.md		README.md
documents.txt		documents.txt
ir_2.py		ir_2.py
query.txt		query.txt
relevance_assessment.txt		relevance_assessment.txt

amoeba11/Information-Retrieval-2

Folders and files

Latest commit

History

Repository files navigation

Information-Retrieval-2

Building Weighted TF-IDF Model, Introducing Spell Error Correction and Improving Efficiency of Model.

Task 1 Weighted TF-IDF Model

Task 2 Spell Error Correction

Task 3 Efficiency Model

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages