A tool for indexing and searching PDFs.
Special thanks to friends and lainons.
- outline all requirements
- outline generic algorithm dataflow
- model data based on needs
- implement database functionalities ...
- check for empty query
- check for duplicates
- update duplicates
- add database version and header
pdftotext (poppler-utils)
pdftotext appends ^L (form feed) at the end of every page.
This can be used to point to the page of the best result and of the other results.
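As a rough illustration of the page-lookup idea, the sketch below splits pdftotext output on the form feed character and reports which pages contain a term. The function names split_pages and pages_containing are hypothetical, and the plain substring match is an assumption made only to keep the example short.

```python
FORM_FEED = "\f"  # the ^L character that terminates each page


def split_pages(text: str) -> list[str]:
    """Split extracted text into pages, treating ^L as a page terminator."""
    pages = text.split(FORM_FEED)
    # Each page ends with ^L, so the final chunk after the last ^L is empty.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def pages_containing(text: str, term: str) -> list[int]:
    """Return 1-based page numbers whose text contains `term` (case-insensitive)."""
    needle = term.lower()
    return [
        number
        for number, page in enumerate(split_pages(text), start=1)
        if needle in page.lower()
    ]
```

A result list could then show the page number next to each matching document.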
Indexing:
- Take all documents and build all frequencies we need.
- Compose the frequencies to get weights for each term for each document.
- Frequencies are indexed by term.
- Saved to db
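The indexing steps above could be sketched as follows. This is only an illustration under assumptions: the whitespace tokenizer, the name build_index, and the weighting formula (raw term count times log(N / document frequency)) are choices made for the example, and saving to the database is left out.

```python
import math
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()


def build_index(documents: dict[str, str]):
    """Return (weights, doc_freq) for a mapping of document id -> text.

    weights is indexed by term: term -> {doc_id: weight}.
    """
    # Step 1: build the frequencies we need.
    term_freq = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    doc_freq = Counter()
    for counts in term_freq.values():
        doc_freq.update(counts.keys())  # one count per document containing the term

    # Step 2: compose the frequencies into one weight per term per document.
    n_docs = len(documents)
    weights = defaultdict(dict)
    for doc_id, counts in term_freq.items():
        for term, tf in counts.items():
            idf = math.log(n_docs / doc_freq[term])
            weights[term][doc_id] = tf * idf
    return dict(weights), doc_freq
```

With this formula, a term that appears in every document gets weight 0, so it does not influence the ranking.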
Querying:
- The query terms are weighted to compose a query vector.
- Frequencies are retrieved to compose a vector for each document.
- Calculate cos(θ) for each document and rank it (highest is best).
- Display the results 20 per page.
- Cache results.
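The 20-results-per-page display could look roughly like this. The names paginate and page_count are hypothetical, and pages are assumed to be numbered from 1.

```python
RESULTS_PER_PAGE = 20


def paginate(ranked: list, page: int, per_page: int = RESULTS_PER_PAGE) -> list:
    """Return the slice of an already ranked list shown on a 1-based page."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    start = (page - 1) * per_page
    return ranked[start:start + per_page]


def page_count(total: int, per_page: int = RESULTS_PER_PAGE) -> int:
    """Number of pages needed for `total` results (ceiling division)."""
    return -(-total // per_page)
```

Because the ranked list is sliced rather than recomputed, this step works directly on cached results.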
Caching:
- Cache is invalidated when a new document is inserted or removed from the library.
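One possible shape for this invalidation rule is sketched below. The Library class and its search method are hypothetical, the ranking function is passed in by the caller, and clearing the whole cache on every change is an assumption that favours simplicity over precision.

```python
class Library:
    """Holds documents and a query cache that is dropped on any change."""

    def __init__(self):
        self.documents = {}  # doc_id -> text
        self.cache = {}      # normalized query -> ranked results

    def insert_document(self, doc_id, text):
        self.documents[doc_id] = text
        self.cache.clear()  # cached rankings may now be stale

    def remove_document(self, doc_id):
        if self.documents.pop(doc_id, None) is not None:
            self.cache.clear()

    def search(self, query, rank):
        # Normalize case and whitespace so equivalent queries share an entry.
        key = " ".join(query.lower().split())
        if key not in self.cache:
            self.cache[key] = rank(key, self.documents)
        return self.cache[key]
```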
The following interfaces are required:
insert-document
insert-term
remove-document
remove-term
To facilitate this, the data-structure must have:
- the number of documents;
- for each term, the number of documents it appears in;
- for each document, the frequency of each term.
This is sufficient to allow:
- indexing new documents without evaluating every other document;
- querying a vector of terms and comparing the affinity.
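A minimal in-memory sketch of such a data structure follows. The class name TermIndex is hypothetical, and the exact semantics of the four interfaces are assumptions: here insert_term records one occurrence of a term in a document, and remove_term drops all occurrences of a term from a document.

```python
from collections import Counter


class TermIndex:
    def __init__(self):
        self.doc_freq = Counter()  # term -> number of documents containing it
        self.term_freq = {}        # doc_id -> Counter(term -> occurrences)

    @property
    def n_docs(self):
        return len(self.term_freq)

    def insert_document(self, doc_id, terms):
        if doc_id in self.term_freq:
            self.remove_document(doc_id)  # re-inserting replaces the old counts
        self.term_freq[doc_id] = Counter()
        for term in terms:
            self.insert_term(doc_id, term)

    def insert_term(self, doc_id, term):
        counts = self.term_freq.setdefault(doc_id, Counter())
        if counts[term] == 0:
            self.doc_freq[term] += 1  # first occurrence in this document
        counts[term] += 1

    def remove_term(self, doc_id, term):
        counts = self.term_freq.get(doc_id)
        if counts is None or term not in counts:
            return
        del counts[term]
        self.doc_freq[term] -= 1
        if self.doc_freq[term] == 0:
            del self.doc_freq[term]

    def remove_document(self, doc_id):
        for term in list(self.term_freq.get(doc_id, ())):
            self.remove_term(doc_id, term)
        self.term_freq.pop(doc_id, None)
```

Inserting a document only touches the counters of its own terms, which is what makes indexing possible without evaluating every other document.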
It is sufficient to map the query terms into a vector of weights based on tf-idf. Then, by transposing the term-indexed matrix into one vector per document and computing the cosine of the angle between each document vector and the query vector, we get the affinity of each document to the query. The ranking must keep each score paired with its document while sorting.
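The ranking step might be sketched like this, assuming the weights are stored as nested dictionaries indexed by term. The names transpose, cosine and rank are hypothetical. Scores are kept in (score, doc_id) pairs so the documents stay attached while sorting.

```python
import math
from collections import defaultdict


def transpose(by_term):
    """Turn term -> {doc_id: weight} into doc_id -> {term: weight}."""
    by_doc = defaultdict(dict)
    for term, docs in by_term.items():
        for doc_id, weight in docs.items():
            by_doc[doc_id][term] = weight
    return dict(by_doc)


def cosine(u, v):
    """cos(θ) between two sparse vectors stored as term -> weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_u * norm_v)


def rank(query_vec, by_term):
    """Return (score, doc_id) pairs sorted by descending score."""
    scored = [
        (cosine(query_vec, doc_vec), doc_id)
        for doc_id, doc_vec in transpose(by_term).items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```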
Depending on the search method, we can choose between several data structures to count occurrences of terms:
- a hashmap is good when we look up exact terms;
- I don't know how good the SQLite index could be;
- a DAFSA is good for building a fuzzy search for each term (see lunr.js TokenSet).
No. Using SQLite to update the term count is simply painful. It takes more than a minute to add a document because of this.