feat: implement doctrine-based fulltext search #2118
benjaminfrueh wants to merge 8 commits into main
"symfony/string": "^6.0",
"symfony/translation-contracts": "^3.6",
"teamtnt/tntsearch": "^5.0"
"wamania/php-stemmer": "^4.0"
We use this one in a few other apps: https://github.com/search?q=org%3Anextcloud+wamania%2Fphp-stemmer&type=code
It may be worth scoping the dependency to avoid conflicts between different versions across apps. https://arthur-schiwon.de/isolating-nextcloud-app-dependencies-php-scoper
Added dependency scoping in commit 83f65aa
07cbae6 to d8a4b06
mejo- left a comment
Really nice work, thanks so much @benjaminfrueh. I finally found time to go through the code changes, read up a bit on stemming, fuzzy searching, bigrams and other stuff I didn't know much about before 😆
The general approach of your implementation looks really clean and promising to me.
I have quite a few comments and questions and am curious what you think about them.
public function stem(string $word): string {
    if ($this->stemmer === null && $this->stemmingEnabled) {
        $language = $this->config->getSystemValue('default_language', 'en');
I'm not sure whether using the instance's default language is the best option here. This effectively means that stemming only happens for the instance's default language? I guess this would be much more powerful if the language of the indexed document was used here, right?
Maybe there are simple algorithms to guess the language from the full document in the indexer and pass the detected language into the stemmer? Nothing that necessarily needs to happen in this PR, but still I'd be curious about your thoughts.
Language detection per document seems like the only proper solution. It comes with a small performance cost for indexing, but that should be fine. I found that there are language detection libraries, like https://github.com/patrickschur/language-detection, which could be used. What do you think?
We should then store the language in the collectives_s_files table, so the correct stemmer can be used for each document.
There are possible edge cases where a document's language changes or a document mixes several languages, but we would just have to save the first one we detect.
@mejo- I updated the PR and added language detection in this commit: 7e969f2
- added `patrickschur/language-detection` to detect the language of each document
- added a `language` column to `collectives_s_files` to store the detected language per file
- updated `Stemmer` to accept an optional `$language` parameter and cache stemmers per language
- updated `FileIndexer` to detect and save language during indexing in `collectives_s_files`
- detection during indexing is done on the first `2000` characters, defined as `LANGUAGE_DETECTION_LIMIT` to limit performance impact
- updated `FileSearcher` to search and stem all languages present in the collective, so search and stemming work with multilingual collectives
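For readers following along, the "cache stemmers per language" idea from the list above could look roughly like the sketch below. This is not the code from commit 7e969f2: the class name `StemmerCache`, the `$factory` closure and the `$created` counter are hypothetical, and the toy closure in the usage note stands in for a real stemmer library.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch, not the PR's actual Stemmer class.
final class StemmerCache {
    /** @var array<string, \Closure> one stemmer per language code */
    private array $stemmers = [];

    /** Number of stemmers built so far (only here to make the caching visible). */
    public int $created = 0;

    /**
     * @param \Closure $factory assumed signature: fn(string $language): \Closure,
     *                          where the returned closure maps a word to its stem
     */
    public function __construct(private \Closure $factory) {
    }

    public function stem(string $word, string $language): string {
        // Build the stemmer lazily, the first time a language is seen.
        if (!isset($this->stemmers[$language])) {
            $this->stemmers[$language] = ($this->factory)($language);
            $this->created++;
        }
        return ($this->stemmers[$language])($word);
    }
}
```

With a stand-in factory such as `fn(string $lang): \Closure => fn(string $w): string => strtolower($w)`, calling `stem()` twice for `'en'` and once for `'de'` builds only two stemmers, which is the point of caching when a collective contains documents in several languages.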
$hitCountParam = $qb->createNamedParameter($hitCount, IQueryBuilder::PARAM_INT);
$qb->update($this->tableName)
    ->set('num_hits', $qb->createFunction("num_hits - $hitCountParam"))
I wonder whether we want to safeguard against negative values here. Especially as the fields are unsigned integers, which means there's a risk of integer underflows, right?
Thanks for the feedback, updated it.
I'm a bit unsure whether that really is enough to safeguard against negative num_hits and num_files values. If I understand it correctly, trying to put negative values into unsigned table fields can result in integer underflows. What was your motivation to make the integer/bigint fields unsigned in the first place? That we don't expect negative values here?
How about this to safeguard?
$qb->update($this->tableName)
    ->set('num_hits', $qb->func()->greatest($qb->createFunction("num_hits - $hitCountParam"), 0))
    ->set('num_files', $qb->func()->greatest($qb->createFunction('num_files - 1'), 0))
    ->where($qb->expr()->eq('circle_unique_id', $qb->createNamedParameter($circleUniqueId)))
    ->andWhere($qb->expr()->eq('id', $qb->createNamedParameter($wordId)));
Yes, exactly. As num_hits and num_files are counts that semantically should never be negative, unsigned enforces that at the DB level. Of course this can lead to underflows and, as you say, should be safeguarded.
My safeguard prevented the decrement update if num_hits would underflow, which I think could potentially leave stale data.
Wrapping it in greatest() will still cause the error in MySQL/MariaDB, as the underflow already happens in the subtraction, before greatest() is evaluated. This can be tested with the following query, which fails with an out-of-range error:
UPDATE oc_collectives_s_words SET num_files = GREATEST(num_files - 2, 0) WHERE num_files = 1;
num_files should always be >= 1
I like the idea of just clamping num_hits to 0 and num_files to 1 with a greatest() function, to be self-healing and cover all the data inconsistencies and edge cases we didn't think about now.
So maybe we should change the columns to signed and then use a greatest() function, as unsigned is only enforced in MySQL/MariaDB anyway, as far as I know. This seems safer, what do you think?
Sounds like a good plan 😊 Let's use signed columns for num_hits and num_files and clamp the values using the greatest() function 👌
I updated this in commit b9a3c57
All integer columns are now signed, for overflow safety and also to be consistent with the other collectives tables. The num_hits and num_files values are now clamped to 0 using greatest().
The deleteOrphanedWords() function already deletes words with num_hits OR num_files <= 0 and is called right after the decrement, so this should be good against 0 and negative values here anyways.
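To make the order-of-evaluation argument from this thread concrete, here is a small plain-PHP model. It is only an illustration under stated assumptions: `unsignedSubtract()` and `clampedDecrement()` are hypothetical helpers that mimic how an unsigned column rejects the subtraction before any clamp runs, while a signed column lets the intermediate value go negative so the clamp can apply. It does not touch the database or the query builder.

```php
<?php
declare(strict_types=1);

// Hypothetical model of an UNSIGNED column: the subtraction itself fails
// when the result would be negative, so a surrounding GREATEST() never runs.
function unsignedSubtract(int $value, int $delta): int {
    $result = $value - $delta;
    if ($result < 0) {
        throw new RangeException('unsigned value is out of range');
    }
    return $result;
}

// Hypothetical model of a signed column wrapped in GREATEST(..., $floor):
// the intermediate result may be negative and is clamped afterwards.
function clampedDecrement(int $value, int $delta, int $floor = 0): int {
    return max($value - $delta, $floor);
}
```

In this model `clampedDecrement(1, 2)` yields the floor value instead of failing, whereas `unsignedSubtract(1, 2)` throws, which mirrors why the columns were switched to signed before clamping.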
d8a4b06 to b26823e
b26823e to 2ee83e2
2ee83e2 to 46319ae
Signed-off-by: Benjamin Frueh <benjamin.frueh@gmail.com>
Signed-off-by: Jonas <jonas@freesources.org>
Signed-off-by: Jonas <jonas@freesources.org>
46319ae to bfd5805
No longer needed with new search backend. Signed-off-by: Jonas <jonas@freesources.org>
Signed-off-by: Jonas <jonas@freesources.org>
bfd5805 to 14cf7c2
Signed-off-by: Benjamin Frueh <benjamin.frueh@gmail.com>
55c2713 to 7e969f2
Signed-off-by: Benjamin Frueh <benjamin.frueh@gmail.com>
7e969f2 to 3474638
Signed-off-by: Benjamin Frueh <benjamin.frueh@gmail.com>
6e9453d to 83f65aa
📝 Summary
Replaces TNTSearch with a full-text search built on the Nextcloud database, using Doctrine.
Resolves #2050
Todo
`php-stemmer` dependency using `php-scoper`
🏁 Checklist
`npm run lint` / `npm run stylelint` / `composer run cs:check`)