feat: implement doctrine-based fulltext search by benjaminfrueh · Pull Request #2118 · nextcloud/collectives

benjaminfrueh · 2025-11-11T14:28:36Z

📝 Summary

Replaces TNTSearch with a full-text search built on the Nextcloud database, using Doctrine.

Resolves #2050

Todo

Implement language detection in stemmer
Scope php-stemmer dependency using php-scoper
Remove documentation about extra tntsearch dependencies
Migrate e2e FTS tests to Playwright
Follow-up: add Arabic stemmer

🏁 Checklist

Code is properly formatted (npm run lint / npm run stylelint / composer run cs:check)
Sign-off message is added to all commits
Tests (unit, integration and/or end-to-end) passing and the changes are covered with tests
Documentation (README or documentation) has been updated or is not required

juliusknorr · 2025-11-11T21:04:26Z

composer.json

        "symfony/string": "^6.0",
        "symfony/translation-contracts": "^3.6",
-        "teamtnt/tntsearch": "^5.0"
+        "wamania/php-stemmer": "^4.0"


We use this one in a few other apps: https://github.com/search?q=org%3Anextcloud+wamania%2Fphp-stemmer&type=code

It may be worth scoping the dependency to avoid conflicts between the different versions used across apps: https://arthur-schiwon.de/isolating-nextcloud-app-dependencies-php-scoper

Added dependency scoping in commit 83f65aa

mejo-

Really nice work, thanks so much @benjaminfrueh. I finally found time to go through the code changes, read up a bit on stemming, fuzzy searching, bigrams and other stuff I didn't know much about before 😆

The general approach of your implementation looks really clean and promising to me.

I have quite a few comments and questions and am curious what you think about them.

mejo- · 2026-03-05T14:06:49Z

lib/Search/FileSearch/Stemmer/Stemmer.php

+
+	public function stem(string $word): string {
+		if ($this->stemmer === null && $this->stemmingEnabled) {
+			$language = $this->config->getSystemValue('default_language', 'en');


I'm not sure whether using the instance's default language is the best option here. Doesn't this effectively mean that stemming only works for the instance's default language? I guess it would be much more powerful if the language of the indexed document were used here, right?

Maybe there are simple algorithms to guess the language from the full document in the indexer and pass the detected language into the stemmer? This doesn't necessarily need to happen in this PR, but I'd still be curious about your thoughts.

Language detection per document seems like the only proper solution. It comes with a small performance cost for indexing, but that should be fine. There are language detection libraries, like https://github.com/patrickschur/language-detection, which could be used. What do you think?

We should then store the language in the collectives_s_files table, so the correct stemmer can be used for each document.

There are possible edge cases where a document's language changes or a document mixes languages, but we would just have to save the first language we detect.

@mejo- I updated the PR and added language detection in this commit: 7e969f2

added patrickschur/language-detection to detect the language of each document

added a language column to collectives_s_files to store the detected language per file

updated Stemmer to accept an optional $language parameter and cache stemmers per language

updated FileIndexer to detect and save language during indexing in collectives_s_files

detection during indexing is done on the first 2000 characters, defined as LANGUAGE_DETECTION_LIMIT to limit performance impact

updated FileSearcher to search and stem all languages present in the collective, so search and stemming work with multilingual collectives
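The per-language caching and the detection limit described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `StemmerCache`, `detectionSample()` and the factory callable are hypothetical names, and the toy stemmer stands in for a real one. Only the 2000-character limit and the `LANGUAGE_DETECTION_LIMIT` name come from the comment above.

```php
<?php
declare(strict_types=1);

// Name and value taken from the comment above; everything else is assumed.
const LANGUAGE_DETECTION_LIMIT = 2000;

/** Returns the leading slice of a document that would be handed to a language detector. */
function detectionSample(string $content, int $limit = LANGUAGE_DETECTION_LIMIT): string {
	return mb_substr($content, 0, $limit);
}

/**
 * Builds at most one stemmer per language code and reuses it afterwards.
 * $factory is an assumption: any callable that maps a language code to a
 * callable(string): string stemmer.
 */
final class StemmerCache {
	/** @var array<string, callable> */
	private array $stemmers = [];
	/** @var callable */
	private $factory;

	public function __construct(callable $factory, private string $defaultLanguage = 'en') {
		$this->factory = $factory;
	}

	public function stem(string $word, ?string $language = null): string {
		$language ??= $this->defaultLanguage;
		if (!isset($this->stemmers[$language])) {
			// First use of this language: build the stemmer once and keep it.
			$this->stemmers[$language] = ($this->factory)($language);
		}
		return ($this->stemmers[$language])($word);
	}
}

// Toy factory for illustration only: every "stemmer" just strips one trailing "s".
$cache = new StemmerCache(
	fn (string $language): callable =>
		fn (string $word): string => str_ends_with($word, 's') ? substr($word, 0, -1) : $word
);

echo $cache->stem('pages', 'en'), PHP_EOL; // page
```

Injecting the factory keeps the sketch independent of any particular stemming library; in a real setup it would wrap whichever stemmer the app ships with.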

mejo- · 2026-03-05T15:18:34Z

lib/Search/FileSearch/Db/SearchWordMapper.php

+
+		$hitCountParam = $qb->createNamedParameter($hitCount, IQueryBuilder::PARAM_INT);
+		$qb->update($this->tableName)
+			->set('num_hits', $qb->createFunction("num_hits - $hitCountParam"))


I wonder whether we want to safeguard against negative values here. Especially as the fields are unsigned integers, which means there's a risk of integer overflows, right?

Thanks for the feedback, updated it.

I'm a bit unsure whether that really is enough to safeguard against negative num_hits and num_files values. If I understand it correctly, trying to put negative values into unsigned table fields can result in integer overflows. What was your motivation to make the integer/bigint fields unsigned in the first place? That we don't expect negative values here?

How about this to safeguard?

$qb->update($this->tableName)
	->set('num_hits', $qb->func()->greatest($qb->createFunction("num_hits - $hitCountParam"), 0))
	->set('num_files', $qb->func()->greatest($qb->createFunction('num_files - 1'), 0))
	->where($qb->expr()->eq('circle_unique_id', $qb->createNamedParameter($circleUniqueId)))
	->andWhere($qb->expr()->eq('id', $qb->createNamedParameter($wordId)));

Yes exactly, as num_hits and num_files are counts that semantically should never be negative, unsigned enforces that at the DB level. Of course this can lead to integer overflows and, as you say, should be safeguarded.

My safeguard prevented the decrement update whenever num_hits would underflow. I think that could potentially leave stale data.

Wrapping it in a greatest() will still cause the overflow in MySQL/MariaDB, as it already happens in the subtraction, before the greatest() is evaluated. This can be tested with the following query, which will cause an overflow error:

UPDATE oc_collectives_s_words SET num_files = GREATEST(num_files - 2, 0) WHERE num_files = 1;

num_files should always be >= 1

I like the idea of just clamping the num_hits to 0 and num_files to 1 with a greatest() function, to be self-healing and cover all the data inconsistencies and edge cases we didn't think about now.

So maybe we should change the columns to signed and then use a greatest function, as unsigned is only enforced in MySQL/MariaDB anyway as far as I know. This seems safer, what do you think?

Sounds like a good plan 😊 Let's use signed columns for num_hits and num_files and clamp the values using the greatest() function 👌

I updated this in commit b9a3c57

All integer columns are now signed, for overflow safety and also to be consistent with the other collectives tables. The num_hits and num_files values are now clamped to 0 using greatest().

The deleteOrphanedWords() function already deletes words with num_hits OR num_files <= 0 and is called right after the decrement, so this should be good against 0 and negative values here anyways.
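The ordering problem discussed in this thread can be illustrated in plain PHP. The sketch below is only an analogy for the database behaviour, with hypothetical helper names: `unsignedSubtract()` mimics an unsigned column by rejecting a negative intermediate result, and `clampedDecrement()` mimics signed arithmetic wrapped in a clamp.

```php
<?php
declare(strict_types=1);

/**
 * Mimics an unsigned column: the subtraction itself is rejected when the
 * result would be negative, so an outer clamp never gets a chance to run.
 */
function unsignedSubtract(int $value, int $delta): int {
	if ($delta > $value) {
		throw new RangeException('unsigned value is out of range');
	}
	return $value - $delta;
}

/** Signed arithmetic plus a clamp, analogous to GREATEST(col - delta, floor) on a signed column. */
function clampedDecrement(int $value, int $delta, int $floor = 0): int {
	return max($value - $delta, $floor);
}

// The clamp around the "unsigned" subtraction is never reached.
try {
	$result = max(unsignedSubtract(1, 2), 0);
} catch (RangeException $e) {
	$result = null; // the failure happened inside the subtraction
}

// With signed arithmetic the intermediate -1 is allowed and then clamped.
$clamped = clampedDecrement(1, 2);
```

With signed values, the clamp also makes the decrement self-healing: an inconsistent counter ends up at the floor instead of failing the whole update.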

benjaminfrueh added the technical debt, 2. developing, and enhancement labels Nov 11, 2025

juliusknorr reviewed Nov 11, 2025

benjaminfrueh force-pushed the feat/doctrine-fulltext-search branch from 07cbae6 to d8a4b06 Compare November 11, 2025 22:28

github-actions bot added the feedback-requested label Nov 26, 2025

mejo- mentioned this pull request Feb 26, 2026

Allow to filter pages by mentioned user #2299

Open

benjaminfrueh added this to 📝 Productivity team Mar 5, 2026

github-project-automation bot moved this to 🧭 Planning evaluation (don't pick) in 📝 Productivity team Mar 5, 2026

benjaminfrueh moved this from 🧭 Planning evaluation (don't pick) to 🏗️ In progress in 📝 Productivity team Mar 5, 2026

mejo- reviewed Mar 5, 2026

mejo- force-pushed the feat/doctrine-fulltext-search branch from d8a4b06 to b26823e Compare March 8, 2026 18:42

benjaminfrueh force-pushed the feat/doctrine-fulltext-search branch from b26823e to 2ee83e2 Compare March 19, 2026 19:01

mejo- force-pushed the feat/doctrine-fulltext-search branch from 2ee83e2 to 46319ae Compare March 24, 2026 15:48

benjaminfrueh and others added 3 commits March 31, 2026 12:41

feat: implement doctrine-based fulltext search

bb4bca2

Signed-off-by: Benjamin Frueh <benjamin.frueh@gmail.com>

fix(PageController): search result property changed to file_id

fb7924d

Signed-off-by: Jonas <jonas@freesources.org>

test: migrate unified search tests from Cypress to Playwright

d576172

Signed-off-by: Jonas <jonas@freesources.org>

mejo- force-pushed the feat/doctrine-fulltext-search branch from 46319ae to bfd5805 Compare March 31, 2026 11:45

mejo- added 2 commits March 31, 2026 14:24

chore: remove dependency on pdo-sqlite

90c176e

No longer needed with new search backend. Signed-off-by: Jonas <jonas@freesources.org>

test(PageContentProviderTest): fix mocked search result

14cf7c2

Signed-off-by: Jonas <jonas@freesources.org>

mejo- force-pushed the feat/doctrine-fulltext-search branch from bfd5805 to 14cf7c2 Compare March 31, 2026 14:07

fix: make integer colums signed and clamp num_hits and num_files to 0

b9a3c57

Signed-off-by: Benjamin Frueh <benjamin.frueh@gmail.com>

benjaminfrueh force-pushed the feat/doctrine-fulltext-search branch from 55c2713 to 7e969f2 Compare April 1, 2026 15:26

feat: add language detection for fulltextsearch

3474638

Signed-off-by: Benjamin Frueh <benjamin.frueh@gmail.com>

benjaminfrueh force-pushed the feat/doctrine-fulltext-search branch from 7e969f2 to 3474638 Compare April 1, 2026 15:41

benjaminfrueh marked this pull request as ready for review April 1, 2026 15:47

benjaminfrueh requested review from max-nextcloud and silverkszlo as code owners April 1, 2026 15:47

chore: scope dependencies with php-scoper

83f65aa

Signed-off-by: Benjamin Frueh <benjamin.frueh@gmail.com>

benjaminfrueh force-pushed the feat/doctrine-fulltext-search branch from 6e9453d to 83f65aa Compare April 1, 2026 19:17
