feat: implement doctrine-based fulltext search #2118

Open
benjaminfrueh wants to merge 8 commits into main from
feat/doctrine-fulltext-search

Conversation

@benjaminfrueh
Contributor

@benjaminfrueh benjaminfrueh commented Nov 11, 2025

📝 Summary

Replaces TNTSearch with full-text search on the Nextcloud database, implemented using Doctrine.

Resolves #2050

Todo

  • Implement language detection in stemmer
  • Scope php-stemmer dependency using php-scoper
  • Remove documentation about extra tntsearch dependencies
  • Migrate e2e FTS tests to Playwright
  • Follow-up: add Arabic stemmer

🏁 Checklist

  • Code is properly formatted (npm run lint / npm run stylelint / composer run cs:check)
  • Sign-off message is added to all commits
  • Tests (unit, integration and/or end-to-end) passing and the changes are covered with tests
  • Documentation (README or documentation) has been updated or is not required

"symfony/string": "^6.0",
"symfony/translation-contracts": "^3.6",
"teamtnt/tntsearch": "^5.0"
"wamania/php-stemmer": "^4.0"
Member

We use this one in a few other apps: https://github.com/search?q=org%3Anextcloud+wamania%2Fphp-stemmer&type=code

It may be worth scoping the dependency to avoid conflicts between different versions across apps: https://arthur-schiwon.de/isolating-nextcloud-app-dependencies-php-scoper

Contributor Author

Added dependency scoping in commit 83f65aa

@benjaminfrueh benjaminfrueh force-pushed the feat/doctrine-fulltext-search branch from 07cbae6 to d8a4b06 Compare November 11, 2025 22:28
@github-actions
Contributor

Hello there,
Thank you so much for taking the time and effort to create a pull request to our Nextcloud project.

We hope that the review process is going smooth and is helpful for you. We want to ensure your pull request is reviewed to your satisfaction. If you have a moment, our community management team would very much appreciate your feedback on your experience with this PR review process.

Your feedback is valuable to us as we continuously strive to improve our community developer experience. Please take a moment to complete our short survey by clicking on the following link: https://cloud.nextcloud.com/apps/forms/s/i9Ago4EQRZ7TWxjfmeEpPkf6

Thank you for contributing to Nextcloud and we hope to hear from you soon!

(If you believe you should not receive this message, you can add yourself to the blocklist.)

@github-project-automation github-project-automation bot moved this to 🧭 Planning evaluation (don't pick) in 📝 Productivity team Mar 5, 2026
@benjaminfrueh benjaminfrueh moved this from 🧭 Planning evaluation (don't pick) to 🏗️ In progress in 📝 Productivity team Mar 5, 2026
Member

@mejo- mejo- left a comment

Really nice work, thanks so much @benjaminfrueh. I finally found time to go through the code changes, read up a bit on stemming, fuzzy searching, bigrams and other stuff I didn't know much about before 😆

The general approach of your implementation looks really clean and promising to me.

I have quite a few comments and questions and am curious what you think about them.


public function stem(string $word): string {
	if ($this->stemmer === null && $this->stemmingEnabled) {
		$language = $this->config->getSystemValue('default_language', 'en');
Member

I'm not sure the instance's default language is the best choice here. Doesn't this effectively mean that stemming only works for the instance's default language? I guess it would be much more powerful if the language of the indexed document were used instead, right?

Maybe there are simple algorithms to guess the language from the full document in the indexer and pass the detected language into the stemmer? This doesn't necessarily need to happen in this PR, but I'd still be curious about your thoughts.

Contributor Author

Per-document language detection seems like the only proper solution. It comes with a small performance cost during indexing, but that should be fine. There are language detection libraries, such as https://github.com/patrickschur/language-detection, that could be used. What do you think?

We should then store the language in the collectives_s_files table, so the correct stemmer can be used for each document.

There are possible edge cases where a document's language changes or a document mixes several languages, but we would just have to save the first one we detect.

Contributor Author

@benjaminfrueh benjaminfrueh Apr 1, 2026

@mejo- I updated the PR and added language detection in this commit: 7e969f2

  • added patrickschur/language-detection to detect the language of each document
  • added a language column to collectives_s_files to store the detected language per file
  • updated Stemmer to accept an optional $language parameter and cache stemmers per language
  • updated FileIndexer to detect and save language during indexing in collectives_s_files
  • detection during indexing is done on the first 2000 characters, defined as LANGUAGE_DETECTION_LIMIT to limit performance impact
  • updated FileSearcher to search and stem all languages present in the collective, so search and stemming works with multilingual collectives
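
The two ideas in the list above (detecting on a bounded prefix and caching one stemmer per language) can be sketched roughly as follows. This is an illustration only, not the code from commit 7e969f2: detectionSample(), StemmerCache and the toy stemmer factory are hypothetical names, and the real implementation presumably delegates to the detection and stemming libraries instead.

```php
<?php
declare(strict_types=1);

// Name and value taken from the comment above; everything else here is hypothetical.
const LANGUAGE_DETECTION_LIMIT = 2000;

/**
 * Returns the prefix of $content that would be handed to a language detector,
 * so detection cost stays bounded for large documents.
 */
function detectionSample(string $content, int $limit = LANGUAGE_DETECTION_LIMIT): string {
    return function_exists('mb_substr')
        ? mb_substr($content, 0, $limit)
        : substr($content, 0, $limit);
}

/**
 * Builds at most one stemmer per language code and reuses it afterwards.
 * The factory is injected; a real one would wrap the stemming library.
 */
final class StemmerCache {
    /** @var array<string, \Closure> stemmers keyed by language code */
    private array $stemmers = [];
    /** Number of stemmers created so far (only here to make the caching visible). */
    public int $created = 0;
    private \Closure $factory;

    public function __construct(\Closure $factory) {
        $this->factory = $factory;
    }

    public function stem(string $word, string $language): string {
        if (!isset($this->stemmers[$language])) {
            $this->stemmers[$language] = ($this->factory)($language);
            $this->created++;
        }
        return ($this->stemmers[$language])($word);
    }
}

// Toy stand-in factory: strips trailing "s" characters for "en", leaves other languages untouched.
$cache = new StemmerCache(function (string $language): \Closure {
    return $language === 'en'
        ? fn (string $w): string => rtrim($w, 's')
        : fn (string $w): string => $w;
});

echo $cache->stem('pages', 'en'), PHP_EOL; // prints "page"
```

Injecting the factory keeps the cache independent of any particular stemming library, which also makes it easy to test without the scoped dependency.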


$hitCountParam = $qb->createNamedParameter($hitCount, IQueryBuilder::PARAM_INT);
$qb->update($this->tableName)
	->set('num_hits', $qb->createFunction("num_hits - $hitCountParam"))
Member

I wonder whether we want to safeguard against negative values here, especially as the fields are unsigned integers, which means there's a risk of integer overflows, right?

Contributor Author

Thanks for the feedback, updated it.

Member

I'm a bit unsure whether that really is enough to safeguard against negative num_hits and num_files values. If I understand it correctly, trying to put negative values into unsigned table fields can result in integer overflows. What was your motivation for making the integer/bigint fields unsigned in the first place? That we don't expect negative values here?

How about this to safeguard?

$qb->update($this->tableName)
	->set('num_hits', $qb->func()->greatest($qb->createFunction("num_hits - $hitCountParam"), 0))
	->set('num_files', $qb->func()->greatest($qb->createFunction('num_files - 1'), 0))
	->where($qb->expr()->eq('circle_unique_id', $qb->createNamedParameter($circleUniqueId)))
	->andWhere($qb->expr()->eq('id', $qb->createNamedParameter($wordId)));

Contributor Author

@benjaminfrueh benjaminfrueh Mar 31, 2026

Yes, exactly: num_hits and num_files are counts that semantically should never be negative, and unsigned enforces that at the DB level. Of course this can lead to integer overflows and, as you say, should be safeguarded.

My safeguard prevented the decrement update if num_hits would underflow. I think that could potentially leave stale data.

Wrapping it in greatest() will still cause the overflow in MySQL/MariaDB, as it already happens in the subtraction, before greatest() is evaluated. This can be tested with the following query, which will cause an overflow error:

UPDATE oc_collectives_s_words SET num_files = GREATEST(num_files - 2, 0) WHERE num_files = 1;

num_files should always be >= 1

I like the idea of just clamping num_hits to 0 and num_files to 1 with a greatest() function, to be self-healing and cover any data inconsistencies and edge cases we haven't thought of yet.

So maybe we should change the columns to signed and then use a greatest() function, as unsigned is only enforced in MySQL/MariaDB anyway, as far as I know. This seems safer. What do you think?
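
To make the evaluation-order argument concrete, here is a small plain-PHP simulation. It is a sketch under assumptions, not database code: unsignedSubtract(), clampUnsigned() and clampSigned() are hypothetical helpers that only imitate how MySQL/MariaDB behave by default (without the NO_UNSIGNED_SUBTRACTION SQL mode) when a subtraction on an unsigned column would go below zero.

```php
<?php
declare(strict_types=1);

/**
 * Imitates "col - n" on an UNSIGNED column: a negative result is rejected
 * with an out-of-range error instead of being stored.
 */
function unsignedSubtract(int $value, int $amount): int {
    $result = $value - $amount;
    if ($result < 0) {
        throw new \RangeException('BIGINT UNSIGNED value is out of range');
    }
    return $result;
}

/**
 * Imitates GREATEST(col - n, floor) on an unsigned column: the subtraction
 * is evaluated first, so the clamp never gets a chance to run.
 */
function clampUnsigned(int $value, int $amount, int $floor = 0): int {
    return max(unsignedSubtract($value, $amount), $floor);
}

/**
 * Imitates GREATEST(col - n, floor) on a signed column: the intermediate
 * negative value is legal and is then clamped to the floor.
 */
function clampSigned(int $value, int $amount, int $floor = 0): int {
    return max($value - $amount, $floor);
}

echo clampSigned(1, 2), PHP_EOL; // prints 0

try {
    clampUnsigned(1, 2);
} catch (\RangeException $e) {
    echo $e->getMessage(), PHP_EOL; // prints "BIGINT UNSIGNED value is out of range"
}
```

With signed columns the clamp makes the decrement self-healing; with unsigned columns the same expression fails before the clamp applies.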

Member

Sounds like a good plan 😊 Let's use signed columns for num_hits and num_files and clamp the values using the greatest() function 👌

Contributor Author

@benjaminfrueh benjaminfrueh Apr 1, 2026

I updated this in commit b9a3c57

All integer columns are now signed, for overflow safety and for consistency with the other collectives tables. num_hits and num_files are now clamped to 0 using greatest().

The deleteOrphanedWords() function already deletes words with num_hits OR num_files <= 0 and is called right after the decrement, so this should guard against 0 and negative values here anyway.

@mejo- mejo- force-pushed the feat/doctrine-fulltext-search branch from d8a4b06 to b26823e Compare March 8, 2026 18:42
@benjaminfrueh benjaminfrueh force-pushed the feat/doctrine-fulltext-search branch from b26823e to 2ee83e2 Compare March 19, 2026 19:01
@mejo- mejo- force-pushed the feat/doctrine-fulltext-search branch from 2ee83e2 to 46319ae Compare March 24, 2026 15:48
benjaminfrueh and others added 3 commits March 31, 2026 12:41
Signed-off-by: Benjamin Frueh <benjamin.frueh@gmail.com>
Signed-off-by: Jonas <jonas@freesources.org>
Signed-off-by: Jonas <jonas@freesources.org>
@mejo- mejo- force-pushed the feat/doctrine-fulltext-search branch from 46319ae to bfd5805 Compare March 31, 2026 11:45
mejo- added 2 commits March 31, 2026 14:24
No longer needed with new search backend.

Signed-off-by: Jonas <jonas@freesources.org>
Signed-off-by: Jonas <jonas@freesources.org>
@mejo- mejo- force-pushed the feat/doctrine-fulltext-search branch from bfd5805 to 14cf7c2 Compare March 31, 2026 14:07
Signed-off-by: Benjamin Frueh <benjamin.frueh@gmail.com>
@benjaminfrueh benjaminfrueh force-pushed the feat/doctrine-fulltext-search branch from 55c2713 to 7e969f2 Compare April 1, 2026 15:26
Signed-off-by: Benjamin Frueh <benjamin.frueh@gmail.com>
@benjaminfrueh benjaminfrueh force-pushed the feat/doctrine-fulltext-search branch from 7e969f2 to 3474638 Compare April 1, 2026 15:41
@benjaminfrueh benjaminfrueh marked this pull request as ready for review April 1, 2026 15:47
Signed-off-by: Benjamin Frueh <benjamin.frueh@gmail.com>
@benjaminfrueh benjaminfrueh force-pushed the feat/doctrine-fulltext-search branch from 6e9453d to 83f65aa Compare April 1, 2026 19:17

Projects

Status: 🏗️ In progress

Development

Successfully merging this pull request may close these issues.

feat: Use Nextcloud database for TNTSearch

3 participants