feat(search): implement FST5 w/ sqlite for faster and better searching #6839

perfectra1n · 2025-08-30T20:40:20Z

No description provided.

feat(search): don't limit the number of blobs to put in virtual tables

fix(search): improve FTS triggers to handle all SQL operations correctly

The root cause of FTS index issues during import was that database triggers weren't properly handling all SQL operations, particularly upsert operations (INSERT ... ON CONFLICT ... DO UPDATE) that are commonly used during imports.

Key improvements:
- Fixed INSERT trigger to handle INSERT OR REPLACE operations
- Updated UPDATE trigger to fire on ANY change (not just specific columns)
- Improved blob triggers to use INSERT OR REPLACE for atomic updates
- Added proper handling for notes created before their blobs (import scenario)
- Added triggers for protection state changes
- All triggers now use LEFT JOIN to handle missing blobs gracefully

This ensures the FTS index stays synchronized even when:
- Entity events are disabled during import
- Notes are re-imported (upsert operations)
- Blobs are deduplicated across notes
- Notes are created before their content blobs

The solution works entirely at the database level through triggers, removing the need for application-level workarounds.

fix(search): consolidate FTS trigger fixes into migration 234

- Merged improved trigger logic from migration 235 into 234
- Deleted unnecessary migration 235 since DB version is still 234
- Ensures triggers handle all SQL operations (INSERT OR REPLACE, upserts)
- Fixes FTS indexing for imported notes by handling missing blobs
- Schema.sql and migration 234 now have identical trigger implementations

apps/server/src/services/search/fts_search.ts

+
+            // Build snippet extraction if requested
+            const snippetSelect = includeSnippets 
+                ? `, snippet(notes_fts, ${FTS_CONFIG.SNIPPET_COLUMN_CONTENT}, '${highlightTag}', '${highlightTag.replace('<', '</')}', '...', ${snippetLength}) as snippet`


To fix the problem, make sure the closing tag derived from highlightTag is always well formed. A string-pattern replace('<', '</') only touches the first <, which happens to work for simple HTML tags like <b>, but the intent is clearer and safer if the closing tag is built explicitly by inserting a / after the initial <. This can be done by checking that highlightTag starts with < and generating the closing tag as '</' + highlightTag.slice(1). A global replacement, replace(/</g, '</'), is an alternative if every < should be rewritten. In our context, to minimize code change and avoid accidental misbehavior, construct the closing tag explicitly.

All changes are within the file apps/server/src/services/search/fts_search.ts, specifically line 256.
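The explicit construction described above can be sketched as follows. This is a minimal illustration, not code from the PR: closingTagFor is a hypothetical helper name, and it assumes highlightTag is a bare opening tag without attributes (for a tag such as <span class="x"> the result would carry the attributes into the closing tag, which is not valid HTML).

```typescript
// Hypothetical helper (not part of the PR): derive the closing tag for a
// simple, attribute-free opening tag such as "<b>" or "<mark>".
function closingTagFor(openTag: string): string {
    // Reject anything that is not an opening tag, including "</b>".
    if (openTag.charAt(0) !== "<" || openTag.charAt(1) === "/") {
        throw new Error("Expected an opening tag, got: " + openTag);
    }
    // Insert "/" directly after the leading "<"; the rest is left untouched.
    return "</" + openTag.slice(1);
}

const closeBold = closingTagFor("<b>");
const closeMark = closingTagFor("<mark>");
console.log(closeBold, closeMark);
```

Rejecting unexpected input up front keeps a malformed highlight tag from silently ending up in the generated snippet() expression.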

apps/server/src/migrations/0235__sqlite_native_search.ts

+    if (!html) return '';
+
+    // Remove script and style content entirely first
+    let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');


The most effective way to address this problem, without altering the external behavior or adding extra dependencies, is to replace the script removal regex in a loop: repeatedly remove all <script> tags until none remain. This ensures that, even if one replacement exposes a new script tag (due to overlapping or nested tags), all script blocks are completely removed. We should use a do...while loop that keeps replacing script tags as long as they are present. This change should be applied in the stripHtmlTags function, specifically on the line that currently executes the single-pass .replace() for script tags (line 75). No further imports or method changes are needed, and no logic elsewhere is affected.
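A short sketch of why a single pass is not enough and how the loop helps. The two regular expressions are the ones shown in the PR; removeUntilStable and the sample input are made up for illustration and are not part of the codebase.

```typescript
// Patterns copied from the PR's stripHtmlTags.
const SCRIPT_BLOCK = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;
const STYLE_BLOCK = /<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi;

// Hypothetical helper: re-apply the removals until the string stops changing.
function removeUntilStable(input: string, patterns: RegExp[]): string {
    let text = input;
    let previous: string;
    do {
        previous = text;
        for (const pattern of patterns) {
            text = text.replace(pattern, "");
        }
    } while (text !== previous);
    return text;
}

// Removing the inner block splices "<scr" and "ipt>" into a new <script> tag.
const nested = "<scr<script>x</script>ipt>alert(1)</script>done";

const singlePass = nested.replace(SCRIPT_BLOCK, "");
const stable = removeUntilStable(nested, [SCRIPT_BLOCK, STYLE_BLOCK]);

console.log(singlePass); // the re-exposed script block is still present
console.log(stable);     // only the trailing text remains
```

The loop ends because every pass that changes the string makes it shorter.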

apps/server/src/migrations/0235__sqlite_native_search.ts

+    if (!html) return '';
+
+    // Remove script and style content entirely first
+    let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');


The best fix is to replace the custom regular expressions meant to strip script and style content with a well-tested HTML parser/sanitizer. For Node.js/TypeScript, the sanitize-html and cheerio packages are both popular and robust. Since only one file is shown (apps/server/src/migrations/0235__sqlite_native_search.ts), and changes are allowed only there, we'll use sanitize-html, configured to remove all script and style content and strip all tags, yielding only text content.

Modification steps:

Import sanitize-html at the top of the file.

Replace lines 74-92 in stripHtmlTags with a call to sanitize-html configured to remove all tags and output plain text. This handles script/style reliably.

The call to stripTags from utils will be redundant and can be removed, since sanitize-html replaces its functionality with better coverage.

HTML entity decoding (&nbsp;, etc.) can still be applied after sanitization if absolutely needed, but sanitize-html already decodes entities unless configured otherwise.

Only lines within the shown file will be touched.

apps/server/src/migrations/0235__sqlite_native_search.ts

+
+    // Remove script and style content entirely first
+    let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
+    text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');


The best fix is to ensure that all occurrences of <style>...</style> blocks are fully stripped, even if multiple or nested/malformed instances occur (as attackers might trigger incomplete removal). To do this without changing existing functionality and with minimal impact, repeatedly apply the regex replacements for script/style blocks until the string no longer changes. This ensures that any re-exposed tags are also removed. The change should only affect the two regex replace calls for <script> and <style> in stripHtmlTags() in apps/server/src/migrations/0235__sqlite_native_search.ts. No extra dependencies are needed, only a small refactor of these lines into a loop.

apps/server/src/migrations/0235__sqlite_native_search.ts

+    text = text.replace(/&nbsp;/g, ' ');
+    text = text.replace(/&lt;/g, '<');
+    text = text.replace(/&gt;/g, '>');
+    text = text.replace(/&amp;/g, '&');


To fix the double-unescaping bug, the entity replacements must run in the correct order: decode every entity other than &amp; first, and decode &amp; last. In this code, that means re-ordering lines 83–87 so that .replace(/&amp;/g, '&') comes after the other entity decoding steps.

Make this change in the stripHtmlTags function in apps/server/src/migrations/0235__sqlite_native_search.ts:

Move text = text.replace(/&amp;/g, '&'); to after the replacements for &quot; and &#39;.
No imports or definitions are required; only the order of these replacement lines changes. No additional libraries are necessary.
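To make the ordering issue concrete, the sketch below compares the two orders on a double-escaped input. Both function names are hypothetical; the entity set is the one visible in the PR's stripHtmlTags.

```typescript
// Order recommended above: "&amp;" is decoded last.
function decodeEntities(text: string): string {
    return text
        .replace(/&nbsp;/g, " ")
        .replace(/&lt;/g, "<")
        .replace(/&gt;/g, ">")
        .replace(/&quot;/g, '"')
        .replace(/&#39;/g, "'")
        .replace(/&amp;/g, "&");
}

// Order flagged by the alert: "&amp;" is decoded before "&quot;" and "&#39;".
function decodeEntitiesAmpFirst(text: string): string {
    return text
        .replace(/&nbsp;/g, " ")
        .replace(/&lt;/g, "<")
        .replace(/&gt;/g, ">")
        .replace(/&amp;/g, "&")
        .replace(/&quot;/g, '"')
        .replace(/&#39;/g, "'");
}

// The source text literally contained "&quot;", so it was stored as "&amp;quot;".
const doubleEscaped = "&amp;quot;hi&amp;quot; &lt;b&gt;";

const decodedOnce = decodeEntities(doubleEscaped);
const decodedTwice = decodeEntitiesAmpFirst(doubleEscaped);

console.log(decodedOnce);  // one level of escaping removed
console.log(decodedTwice); // "&amp;quot;" collapsed all the way to a quote character
```

With &amp; last, each entity in the input is decoded exactly once, so text that was deliberately escaped twice keeps its inner escape.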

apps/server/src/services/search/sqlite_functions.ts

+
+            // First remove script and style content entirely (including the tags)
+            // This needs to happen before stripTags to remove the content
+            text = text.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');


The best way to fix this is to ensure repeated removal is performed until all script/style tags are fully stripped, regardless of how input mutates during replacement. This can be achieved by using a loop to apply the .replace() until there are no more replacements. Alternatively, use a robust sanitization library such as sanitize-html, but since we're limited to shown code and not introducing major codebase changes, repeated replacement is the most direct and non-breaking fix. Concretely, modify the lines where <script> and <style> tags are removed to repeatedly apply the regex replacements until a fixpoint is reached (nothing more is removed). This must all be done within apps/server/src/services/search/sqlite_functions.ts inside the stripHtml method.

No new imports are needed. Only the removal logic for script/style content must change.

apps/server/src/services/search/sqlite_functions.ts

+            // Decode common HTML entities
+            text = text.replace(/&lt;/g, '<');
+            text = text.replace(/&gt;/g, '>');
+            text = text.replace(/&amp;/g, '&');


To fix this issue, reorder the entity unescaping code so that the ampersand entity (&amp;) is decoded last, after all other decodings. This avoids the double-unescape problem: if &amp;quot; is present, it remains &amp;quot; while the other entities are decoded, until the final pass, where &amp; is decoded to & and the result is &quot;.

Specifically, in apps/server/src/services/search/sqlite_functions.ts, lines 418–424 contain a sequence of .replace() calls for decoding HTML entities. Move the line text = text.replace(/&amp;/g, '&'); (currently line 420) so that it comes immediately after all the other .replace() calls, which handle &lt;, &gt;, &quot;, the apostrophe entities, and &nbsp;. The order should be: unescape all non-ampersand entities first, then unescape the ampersand.

No additional imports or logic are needed.

apps/server/src/services/search/sqlite_search_utils.ts

+    }
+
+    // Remove script and style content
+    let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');


The best fix is to repeatedly apply the regular expressions to remove <script> and <style> tags until no further matches are found, ensuring that malicious input cannot survive through incomplete multi-character replacements. This will guarantee that all possible instances, including those emerging after each replacement, are eliminated. The fix should be limited to the processHtmlContent function in this file. No code outside that function will be edited. No new dependencies are strictly required for this fix, but if the project would prefer to use a well-tested HTML sanitization library (like sanitize-html), that would be even better; however, per instructions, I will stick to the repeated replacement method inside the code region provided.

apps/server/src/services/search/sqlite_search_utils.ts

+    }
+
+    // Remove script and style content
+    let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');


The best way to fix this is to avoid using regex for HTML sanitization and instead use a well-known HTML parser or sanitization library. If you must use regex for a very specific use-case, at least make your regex robust against browser quirks. In this context, since you're already using a utility function called stripTags, try to utilize a well-established library such as sanitize-html or dompurify via Node, but per the instruction only use code you've been shown. Therefore, update the regex on line 119 to more robustly match closing <script> tags with optional whitespace and attributes (so it matches tags like </script > or </script foo="bar">). A good improvement is to match </script\b[^>]*> (case-insensitive).

Edit line 119 in apps/server/src/services/search/sqlite_search_utils.ts:

Replace the regex <script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script> with <script\b[^<]*(?:(?!<\/script[\s>])<[^<]*)*<\/script[\s\S]*?>, which is more robust against browser quirks and will match closing </script>, </script >, and </script foo="bar">.

Similarly, update the <style> removal regex on line 120.

No new imports or method definitions are required, unless you decide to use an external sanitization library (which would require a dependency), but we restrict ourselves to just improving the regex as per instructions.
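The difference between the two patterns can be checked with a few sample inputs. STRICT is the regex currently in the PR and RELAXED is the replacement quoted above; the constant names and the samples are assumptions made for this sketch.

```typescript
// Pattern currently in the PR: the closing tag must be exactly "</script>".
const STRICT = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;
// Replacement quoted in the suggestion: tolerates whitespace and attributes
// between "</script" and the closing ">".
const RELAXED = /<script\b[^<]*(?:(?!<\/script[\s>])<[^<]*)*<\/script[\s\S]*?>/gi;

const samples: string[] = [
    "<script>a()</script>ok",
    "<script>a()</script >ok",
    '<script>a()</script foo="bar">ok',
];

const strictResults = samples.map((s) => s.replace(STRICT, ""));
const relaxedResults = samples.map((s) => s.replace(RELAXED, ""));

samples.forEach((s, i) => {
    console.log(JSON.stringify(s), "strict:", JSON.stringify(strictResults[i]),
        "relaxed:", JSON.stringify(relaxedResults[i]));
});
```

Only the first sample is stripped by the strict pattern, while the relaxed pattern strips all three. The relaxed pattern is also more permissive than it needs to be: the lazy [\s\S]*? accepts any characters up to the next >. That is one reason the suggestion prefers a real HTML sanitizer where a dependency is acceptable.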

apps/server/src/services/search/sqlite_search_utils.ts

+
+    // Remove script and style content
+    let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
+    text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');


General Fix:
Ensure that all <style> tags and their contents are completely removed, even if new matches arise after each replacement. One straightforward solution is to reapply the replacement repeatedly until no further changes occur.

Best Specific Fix:
Modify the code so that the .replace(<style...>) (and likewise the <script...> removal) are applied inside a loop that continues until the string no longer changes. This can be implemented with a do...while loop for each pattern.

Region to Change:
Lines 119–120 in processHtmlContent, inside apps/server/src/services/search/sqlite_search_utils.ts.

Required Methods/Imports:
No new methods or imports are needed; only local logic changes.

apps/server/src/services/search/sqlite_search_utils.ts

+    text = text.replace(/&nbsp;/g, ' ');
+    text = text.replace(/&lt;/g, '<');
+    text = text.replace(/&gt;/g, '>');
+    text = text.replace(/&amp;/g, '&');


The underlying problem is the ordering of HTML entity replacements in the "unescape" (decode) section of the processHtmlContent function. To fix this, reorder the string replacement statements so that text.replace(/&amp;/g, '&') is performed after all other entity replacements. This ensures that a double-escaped entity such as &amp;quot; is decoded by exactly one level, to &quot; (its intended single-escape form), instead of being unescaped twice into a literal ". Only modify the order of these lines, 126 to 132, so that &amp; is last.

No new imports or methods are needed, as the code is doing simple string replacements.

dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Aug 30, 2025

github-advanced-security bot found potential problems Aug 30, 2025

perfectra1n added 2 commits August 30, 2025 20:48

feat(search): also fix tests for new fts functionality

21aaec2

feat(search): try to get fts search to work in large environments

053f722

github-advanced-security bot found potential problems Aug 31, 2025

apps/server/src/services/search/fts_search.ts Fixed

perfectra1n added 3 commits August 30, 2025 22:30

feat(search): try to decrease complexity

5b79e0d

feat(search): try to deal with huge dbs, might need to squash later

37d0136

feat(search): further improve fts search

7c5553b

perfectra1n marked this pull request as draft September 2, 2025 05:08

perfectra1n added 7 commits September 1, 2025 22:29

feat(search): I honestly have no idea what I'm doing

b09a2c3

Revert "feat(search): I honestly have no idea what I'm doing"

8572f82

This reverts commit b09a2c3.

Revert "feat(search): further improve fts search"

f529ddc

This reverts commit 7c5553b.

Revert "feat(search): try to deal with huge dbs, might need to squash…

0afb8a1

… later" This reverts commit 37d0136.

Revert "feat(search): try to decrease complexity"

06b2d71

This reverts commit 5b79e0d.

Revert "feat(search): try to get fts search to work in large environm…

d074841

…ents" This reverts commit 053f722.

feat(search): try a ground-up sqlite search approach

58c2252

github-advanced-security bot found potential problems Sep 3, 2025

@@ -72,7 +72,12 @@
                 if (!html) return '';
                 // Remove script and style content entirely first
-                let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
+                let text = html;
+                // Remove all <script>...</script> blocks, repeatedly, to prevent incomplete matching
+                do {
+                    var oldText = text;
+                    text = text.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
+                } while (text !== oldText);
                 text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
                 // Use utils stripTags for consistency

@@ -14,8 +14,9 @@
             import sql from "../services/sql.js";
             import log from "../services/log.js";
-            import { normalize as utilsNormalize, stripTags } from "../services/utils.js";
+            import { normalize as utilsNormalize } from "../services/utils.js";
             import { getSqliteFunctionsService } from "../services/search/sqlite_functions.js";
+            import sanitizeHtml from "sanitize-html";
             /**
              * Uses the existing normalize function from utils.ts for consistency
@@ -71,24 +71,23 @@
             function stripHtmlTags(html: string): string {
                 if (!html) return '';
-                // Remove script and style content entirely first
-                let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
-                text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
-                // Use utils stripTags for consistency
-                text = stripTags(text);
-                // Decode HTML entities
+                // Use sanitize-html for robust HTML tag and script/style stripping
+                let text = sanitizeHtml(html, {
+                    allowedTags: [],       // remove all tags (including script/style)
+                    allowedAttributes: {}, // remove all attributes
+                    textFilter: function(text) {
+                        return text;
+                    }
+                });
+                // Normalize whitespace and decode common HTML entities if needed
                 text = text.replace(/&nbsp;/g, ' ');
                 text = text.replace(/&lt;/g, '<');
                 text = text.replace(/&gt;/g, '>');
                 text = text.replace(/&amp;/g, '&');
                 text = text.replace(/&quot;/g, '"');
                 text = text.replace(/&#39;/g, "'");
-                // Normalize whitespace
                 text = text.replace(/\s+/g, ' ').trim();
                 return text;
             }

@@ -4,7 +4,8 @@
               "description": "The server-side component of TriliumNext, which exposes the client via the web, allows for sync and provides a REST API for both internal and external use.",
               "private": true,
               "dependencies": {
-                "better-sqlite3": "12.2.0"
+                "better-sqlite3": "12.2.0",
+                "sanitize-html": "^2.17.0"
               },
               "devDependencies": {
                 "@electron/remote": "2.1.3",

Package	Version	Security advisories
sanitize-html (npm)	2.17.0	None

@@ -72,8 +72,14 @@
                 if (!html) return '';
                 // Remove script and style content entirely first
-                let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
-                text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
+                let text = html;
+                // Repeatedly remove <script> and <style> blocks until there are none left (to prevent incomplete multi-character sanitization)
+                let prev;
+                do {
+                    prev = text;
+                    text = text.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
+                    text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
+                } while (text !== prev);
                 // Use utils stripTags for consistency
                 text = stripTags(text);

@@ -115,9 +115,19 @@
                     return '';
                 }
-                // Remove script and style content
-                let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
-                text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
+                // Remove script and style content completely, applying multiple times if necessary
+                let text = html;
+                let prevText;
+                // Remove all <script>...</script> tags
+                do {
+                    prevText = text;
+                    text = text.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
+                } while (text !== prevText);
+                // Remove all <style>...</style> tags
+                do {
+                    prevText = text;
+                    text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
+                } while (text !== prevText);
                 // Strip remaining tags
                 text = stripTags(text);
