
perfectra1n
Member

No description provided.

feat(search): don't limit the number of blobs to put in virtual tables

fix(search): improve FTS triggers to handle all SQL operations correctly

The root cause of FTS index issues during import was that database triggers
weren't properly handling all SQL operations, particularly upsert operations
(INSERT ... ON CONFLICT ... DO UPDATE) that are commonly used during imports.

Key improvements:
- Fixed INSERT trigger to handle INSERT OR REPLACE operations
- Updated UPDATE trigger to fire on ANY change (not just specific columns)
- Improved blob triggers to use INSERT OR REPLACE for atomic updates
- Added proper handling for notes created before their blobs (import scenario)
- Added triggers for protection state changes
- All triggers now use LEFT JOIN to handle missing blobs gracefully

This ensures the FTS index stays synchronized even when:
- Entity events are disabled during import
- Notes are re-imported (upsert operations)
- Blobs are deduplicated across notes
- Notes are created before their content blobs

The solution works entirely at the database level through triggers,
removing the need for application-level workarounds.

fix(search): consolidate FTS trigger fixes into migration 234

- Merged improved trigger logic from migration 235 into 234
- Deleted unnecessary migration 235 since DB version is still 234
- Ensures triggers handle all SQL operations (INSERT OR REPLACE, upserts)
- Fixes FTS indexing for imported notes by handling missing blobs
- Schema.sql and migration 234 now have identical trigger implementations
@dosubot (bot) added the size:XXL label ("This PR changes 1000+ lines, ignoring generated files.") on Aug 30, 2025

// Build snippet extraction if requested
const snippetSelect = includeSnippets
? `, snippet(notes_fts, ${FTS_CONFIG.SNIPPET_COLUMN_CONTENT}, '${highlightTag}', '${highlightTag.replace('<', '</')}', '...', ${snippetLength}) as snippet`

Check failure

Code scanning / CodeQL

Incomplete string escaping or encoding High

This replaces only the first occurrence of '<'.

Copilot Autofix

AI 18 days ago

To fix the problem, make sure the closing tag is derived from highlightTag reliably. String.prototype.replace with a string pattern replaces only the first occurrence, so for a simple tag such as <b>, replacing the first < with </ happens to produce the correct result. The intent is not obvious, though, and a value containing more than one < would yield a malformed closing tag. It is safer and clearer to construct the closing tag explicitly by inserting a / after the initial <: check that highlightTag starts with < and build the closing tag as '</' + highlightTag.slice(1). Alternatively, replace(/</g, '</') performs a global replacement. To minimize the code change, the patch below anchors the replacement to the start of the string with replace(/^</, '</'), which rewrites only a leading <.

All changes are within the file apps/server/src/services/search/fts_search.ts, specifically line 256.
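
The difference between these replacement styles can be sketched as follows. This is an illustration only, not code from fts_search.ts: closingTagFor is a hypothetical helper name, and it assumes highlightTag is a single opening tag such as <b>.

```typescript
// Hypothetical helper (not part of the PR): derive the closing tag from a
// single opening tag by extracting the tag name instead of patching the string.
function closingTagFor(openTag: string): string {
    const match = /^<([A-Za-z][A-Za-z0-9]*)\b[^<>]*>$/.exec(openTag);
    if (!match) {
        throw new Error(`Expected a single opening tag, got: ${openTag}`);
    }
    return `</${match[1]}>`;
}

// A string pattern replaces only the first occurrence of '<'.
const firstOnly = "<b><i>".replace("<", "</");

// A global regex replaces every occurrence of '<'.
const everyOccurrence = "<b><i>".replace(/</g, "</");

// An anchored regex rewrites only a leading '<', as in the suggested patch.
const anchored = "<b>".replace(/^</, "</");

// Extracting the tag name also drops attributes from the closing tag.
const fromHelper = closingTagFor('<span class="hl">');

console.log({ firstOnly, everyOccurrence, anchored, fromHelper });
```

Extracting the tag name is stricter than either replace call: attributes do not leak into the closing tag, and input that is not a single opening tag is rejected rather than passed through.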


Suggested changeset 1
apps/server/src/services/search/fts_search.ts

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/server/src/services/search/fts_search.ts b/apps/server/src/services/search/fts_search.ts
--- a/apps/server/src/services/search/fts_search.ts
+++ b/apps/server/src/services/search/fts_search.ts
@@ -253,7 +253,7 @@
 
             // Build snippet extraction if requested
             const snippetSelect = includeSnippets 
-                ? `, snippet(notes_fts, ${FTS_CONFIG.SNIPPET_COLUMN_CONTENT}, '${highlightTag}', '${highlightTag.replace('<', '</')}', '...', ${snippetLength}) as snippet`
+                ? `, snippet(notes_fts, ${FTS_CONFIG.SNIPPET_COLUMN_CONTENT}, '${highlightTag}', '${highlightTag.replace(/^</, '</')}', '...', ${snippetLength}) as snippet`
                 : '';
 
             const query = `
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
@perfectra1n marked this pull request as draft on September 2, 2025, 05:08
if (!html) return '';

// Remove script and style content entirely first
let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');

Check failure

Code scanning / CodeQL

Incomplete multi-character sanitization High

This string may still contain <script, which may cause an HTML element injection vulnerability.

Copilot Autofix

AI 15 days ago

The most effective way to address this problem, without altering the external behavior or adding extra dependencies, is to replace the script removal regex in a loop: repeatedly remove all <script> tags until none remain. This ensures that, even if one replacement exposes a new script tag (due to overlapping or nested tags), all script blocks are completely removed. We should use a do...while loop that keeps replacing script tags as long as they are present. This change should be applied in the stripHtmlTags function, specifically on the line that currently executes the single-pass .replace() for script tags (line 75). No further imports or method changes are needed, and no logic elsewhere is affected.
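
The effect described above can be reproduced with a small sketch. The regular expression is the one flagged by CodeQL; the function names and the crafted input are hypothetical and chosen only for illustration.

```typescript
// Same pattern as the one flagged in stripHtmlTags.
const SCRIPT_BLOCK = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Single pass: removing one block can splice a new "<script" together.
function removeScriptsOnce(html: string): string {
    return html.replace(SCRIPT_BLOCK, "");
}

// Repeat the replacement until the string stops changing (a fixpoint).
function removeScriptsUntilStable(html: string): string {
    let text = html;
    for (;;) {
        const next = text.replace(SCRIPT_BLOCK, "");
        if (next === text) {
            return text;
        }
        text = next;
    }
}

// Removing the inner "<script></script>" joins "<scr" and "ipt>".
const crafted = "<scr<script></script>ipt>alert(1)</script>";

console.log(removeScriptsOnce(crafted));        // a script block survives
console.log(removeScriptsUntilStable(crafted)); // nothing survives
```

The loop terminates because every successful replacement shortens the string.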


Suggested changeset 1
apps/server/src/migrations/0235__sqlite_native_search.ts

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/server/src/migrations/0235__sqlite_native_search.ts b/apps/server/src/migrations/0235__sqlite_native_search.ts
--- a/apps/server/src/migrations/0235__sqlite_native_search.ts
+++ b/apps/server/src/migrations/0235__sqlite_native_search.ts
@@ -72,7 +72,12 @@
     if (!html) return '';
     
     // Remove script and style content entirely first
-    let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
+    let text = html;
+    // Remove all <script>...</script> blocks, repeatedly, to prevent incomplete matching
+    do {
+        var oldText = text;
+        text = text.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
+    } while (text !== oldText);
     text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
     
     // Use utils stripTags for consistency
EOF
if (!html) return '';

// Remove script and style content entirely first
let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');

Check failure

Code scanning / CodeQL

Bad HTML filtering regexp High

This regular expression does not match script end tags like </script >.

Copilot Autofix

AI 15 days ago

The best fix is to replace the custom regular expressions that strip script and style content with a well-tested HTML parser or sanitizer. For Node.js/TypeScript, the sanitize-html and cheerio packages are both popular and robust. Since only one file is shown (apps/server/src/migrations/0235__sqlite_native_search.ts) and changes are allowed only there, we'll use sanitize-html, configured to remove all script and style content and strip all tags, leaving only the text content.

Modification steps:

  1. Import sanitize-html at the top of the file.
  2. Replace lines 74-92 in stripHtmlTags with a call to sanitize-html configured to remove all tags and output plain text. This handles script/style reliably.
  3. The call to stripTags from utils will be redundant and can be removed, since sanitize-html replaces its functionality with better coverage.
  4. HTML entity decoding (&nbsp;, etc.) can still be applied after sanitization if absolutely needed, but sanitize-html already decodes entities unless configured otherwise.

Only lines within the shown file will be touched.


Suggested changeset 2
apps/server/src/migrations/0235__sqlite_native_search.ts

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/server/src/migrations/0235__sqlite_native_search.ts b/apps/server/src/migrations/0235__sqlite_native_search.ts
--- a/apps/server/src/migrations/0235__sqlite_native_search.ts
+++ b/apps/server/src/migrations/0235__sqlite_native_search.ts
@@ -14,8 +14,9 @@
 
 import sql from "../services/sql.js";
 import log from "../services/log.js";
-import { normalize as utilsNormalize, stripTags } from "../services/utils.js";
+import { normalize as utilsNormalize } from "../services/utils.js";
 import { getSqliteFunctionsService } from "../services/search/sqlite_functions.js";
+import sanitizeHtml from "sanitize-html";
 
 /**
  * Uses the existing normalize function from utils.ts for consistency
@@ -71,24 +71,23 @@
 function stripHtmlTags(html: string): string {
     if (!html) return '';
     
-    // Remove script and style content entirely first
-    let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
-    text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
-    
-    // Use utils stripTags for consistency
-    text = stripTags(text);
-    
-    // Decode HTML entities
+    // Use sanitize-html for robust HTML tag and script/style stripping
+    let text = sanitizeHtml(html, {
+        allowedTags: [],       // remove all tags (including script/style)
+        allowedAttributes: {}, // remove all attributes
+        textFilter: function(text) {
+            return text;
+        }
+    });
+
+    // Normalize whitespace and decode common HTML entities if needed
     text = text.replace(/&nbsp;/g, ' ');
     text = text.replace(/&lt;/g, '<');
     text = text.replace(/&gt;/g, '>');
     text = text.replace(/&amp;/g, '&');
     text = text.replace(/&quot;/g, '"');
     text = text.replace(/&#39;/g, "'");
-    
-    // Normalize whitespace
     text = text.replace(/\s+/g, ' ').trim();
-    
     return text;
 }
 
EOF
apps/server/package.json
Outside changed files

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/server/package.json b/apps/server/package.json
--- a/apps/server/package.json
+++ b/apps/server/package.json
@@ -4,7 +4,8 @@
   "description": "The server-side component of TriliumNext, which exposes the client via the web, allows for sync and provides a REST API for both internal and external use.",
   "private": true,
   "dependencies": {
-    "better-sqlite3": "12.2.0"
+    "better-sqlite3": "12.2.0",
+    "sanitize-html": "^2.17.0"
   },
   "devDependencies": {
     "@electron/remote": "2.1.3",
EOF
This fix introduces these dependencies
Package Version Security advisories
sanitize-html (npm) 2.17.0 None

// Remove script and style content entirely first
let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');

Check failure

Code scanning / CodeQL

Incomplete multi-character sanitization High

This string may still contain <style, which may cause an HTML element injection vulnerability.

Copilot Autofix

AI 15 days ago

The best fix is to ensure that all occurrences of <style>...</style> blocks are fully stripped, even if multiple or nested/malformed instances occur (as attackers might trigger incomplete removal). To do this without changing existing functionality and with minimal impact, repeatedly apply the regex replacements for script/style blocks until the string no longer changes. This ensures that any re-exposed tags are also removed. The change should only affect the two regex replace calls for <script> and <style> in stripHtmlTags() in apps/server/src/migrations/0235__sqlite_native_search.ts. No extra dependencies are needed, only a small refactor of these lines into a loop.


Suggested changeset 1
apps/server/src/migrations/0235__sqlite_native_search.ts

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/server/src/migrations/0235__sqlite_native_search.ts b/apps/server/src/migrations/0235__sqlite_native_search.ts
--- a/apps/server/src/migrations/0235__sqlite_native_search.ts
+++ b/apps/server/src/migrations/0235__sqlite_native_search.ts
@@ -72,8 +72,14 @@
     if (!html) return '';
     
     // Remove script and style content entirely first
-    let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
-    text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
+    let text = html;
+    // Repeatedly remove <script> and <style> blocks until there are none left (to prevent incomplete multi-character sanitization)
+    let prev;
+    do {
+        prev = text;
+        text = text.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
+        text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
+    } while (text !== prev);
     
     // Use utils stripTags for consistency
     text = stripTags(text);
EOF
text = text.replace(/&nbsp;/g, ' ');
text = text.replace(/&lt;/g, '<');
text = text.replace(/&gt;/g, '>');
text = text.replace(/&amp;/g, '&');

Check failure

Code scanning / CodeQL

Double escaping or unescaping High

This replacement may produce '&' characters that are double-unescaped here.

Copilot Autofix

AI 15 days ago

To fix the double-unescaping bug, decode the entities in the correct order: every entity except &amp; first, and &amp; last. In this code, that means reordering lines 83–87 so that .replace(/&amp;/g, '&') runs after the other entity-decoding steps.

Make this change in the stripHtmlTags function in apps/server/src/migrations/0235__sqlite_native_search.ts:

  • Move text = text.replace(/&amp;/g, '&'); to after the replacements for &quot; and &#39;.

No imports, definitions, or additional libraries are required; only the order of these replacement lines changes.
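
A minimal sketch of why the order matters. The two functions mirror the replacement chains before and after the reordering; their names and the sample input are hypothetical.

```typescript
// Order before the fix: &amp; is decoded before &quot; and &#39;.
function decodeAmpEarly(text: string): string {
    return text
        .replace(/&nbsp;/g, " ")
        .replace(/&lt;/g, "<")
        .replace(/&gt;/g, ">")
        .replace(/&amp;/g, "&")
        .replace(/&quot;/g, '"')
        .replace(/&#39;/g, "'");
}

// Order after the fix: &amp; is decoded last.
function decodeAmpLast(text: string): string {
    return text
        .replace(/&nbsp;/g, " ")
        .replace(/&lt;/g, "<")
        .replace(/&gt;/g, ">")
        .replace(/&quot;/g, '"')
        .replace(/&#39;/g, "'")
        .replace(/&amp;/g, "&");
}

// Escaped once more than usual: the author meant the literal text "&quot;".
const escapedTwice = "&amp;quot;";

console.log(decodeAmpEarly(escapedTwice)); // decoded twice, ends up as a quote character
console.log(decodeAmpLast(escapedTwice));  // decoded once, stays "&quot;"
```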

Suggested changeset 1
apps/server/src/migrations/0235__sqlite_native_search.ts

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/server/src/migrations/0235__sqlite_native_search.ts b/apps/server/src/migrations/0235__sqlite_native_search.ts
--- a/apps/server/src/migrations/0235__sqlite_native_search.ts
+++ b/apps/server/src/migrations/0235__sqlite_native_search.ts
@@ -82,9 +82,9 @@
     text = text.replace(/&nbsp;/g, ' ');
     text = text.replace(/&lt;/g, '<');
     text = text.replace(/&gt;/g, '>');
-    text = text.replace(/&amp;/g, '&');
     text = text.replace(/&quot;/g, '"');
     text = text.replace(/&#39;/g, "'");
+    text = text.replace(/&amp;/g, '&');
     
     // Normalize whitespace
     text = text.replace(/\s+/g, ' ').trim();
EOF

// First remove script and style content entirely (including the tags)
// This needs to happen before stripTags to remove the content
text = text.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');

Check failure

Code scanning / CodeQL

Incomplete multi-character sanitization High

This string may still contain <script, which may cause an HTML element injection vulnerability.

Copilot Autofix

AI 15 days ago

The best way to fix this is to ensure repeated removal is performed until all script/style tags are fully stripped, regardless of how input mutates during replacement. This can be achieved by using a loop to apply the .replace() until there are no more replacements. Alternatively, use a robust sanitization library such as sanitize-html, but since we're limited to shown code and not introducing major codebase changes, repeated replacement is the most direct and non-breaking fix. Concretely, modify the lines where <script> and <style> tags are removed to repeatedly apply the regex replacements until a fixpoint is reached (nothing more is removed). This must all be done within apps/server/src/services/search/sqlite_functions.ts inside the stripHtml method.

No new imports are needed. Only the removal logic for script/style content must change.
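
One detail worth checking when applying the fixpoint idea: running the script loop to completion and then the style loop is not the same as looping over both patterns together, because removing a style block can splice a new script block into existence. The sketch below uses the two regular expressions from the code above; replaceUntilStable and the sample input are hypothetical.

```typescript
const SCRIPT_BLOCK = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;
const STYLE_BLOCK = /<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi;

// Hypothetical helper: apply every pattern in turn, and repeat the whole
// round until a round leaves the string unchanged.
function replaceUntilStable(input: string, patterns: RegExp[]): string {
    let text = input;
    for (;;) {
        let next = text;
        for (const pattern of patterns) {
            next = next.replace(pattern, "");
        }
        if (next === text) {
            return text;
        }
        text = next;
    }
}

// Two separate loops: scripts to a fixpoint first, then styles to a fixpoint.
function stripSequentially(html: string): string {
    return replaceUntilStable(replaceUntilStable(html, [SCRIPT_BLOCK]), [STYLE_BLOCK]);
}

// One combined loop over both patterns.
function stripCombined(html: string): string {
    return replaceUntilStable(html, [SCRIPT_BLOCK, STYLE_BLOCK]);
}

// Removing "<style></style>" joins "<scr" and "ipt>" after the script loop is done.
const interleaved = "<scr<style></style>ipt>x</script>";

console.log(stripSequentially(interleaved)); // a script block is left behind
console.log(stripCombined(interleaved));     // empty string
```

Whether this matters in practice depends on what runs afterwards; here stripTags still removes the remaining tags, but the script body text would stay in the output.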


Suggested changeset 1
apps/server/src/services/search/sqlite_functions.ts

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/server/src/services/search/sqlite_functions.ts b/apps/server/src/services/search/sqlite_functions.ts
--- a/apps/server/src/services/search/sqlite_functions.ts
+++ b/apps/server/src/services/search/sqlite_functions.ts
@@ -408,8 +408,17 @@
             
             // First remove script and style content entirely (including the tags)
             // This needs to happen before stripTags to remove the content
-            text = text.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
-            text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
+            // Repeatedly remove script tags to fix incomplete multi-character sanitization
+            let prev;
+            do {
+                prev = text;
+                text = text.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
+            } while (text !== prev);
+            // Repeatedly remove style tags
+            do {
+                prev = text;
+                text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
+            } while (text !== prev);
             
             // Now use stripTags to remove remaining HTML tags
             text = stripTags(text);
EOF
// Decode common HTML entities
text = text.replace(/&lt;/g, '<');
text = text.replace(/&gt;/g, '>');
text = text.replace(/&amp;/g, '&');

Check failure

Code scanning / CodeQL

Double escaping or unescaping High

This replacement may produce '&' characters that are double-unescaped here.

Copilot Autofix

AI 15 days ago

To fix this issue, reorder the entity unescaping code so that the ampersand entity (&amp;) is decoded last, after all other decodings. This avoids the double unescape problem: if &amp;quot; is present, it remains &amp;quot; while other entities are decoded, until the final pass where &amp; is decoded to &.

Specifically, in apps/server/src/services/search/sqlite_functions.ts, lines 418–424 contain a sequence of .replace() calls for decoding HTML entities. Move the line text = text.replace(/&amp;/g, '&'); (currently line 420) so that it comes immediately after all other .replace() calls for &lt;, &gt;, &quot;, &#39;, &apos;, and &nbsp;. The order should be: unescape all non-ampersand entities first, then unescape ampersand.

No additional imports or logic are needed.
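
An alternative to reordering, shown here only as a sketch and not as what the patch does, is to decode all entities in a single pass so that no output of one replacement is ever scanned again. The names ENTITIES and decodeEntitiesOnce are hypothetical; the entity list is the one handled in the code above.

```typescript
// Lookup table for the entities decoded by the original replace chain.
const ENTITIES: Record<string, string> = {
    "&lt;": "<",
    "&gt;": ">",
    "&amp;": "&",
    "&quot;": '"',
    "&#39;": "'",
    "&apos;": "'",
    "&nbsp;": " ",
};

// Single pass: each entity in the input is replaced exactly once, and the
// replacement text is never re-examined, so the order of entries is irrelevant.
function decodeEntitiesOnce(text: string): string {
    return text.replace(
        /&(?:lt|gt|amp|quot|apos|nbsp|#39);/g,
        (entity) => ENTITIES[entity] ?? entity,
    );
}

console.log(decodeEntitiesOnce("&amp;quot;"));
console.log(decodeEntitiesOnce("&lt;b&gt; &amp; &apos;x&apos;&nbsp;&#39;"));
```

With this shape, adding another entity means adding one table entry and one alternative in the pattern, with no ordering constraint to remember.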


Suggested changeset 1
apps/server/src/services/search/sqlite_functions.ts

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/server/src/services/search/sqlite_functions.ts b/apps/server/src/services/search/sqlite_functions.ts
--- a/apps/server/src/services/search/sqlite_functions.ts
+++ b/apps/server/src/services/search/sqlite_functions.ts
@@ -417,11 +417,11 @@
             // Decode common HTML entities
             text = text.replace(/&lt;/g, '<');
             text = text.replace(/&gt;/g, '>');
-            text = text.replace(/&amp;/g, '&');
             text = text.replace(/&quot;/g, '"');
             text = text.replace(/&#39;/g, "'");
             text = text.replace(/&apos;/g, "'");
             text = text.replace(/&nbsp;/g, ' ');
+            text = text.replace(/&amp;/g, '&');
             
             // Normalize whitespace - reduce multiple spaces to single space
             // But don't trim leading/trailing space if it was from &nbsp;
EOF
}

// Remove script and style content
let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');

Check failure

Code scanning / CodeQL

Incomplete multi-character sanitization High

This string may still contain <script, which may cause an HTML element injection vulnerability.

Copilot Autofix

AI 15 days ago

The best fix is to repeatedly apply the regular expressions to remove <script> and <style> tags until no further matches are found, ensuring that malicious input cannot survive through incomplete multi-character replacements. This will guarantee that all possible instances, including those emerging after each replacement, are eliminated. The fix should be limited to the processHtmlContent function in this file. No code outside that function will be edited. No new dependencies are strictly required for this fix, but if the project would prefer to use a well-tested HTML sanitization library (like sanitize-html), that would be even better; however, per instructions, I will stick to the repeated replacement method inside the code region provided.


Suggested changeset 1
apps/server/src/services/search/sqlite_search_utils.ts

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/server/src/services/search/sqlite_search_utils.ts b/apps/server/src/services/search/sqlite_search_utils.ts
--- a/apps/server/src/services/search/sqlite_search_utils.ts
+++ b/apps/server/src/services/search/sqlite_search_utils.ts
@@ -115,10 +115,19 @@
         return '';
     }
     
-    // Remove script and style content
-    let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
-    text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
-    
+    // Remove script and style content (repeatedly until none remain)
+    let text = html;
+    // Remove all <script>...</script>
+    let prev;
+    do {
+        prev = text;
+        text = text.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
+    } while (text !== prev);
+    // Remove all <style>...</style>
+    do {
+        prev = text;
+        text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
+    } while (text !== prev);
     // Strip remaining tags
     text = stripTags(text);
     
EOF
}

// Remove script and style content
let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');

Check failure

Code scanning / CodeQL

Bad HTML filtering regexp High

This regular expression does not match script end tags like </script >.

Copilot Autofix

AI 15 days ago

The best way to fix this is to avoid regular expressions for HTML sanitization and use a well-known HTML parser or sanitization library instead, such as sanitize-html or DOMPurify. That would add a dependency, and this fix is restricted to the code shown, so the regex is hardened instead. Update the regex on line 119 so that it matches closing script tags with optional whitespace and attributes, such as </script > or </script foo="bar">. One option is to end the pattern with </script\b[^>]*> (case-insensitive); the patch below uses the lazy form <\/script[\s\S]*?>, which likewise accepts anything up to the next >.

Edit line 119 in apps/server/src/services/search/sqlite_search_utils.ts:

  • Replace the regex <script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script> with <script\b[^<]*(?:(?!<\/script[\s>])<[^<]*)*<\/script[\s\S]*?>, which is more robust against browser quirks and will match closing </script>, </script >, and </script foo="bar">.
  • Similarly, update the <style> removal regex on line 120.

No new imports or method definitions are required; only the two regular expressions change. An external sanitization library would require a new dependency, so this fix is limited to improving the regex.
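
The behavioral difference between the two script patterns can be checked with a short sketch. Both regular expressions are copied from the text above; the constant names and sample inputs are hypothetical.

```typescript
// Pattern flagged by CodeQL: the end tag must be exactly "</script>".
const ORIGINAL = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Suggested pattern: the end tag may carry whitespace or attributes before ">".
const HARDENED = /<script\b[^<]*(?:(?!<\/script[\s>])<[^<]*)*<\/script[\s\S]*?>/gi;

const samples = [
    "<script>a()</script>",
    "<script>a()</script >",
    '<script>a()</script foo="bar">',
];

// For each sample, record what is left after applying each pattern once.
const results = samples.map((html) => ({
    html,
    original: html.replace(ORIGINAL, ""),
    hardened: html.replace(HARDENED, ""),
}));

for (const row of results) {
    console.log(JSON.stringify(row));
}
```

Only the first sample is removed by the original pattern; the other two pass through it unchanged, while the hardened pattern removes all three.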

Suggested changeset 1
apps/server/src/services/search/sqlite_search_utils.ts

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/server/src/services/search/sqlite_search_utils.ts b/apps/server/src/services/search/sqlite_search_utils.ts
--- a/apps/server/src/services/search/sqlite_search_utils.ts
+++ b/apps/server/src/services/search/sqlite_search_utils.ts
@@ -115,9 +115,9 @@
         return '';
     }
     
-    // Remove script and style content
-    let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
-    text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
+    // Remove script and style content (make regex robust against alternative closing tags and whitespace)
+    let text = html.replace(/<script\b[^<]*(?:(?!<\/script[\s>])<[^<]*)*<\/script[\s\S]*?>/gi, '');
+    text = text.replace(/<style\b[^<]*(?:(?!<\/style[\s>])<[^<]*)*<\/style[\s\S]*?>/gi, '');
     
     // Strip remaining tags
     text = stripTags(text);
EOF

// Remove script and style content
let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');

Check failure

Code scanning / CodeQL

Incomplete multi-character sanitization High

This string may still contain <style, which may cause an HTML element injection vulnerability.

Copilot Autofix

AI 15 days ago

General Fix:
Ensure that all <style> tags and their contents are completely removed, even if new matches arise after each replacement. One straightforward solution is to reapply the replacement repeatedly until no further changes occur.

Best Specific Fix:
Apply the <style> replacement (and likewise the <script> removal) inside a loop that runs until the string no longer changes. A do...while loop per pattern is enough.

Region to Change:
Lines 119–120 in processHtmlContent, inside apps/server/src/services/search/sqlite_search_utils.ts.

Required Methods/Imports:
No new methods or imports are needed; only local logic changes.
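As an illustration of why a single pass can be incomplete, here is a minimal sketch. The helper names and the sample input are hypothetical; only the regular expression is taken from the flagged code. Removing an inner match can splice the surrounding fragments into a new `<style>` element, which only a repeated pass catches.

```typescript
// Pattern copied from the flagged line in processHtmlContent.
const STYLE_BLOCK = /<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi;

// Single pass, as in the original code.
function removeOnce(input: string, pattern: RegExp): string {
    return input.replace(pattern, '');
}

// Repeat the replacement until the string stops changing (fixed point).
function removeUntilStable(input: string, pattern: RegExp): string {
    let text = input;
    let prev: string;
    do {
        prev = text;
        text = text.replace(pattern, '');
    } while (text !== prev);
    return text;
}

// Hypothetical input: an empty <style></style> pair hidden inside a split "<sty...le>" tag.
const nestedStyle = '<sty<style></style>le>evil</style>';

const singlePass = removeOnce(nestedStyle, STYLE_BLOCK);
const stablePass = removeUntilStable(nestedStyle, STYLE_BLOCK);
```

Here `singlePass` is `'<style>evil</style>'`, because deleting the inner pair joins `<sty` and `le>`, while `stablePass` is the empty string.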


Suggested changeset 1
apps/server/src/services/search/sqlite_search_utils.ts

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/server/src/services/search/sqlite_search_utils.ts b/apps/server/src/services/search/sqlite_search_utils.ts
--- a/apps/server/src/services/search/sqlite_search_utils.ts
+++ b/apps/server/src/services/search/sqlite_search_utils.ts
@@ -115,9 +115,19 @@
         return '';
     }
     
-    // Remove script and style content
-    let text = html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
-    text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
+    // Remove script and style content completely, applying multiple times if necessary
+    let text = html;
+    let prevText;
+    // Remove all <script>...</script> tags
+    do {
+        prevText = text;
+        text = text.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
+    } while (text !== prevText);
+    // Remove all <style>...</style> tags
+    do {
+        prevText = text;
+        text = text.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, '');
+    } while (text !== prevText);
     
     // Strip remaining tags
     text = stripTags(text);
EOF
text = text.replace(/&nbsp;/g, ' ');
text = text.replace(/&lt;/g, '<');
text = text.replace(/&gt;/g, '>');
text = text.replace(/&amp;/g, '&');

Check failure

Code scanning / CodeQL

Double escaping or unescaping High

This replacement may produce '&' characters that are double-unescaped here.

Copilot Autofix

AI 15 days ago

The underlying problem is the order of the HTML entity replacements in the decoding section of the processHtmlContent function. Because &amp; is decoded before &quot;, &#39;, and &apos;, an input such as &amp;quot; first becomes &quot; and is then decoded a second time to ("), even though a single decode should leave the literal text &quot;. To fix this, move text.replace(/&amp;/g, '&') after all other entity replacements, so that the '&' characters it produces are never re-scanned by a later replacement. Only the order of lines 126 to 132 changes.

No new imports or methods are needed, as the code is doing simple string replacements.
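The effect of the ordering can be sketched as follows. Both function names are hypothetical, and each function covers only a subset of the entities handled in processHtmlContent; the point is the position of the &amp; replacement.

```typescript
// Order as in the flagged code: &amp; is decoded before &quot;.
function decodeAmpEarly(text: string): string {
    return text
        .replace(/&lt;/g, '<')
        .replace(/&gt;/g, '>')
        .replace(/&amp;/g, '&')
        .replace(/&quot;/g, '"');
}

// Order after the suggested fix: &amp; is decoded last.
function decodeAmpLast(text: string): string {
    return text
        .replace(/&lt;/g, '<')
        .replace(/&gt;/g, '>')
        .replace(/&quot;/g, '"')
        .replace(/&amp;/g, '&');
}

// "&amp;quot;" is the escaped form of the literal text "&quot;".
const escapedEntity = '&amp;quot;';

const earlyResult = decodeAmpEarly(escapedEntity);
const lastResult = decodeAmpLast(escapedEntity);
```

With the early order, `earlyResult` is a bare `"` (decoded twice). With &amp; last, `lastResult` is the literal text `&quot;`, which is the correct single decode.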


Suggested changeset 1
apps/server/src/services/search/sqlite_search_utils.ts

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/server/src/services/search/sqlite_search_utils.ts b/apps/server/src/services/search/sqlite_search_utils.ts
--- a/apps/server/src/services/search/sqlite_search_utils.ts
+++ b/apps/server/src/services/search/sqlite_search_utils.ts
@@ -126,10 +126,10 @@
     text = text.replace(/&nbsp;/g, ' ');
     text = text.replace(/&lt;/g, '<');
     text = text.replace(/&gt;/g, '>');
-    text = text.replace(/&amp;/g, '&');
     text = text.replace(/&quot;/g, '"');
     text = text.replace(/&#39;/g, "'");
     text = text.replace(/&apos;/g, "'");
+    text = text.replace(/&amp;/g, '&');
     
     // Normalize whitespace
     text = text.replace(/\s+/g, ' ').trim();
EOF