-
Notifications
You must be signed in to change notification settings - Fork 159
better favicon handling #466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
| const patterns = [ | ||
| // name attribute with double-quoted content | ||
| new RegExp( | ||
| `<meta\\s+[^>]*name\\s*=\\s*["']${nameOrProperty}["'][^>]*content\\s*=\\s*"([^"]*)"`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nameOrProperty parameter is directly interpolated into regex patterns without escaping regex metacharacters, creating a potential regex injection vulnerability and correctness issue if dynamic meta tag names are ever used.
View Details
📝 Patch Details
diff --git a/apps/scan/src/services/scraper/metadata.ts b/apps/scan/src/services/scraper/metadata.ts
index 87f62a55..31b96ca4 100644
--- a/apps/scan/src/services/scraper/metadata.ts
+++ b/apps/scan/src/services/scraper/metadata.ts
@@ -4,6 +4,13 @@ interface Metadata {
url?: string;
}
+/**
+ * Escapes regex metacharacters to prevent regex injection and ensure literal matching
+ */
+const escapeRegex = (str: string): string => {
+ return str.replace(/[.*+?^${}()|[\]\\]/g, '\\<!-- PATCH_DETAILS_PLACEHOLDER -->');
+};
+
/**
* Extracts the content from a meta tag
* Uses separate patterns for double-quoted and single-quoted values to handle quotes in content
@@ -12,47 +19,51 @@ const extractMetaContent = (
html: string,
nameOrProperty: string
): string | null => {
+ // Escape regex metacharacters in the name/property to prevent regex injection
+ // and ensure exact matching of literal characters
+ const escapedNameOrProperty = escapeRegex(nameOrProperty);
+
// Match meta tags with name or property attribute
// Use separate patterns for each quote type to properly handle quotes in values
const patterns = [
// name attribute with double-quoted content
new RegExp(
- `<meta\\s+[^>]*name\\s*=\\s*["']${nameOrProperty}["'][^>]*content\\s*=\\s*"([^"]*)"`,
+ `<meta\\s+[^>]*name\\s*=\\s*["']${escapedNameOrProperty}["'][^>]*content\\s*=\\s*"([^"]*)"`,
'i'
),
// name attribute with single-quoted content
new RegExp(
- `<meta\\s+[^>]*name\\s*=\\s*["']${nameOrProperty}["'][^>]*content\\s*=\\s*'([^']*)'`,
+ `<meta\\s+[^>]*name\\s*=\\s*["']${escapedNameOrProperty}["'][^>]*content\\s*=\\s*'([^']*)'`,
'i'
),
// content before name with double quotes
new RegExp(
- `<meta\\s+[^>]*content\\s*=\\s*"([^"]*)"[^>]*name\\s*=\\s*["']${nameOrProperty}["']`,
+ `<meta\\s+[^>]*content\\s*=\\s*"([^"]*)"[^>]*name\\s*=\\s*["']${escapedNameOrProperty}["']`,
'i'
),
// content before name with single quotes
new RegExp(
- `<meta\\s+[^>]*content\\s*=\\s*'([^']*)'[^>]*name\\s*=\\s*["']${nameOrProperty}["']`,
+ `<meta\\s+[^>]*content\\s*=\\s*'([^']*)'[^>]*name\\s*=\\s*["']${escapedNameOrProperty}["']`,
'i'
),
// property attribute with double-quoted content
new RegExp(
- `<meta\\s+[^>]*property\\s*=\\s*["']${nameOrProperty}["'][^>]*content\\s*=\\s*"([^"]*)"`,
+ `<meta\\s+[^>]*property\\s*=\\s*["']${escapedNameOrProperty}["'][^>]*content\\s*=\\s*"([^"]*)"`,
'i'
),
// property attribute with single-quoted content
new RegExp(
- `<meta\\s+[^>]*property\\s*=\\s*["']${nameOrProperty}["'][^>]*content\\s*=\\s*'([^']*)'`,
+ `<meta\\s+[^>]*property\\s*=\\s*["']${escapedNameOrProperty}["'][^>]*content\\s*=\\s*'([^']*)'`,
'i'
),
// content before property with double quotes
new RegExp(
- `<meta\\s+[^>]*content\\s*=\\s*"([^"]*)"[^>]*property\\s*=\\s*["']${nameOrProperty}["']`,
+ `<meta\\s+[^>]*content\\s*=\\s*"([^"]*)"[^>]*property\\s*=\\s*["']${escapedNameOrProperty}["']`,
'i'
),
// content before property with single quotes
new RegExp(
- `<meta\\s+[^>]*content\\s*=\\s*'([^']*)'[^>]*property\\s*=\\s*["']${nameOrProperty}["']`,
+ `<meta\\s+[^>]*content\\s*=\\s*'([^']*)'[^>]*property\\s*=\\s*["']${escapedNameOrProperty}["']`,
'i'
),
];
Analysis
Unescaped regex metacharacters in extractMetaContent() nameOrProperty parameter
What fails: extractMetaContent() in apps/scan/src/services/scraper/metadata.ts fails to match meta tags with names/properties containing regex metacharacters like [, ], *, +, ?, (, ), {, }, ^, $, |, ., or \. These characters are interpreted as regex operators instead of literals, preventing correct matching.
How to reproduce:
const escapeRegex = (str) => {
return str.replace(/[.*+?^ Result: Returns null - unable to match the meta tag
Expected: Should return "first item" - the meta tag with brackets in the name should be matched literally
Fix applied: Added escapeRegex() helper function to escape regex metacharacters before interpolating into regex patterns, ensuring all 8 regex patterns in extractMetaContent() treat the parameter as literal text rather than regex operators.
| attrName: string | ||
| ): string | null => { | ||
| // Try double-quoted value first | ||
| let match = new RegExp(`${attrName}\\s*=\\s*"([^"]*)"`).exec(attrs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The attrName parameter is directly interpolated into regex patterns without escaping regex metacharacters, creating a potential regex injection vulnerability if dynamic attribute names are ever used.
View Details
📝 Patch Details
diff --git a/apps/scan/src/services/scraper/favicon.ts b/apps/scan/src/services/scraper/favicon.ts
index b8ce21c1..bbc2a3c1 100644
--- a/apps/scan/src/services/scraper/favicon.ts
+++ b/apps/scan/src/services/scraper/favicon.ts
@@ -65,6 +65,13 @@ const getIconPriority = (
return 6;
};
+/**
+ * Escapes regex metacharacters in a string to prevent regex injection
+ */
+const escapeRegexMetacharacters = (str: string): string => {
+ return str.replace(/[.*+?^${}()|[\]\\]/g, '\\<!-- PATCH_DETAILS_PLACEHOLDER -->');
+};
+
/**
* Extracts an attribute value, handling both double and single quotes
* This allows values to contain the opposite quote character (e.g., apostrophes in double-quoted values)
@@ -73,12 +80,14 @@ const extractAttributeValue = (
attrs: string,
attrName: string
): string | null => {
+ const escapedName = escapeRegexMetacharacters(attrName);
+
// Try double-quoted value first
- let match = new RegExp(`${attrName}\\s*=\\s*"([^"]*)"`).exec(attrs);
+ let match = new RegExp(`${escapedName}\\s*=\\s*"([^"]*)"`).exec(attrs);
if (match?.[1] !== undefined) return match[1];
// Try single-quoted value
- match = new RegExp(`${attrName}\\s*=\\s*'([^']*)'`).exec(attrs);
+ match = new RegExp(`${escapedName}\\s*=\\s*'([^']*)'`).exec(attrs);
if (match?.[1] !== undefined) return match[1];
return null;
diff --git a/apps/scan/src/services/scraper/metadata.ts b/apps/scan/src/services/scraper/metadata.ts
index 87f62a55..4695886d 100644
--- a/apps/scan/src/services/scraper/metadata.ts
+++ b/apps/scan/src/services/scraper/metadata.ts
@@ -4,6 +4,13 @@ interface Metadata {
url?: string;
}
+/**
+ * Escapes regex metacharacters in a string to prevent regex injection
+ */
+const escapeRegexMetacharacters = (str: string): string => {
+ return str.replace(/[.*+?^${}()|[\]\\]/g, '\\<!-- PATCH_DETAILS_PLACEHOLDER -->');
+};
+
/**
* Extracts the content from a meta tag
* Uses separate patterns for double-quoted and single-quoted values to handle quotes in content
@@ -12,47 +19,49 @@ const extractMetaContent = (
html: string,
nameOrProperty: string
): string | null => {
+ const escaped = escapeRegexMetacharacters(nameOrProperty);
+
// Match meta tags with name or property attribute
// Use separate patterns for each quote type to properly handle quotes in values
const patterns = [
// name attribute with double-quoted content
new RegExp(
- `<meta\\s+[^>]*name\\s*=\\s*["']${nameOrProperty}["'][^>]*content\\s*=\\s*"([^"]*)"`,
+ `<meta\\s+[^>]*name\\s*=\\s*["']${escaped}["'][^>]*content\\s*=\\s*"([^"]*)"`,
'i'
),
// name attribute with single-quoted content
new RegExp(
- `<meta\\s+[^>]*name\\s*=\\s*["']${nameOrProperty}["'][^>]*content\\s*=\\s*'([^']*)'`,
+ `<meta\\s+[^>]*name\\s*=\\s*["']${escaped}["'][^>]*content\\s*=\\s*'([^']*)'`,
'i'
),
// content before name with double quotes
new RegExp(
- `<meta\\s+[^>]*content\\s*=\\s*"([^"]*)"[^>]*name\\s*=\\s*["']${nameOrProperty}["']`,
+ `<meta\\s+[^>]*content\\s*=\\s*"([^"]*)"[^>]*name\\s*=\\s*["']${escaped}["']`,
'i'
),
// content before name with single quotes
new RegExp(
- `<meta\\s+[^>]*content\\s*=\\s*'([^']*)'[^>]*name\\s*=\\s*["']${nameOrProperty}["']`,
+ `<meta\\s+[^>]*content\\s*=\\s*'([^']*)'[^>]*name\\s*=\\s*["']${escaped}["']`,
'i'
),
// property attribute with double-quoted content
new RegExp(
- `<meta\\s+[^>]*property\\s*=\\s*["']${nameOrProperty}["'][^>]*content\\s*=\\s*"([^"]*)"`,
+ `<meta\\s+[^>]*property\\s*=\\s*["']${escaped}["'][^>]*content\\s*=\\s*"([^"]*)"`,
'i'
),
// property attribute with single-quoted content
new RegExp(
- `<meta\\s+[^>]*property\\s*=\\s*["']${nameOrProperty}["'][^>]*content\\s*=\\s*'([^']*)'`,
+ `<meta\\s+[^>]*property\\s*=\\s*["']${escaped}["'][^>]*content\\s*=\\s*'([^']*)'`,
'i'
),
// content before property with double quotes
new RegExp(
- `<meta\\s+[^>]*content\\s*=\\s*"([^"]*)"[^>]*property\\s*=\\s*["']${nameOrProperty}["']`,
+ `<meta\\s+[^>]*content\\s*=\\s*"([^"]*)"[^>]*property\\s*=\\s*["']${escaped}["']`,
'i'
),
// content before property with single quotes
new RegExp(
- `<meta\\s+[^>]*content\\s*=\\s*'([^']*)'[^>]*property\\s*=\\s*["']${nameOrProperty}["']`,
+ `<meta\\s+[^>]*content\\s*=\\s*'([^']*)'[^>]*property\\s*=\\s*["']${escaped}["']`,
'i'
),
];
Analysis
Regex Injection Vulnerability in extractAttributeValue() and extractMetaContent()
What fails: extractAttributeValue() in favicon.ts and extractMetaContent() in metadata.ts interpolate user-controllable parameters directly into RegExp patterns without escaping regex metacharacters. This causes:
- Regex metacharacters in attribute names to be misinterpreted as regex patterns rather than literal characters
- Incorrect matching behavior when attribute names contain metacharacters like
.,*,+,[,],^,$,{,},(,),|,\
How to reproduce:
// Attribute name with dot (. matches any character in regex)
const attrs = 'data.name="value1" dataxname="value2"';
const result = extractAttributeValue(attrs, 'data.name'); // Should find "value1"
// Before fix: returns "value1" ✓ (but only by accident - . matches any character)
// Problem: If HTML had 'dataxname', it would also match (because . matches x)
// Attribute name with asterisk (* means zero or more in regex)
const attrs2 = 'data*name="test-value"';
const result2 = extractAttributeValue(attrs2, 'data*name');
// Before fix: returns null (incorrectly fails)
// After fix: returns "test-value" ✓
// Attribute name with plus (+ means one or more in regex)
const attrs3 = 'data+name="correct" dataname="wrong"';
const result3 = extractAttributeValue(attrs3, 'data+name');
// Before fix: returns "wrong" (matches 'dataname' instead of 'data+name')
// After fix: returns "correct" ✓Result: Regex metacharacters cause incorrect regex patterns to be generated, leading to wrong matching behavior. For example, data+name becomes /data+name/ which matches "dataaaa...name" instead of literally "data+name".
Expected: All parameters should be escaped before interpolation into RegExp patterns to ensure they are treated as literal strings rather than regex metacharacters.
Fix applied: Added escapeRegexMetacharacters() utility function that escapes all regex metacharacters using .replace(/[.*+?^ before interpolating parameters into RegExp patterns in both files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fmhall not sure how valid this is
| // Handle API subdomain fallback | ||
| if (origin.startsWith('https://api.')) { | ||
| origin = `https://${origin.slice(12)}`; | ||
| const baseOrigin = `https://${origin.slice(12)}`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When processing API subdomains, the origin field is always updated to the base domain even when all data was successfully retrieved from the API subdomain, causing the function to incorrectly report the data source.
View Details
📝 Patch Details
diff --git a/apps/scan/src/services/scraper/index.ts b/apps/scan/src/services/scraper/index.ts
index 78667617..9372b273 100644
--- a/apps/scan/src/services/scraper/index.ts
+++ b/apps/scan/src/services/scraper/index.ts
@@ -67,20 +67,26 @@ export const scrapeOriginData = async (inputOrigin: string) => {
const baseData = await parseAllFromHtml(baseHtml, baseOrigin);
// Fill in missing data from base domain
+ let dataFilledFromBase = false;
if (needsOg && hasOgData(baseData.og)) {
og = baseData.og;
+ dataFilledFromBase = true;
}
if (needsMetadata && hasMetadata(baseData.metadata)) {
metadata = baseData.metadata;
+ dataFilledFromBase = true;
}
if (needsFavicon && baseData.favicon) {
favicon = baseData.favicon;
+ dataFilledFromBase = true;
+ }
+
+ // Only update origin to base if we actually used data from it
+ if (dataFilledFromBase) {
+ origin = baseOrigin;
}
}
}
-
- // Update origin to base for URL resolution
- origin = baseOrigin;
}
return {
Analysis
Origin field incorrectly updated to base domain in scrapeOriginData() even when API subdomain data is complete
What fails: The scrapeOriginData() function in apps/scan/src/services/scraper/index.ts always updates the returned origin field to the base domain when the input origin starts with https://api., regardless of whether data was actually fetched from the base domain. This causes the function to report incorrect source information.
How to reproduce:
- Call
scrapeOriginData('https://api.example.com') - The API subdomain successfully returns complete HTML with OG data, metadata, and favicon
- Verify the return value's
originfield
Result: Returns origin: 'https://example.com' even though all data came from https://api.example.com
Expected: Should return origin: 'https://api.example.com' since no data was fetched from the base domain
Root cause: Line 83 in the original code unconditionally executes origin = baseOrigin; whenever the condition origin.startsWith('https://api.') is true, regardless of whether any data was actually filled from the base domain.
Impact: The developer.ts endpoint returns incorrect scrapedOrigin in the preview response, causing users to see wrong source information. This also affects any downstream code that relies on the origin field to determine which domain the metadata came from.
|
I would do a DOM parser instead of a regex in extractIconCandidates |
No description provided.