Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
164 changes: 164 additions & 0 deletions docs/solr-escaping-definition.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,164 @@
# Solr Escaping Definition

## Purpose

This document defines where, when, and how escaping must occur when building Solr/Lucene queries.
Follow this document to avoid producing invalid queries, broken boolean logic, or double-escaping bugs.

## Core Rule (Read This First)

Escaping is context-dependent and MUST happen at the final embedding step only.

❌ Never escape raw user input
❌ Never escape tokens
❌ Never escape before semantics are known
❌ Escaping early destroys intent.

✅ Escape only when inserting a value into field:( … )

## Query Construction Pipeline

The query builder has four distinct layers. Each layer has a strict responsibility.

1. Raw input
2. Normalization & tokenization
3. Semantic expansion
4. Context-aware escaping

### Layer Responsibilities

1. Raw Input

Example: `nature, and history`

- Allowed:
- Trimming
- Unicode normalization
- Validation (balanced quotes, no junk)
- Forbidden:
- Escaping
- Quoting
- Backslashes

2. Tokenization & Normalization

**Input**: `nature, and history`
**Tokens**: `["nature,", "and", "history"]`

- Allowed:
- Splitting into tokens
- Lowercasing (optional)
- Forbidden:
- Escaping tokens
- Detecting boolean meaning
- Adding quotes

3. Semantic Expansion (NO ESCAPING)

The semantic structure defines intent, not syntax.

``` $values = [
'onephrase' => '"nature, and history"',
'and' => 'nature AND and AND history',
'or' => 'nature OR and OR history',
'asis' => 'nature, and history', // raw input as tge yser typed it.
'compressed' => 'nature,andhistory',
'exactmatcher' => 'natureandhistory',
'emstartswith' => 'natureandhistory*'
];
```

- Allowed:
- Adding AND / OR
- Adding quotes for phrases
- Adding * for prefixes
- Forbidden:
- Escaping special characters
- Adding backslashes
- Lucene syntax manipulation

4. Context-Aware Escaping (MANDATORY)

Escaping happens only here, inside `__buildQueryString()` in `Solr.php`.

Each semantic value MUST use the correct escaper.

**Semantic Type Escaper Purpose**
onephrase escapePhrase() Quoted phrase
and, or escapeBoolean() Boolean expression
emstartswith escapePrefix() Prefix query
everything else escapeTerm() Literal term

## Escaper Contracts

`escapeTerm(string $input)`

- Escapes Lucene special characters
- DOES NOT add quotes
- DOES NOT interpret boolean logic

✅ Correct for: exact terms, IDs, compressed strings
❌ Wrong for: phrases see `escapePhrase()`

`escapePhrase(string $input)`

- Removes outer quotes if present
- Escapes inner content
- Re-adds exactly one pair of quotes

**Input**: `"nature, and history"`
**Output**: `"nature\, and history"`

❌ Must never double-quote
❌ Must never accept pre-escaped strings

`escapeBoolean(string $expr)`

- Preserves AND/OR operators
- Escapes only operands
- MUST NOT quote operands automatically

**Input**: `nature AND and AND history`
**Output**: `nature AND and AND history`

(operators preserved, literals escaped as needed)

❌ Must not introduce "and"
❌ Must not escape operators

`escapePrefix(string $prefix)`

- Removes trailing *
- Escapes base term
- Re-adds *

**Input**: `natureandhistory*`
**Output**: `natureandhistory*`

❌ Must never allow leading wildcards

## Explicit Anti-Patterns (DO NOT DO THESE)

❌ Escaping inside tokenization
❌ Escaping $lookfor directly
❌ Escaping tokens before implode()
❌ Quoting inside escapeBoolean()
❌ Passing pre-escaped strings into escapers
❌ Escaping twice “just to be safe”

## Maintenance Checklist (Before Any Change)

Before modifying escaping logic, confirm:

- Am I escaping only at embedding time?
- Does this escaper handle exactly one semantic role?
- Could this cause double-quoting?
- Could this break boolean operators?
- Do existing tests still pass?
- If unsure → STOP and re-read this document.

## Summary
Semantics first. Escaping last. Always.
17 changes: 6 additions & 11 deletions playwright/test/fast/bugfix/search-quotes.spec.js
Original file line number Diff line number Diff line change
@@ -1,15 +1,5 @@
import { test, expect } from '@playwright/test';

test('smartquote_problem_talking_catalog', async ({ page, baseURL }) => {
const warning = 'There was a problem talking to the catalog.'

await page.goto('/Search/Home?lookfor=smart'); // no smart quote, no problem, no warning
await expect(page.getByText(warning)).not.toBeVisible();

await page.goto('/Search/Home?lookfor=“smart'); // smart quote, problem, warning
await expect(page.getByText(warning)).toBeVisible();
});

test('unbalanced_quotes', async ({ page, baseURL }) => {
/*
If an odd number of quotes are given in the search string, we remove the last one
Expand All @@ -34,5 +24,10 @@ test('unbalanced_quotes', async ({ page, baseURL }) => {
await page.goto('/Search/Home?lookfor="norfolk""'); // 3 quotes, unbalanced:
await expect(page.getByText(warning)).toBeVisible(); // expect warn
await expect(page.getByText(chaucer)).toBeVisible(); // results visible
// etc

// Unbalanced funcy quotes
await page.goto('/Search/Home?lookfor=“norfolk'); // 1 funcy quote, unbalanced:
await expect(page.getByText(warning)).toBeVisible(); // expect warn
await expect(page.getByText(chaucer)).toBeVisible(); // results visible

});
1 change: 1 addition & 0 deletions services/Search/Home.php
Original file line number Diff line number Diff line change
Expand Up @@ -79,6 +79,7 @@ function launch()
}
$this->search();
} else {

// Otherwie, display the home page
$interface->setPageTitle('Search Home');
$interface->assign('isTheHomePage', true);
Expand Down
37 changes: 33 additions & 4 deletions services/Search/SearchStructure.php
Original file line number Diff line number Diff line change
Expand Up @@ -84,6 +84,7 @@ static function fromHash($hash)
$c = __CLASS__;
$obj = new $c(true);
$obj->_fillFromHash($hash);

return $obj;
}

Expand Down Expand Up @@ -230,6 +231,7 @@ private function _fillFromHash($hash)
} elseif (count($ss) == 1) {

$index = $ss[0][0];

if (isset($this->force_standard[$index]) && $this->force_standard[$index]) {
$this->use_dismax = false;
}
Expand Down Expand Up @@ -925,6 +927,9 @@ static function displayStrip($v)
}
return "Between $start and $end";
} else {
// Remove square brackets and double quote from facet display.
// Dec 2025 Note: this is just a UI string, so this replacement should be unnecessary.
// return $v;
return preg_replace('/[\[\]\"]/', '', $v);
}

Expand Down Expand Up @@ -1023,6 +1028,7 @@ function facetDisplayName($facet)


// Take from http://www.toao.net/48-replacing-smart-quotes-and-em-dashes-in-mysql
// TODO: That function is never used
function convert_smart_quotes($text)
{
// First, replace UTF-8 characters.
Expand All @@ -1038,14 +1044,37 @@ function convert_smart_quotes($text)
return $text;
}

/**
* Normalize Unicode quotes to ASCII equivalents.
*/
private function normalizeQuotes(string $str): string
{
$map = [
// Double quotes
'“' => '"', '”' => '"', '„' => '"', '‟' => '"',

// Single quotes
'‘' => "'", '’' => "'", '‚' => "'", '‛' => "'",
];

return strtr($str, $map);
}

# Detects an odd number of double quotes and removes the last one.
function fix_unbalanced_quotes($str)
{
// Normalize smart / Unicode quotes first
$str = $this->normalizeQuotes($str);

//Count ASCII double quotes
if (substr_count($str, '"', 0) % 2 != 0) {
$pos = strrpos($str, "\"", -1);
# So we can inform the user
$this->fixedUnbalancedQuotes = true;
$str = substr_replace($str, '', $pos, 1);
# Find the last occurrence of a quote and remove it
$pos = strrpos($str, '"');
if ($pos !== false) {
# So we can inform the user
$this->fixedUnbalancedQuotes = true;
$str = substr_replace($str, '', $pos, 1);
}
}
return $str;
}
Expand Down
Loading