Add support for searching stopwords as keywords#9117
Add support for searching stopwords as keywords#9117eth3lbert wants to merge 3 commits intorust-lang:mainfrom
Conversation
| let qs = sql::<TsQuery>("plainto_tsquery('english', ") | ||
| .bind::<Text, _>(q_string) | ||
| .sql(")"); |
There was a problem hiding this comment.
Note: This could also be refactored to plainto_tsquery_with_search_config, which I've implemented upstream in diesel-rs/diesel_full_text_search#41. I incorrectly used to_tsquery_with_search_config in the #7941 , which is not equivalent to plainto_tsquery and caused an issue #8052 . And since carol has concerns about it so I haven't modified it in this PR.
| length(to_tsvector_with_search_config::<Text, _, _>( | ||
| TsConfigurationByName("english"), | ||
| q_string, | ||
| )) |
There was a problem hiding this comment.
We could also construct the sql ourselves if we have concerns about it.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #9117 +/- ##
==========================================
+ Coverage 89.20% 89.24% +0.04%
==========================================
Files 282 282
Lines 28513 28610 +97
==========================================
+ Hits 25435 25534 +99
+ Misses 3078 3076 -2 ☔ View full report in Codecov by Sentry. |
|
Additionally, while implementing this, I discovered a potential opportunity to improve the query speed of the search query. However, it would require leveraging a |
| -- UPDATE crates | ||
| -- SET updated_at = updated_at | ||
| -- FROM keywords_with_stopwords | ||
| -- WHERE id = crate_id AND NOT (keyword || ':B')::tsquery @@ textsearchable_index_col |
There was a problem hiding this comment.
Since our keyword doesn't contain spaces, it's safe to directly cast it to a tsquery.
|
In addition to searching the query string as a |
that is actually already supported, though only on the frontend, which transforms the |
| -- WHERE length(to_tsvector('english', keyword)) = 0 | ||
| -- ) | ||
| -- UPDATE crates | ||
| -- SET updated_at = updated_at |
There was a problem hiding this comment.
do we not have to set any other columns? 🤔
There was a problem hiding this comment.
No, because we don't want to modify the updated_at value. We only need to trigger trigger_crates_tsvector_update to update the textsearchable_index_col column, and I think this is sufficient.
There was a problem hiding this comment.
just to be sure, I assume you checked that this actually does not update the updated_at column? 😅
(I'm a bit confused by how the automatic update is currently implemented... 🫣)
There was a problem hiding this comment.
(I'm a bit confused by how the automatic update is currently implemented... 🫣)
Yes, this is quite opaque as it relies on a trigger behind the scenes.
We could inspect the triggers with the following sql:
SELECT event_object_table, action_order, trigger_name, event_manipulation
FROM information_schema.triggers
WHERE event_object_table = 'crates'
; event_object_table | action_order | trigger_name | event_manipulation
--------------------+--------------+----------------------------------------+--------------------
crates | 1 | insert_crate_downloads_row | INSERT
crates | 1 | trigger_crates_tsvector_update | INSERT
crates | 2 | trigger_ensure_crate_name_not_reserved | INSERT
crates | 1 | trigger_crates_set_updated_at | UPDATE
crates | 2 | trigger_crates_tsvector_update | UPDATE
crates | 3 | trigger_ensure_crate_name_not_reserved | UPDATE
(6 rows)
just to be sure, I assume you checked that this actually does not update the
updated_atcolumn? 😅
And yes, the results of the following, which updates regardless of whether the record exists or not, indicate that no updated_at value is currently current_datetime:
date && psql cargo_registry <<EOF
WITH keywords_with_stopwords as (
SELECT crate_id, keyword
FROM keywords JOIN crates_keywords ON id = keyword_id
WHERE length(to_tsvector('english', keyword)) = 0
), upt as (
UPDATE crates
SET updated_at = updated_at
FROM keywords_with_stopwords
WHERE id = crate_id
-- AND NOT (keyword || ':B')::tsquery @@ textsearchable_index_col
returning id
)
SELECT min(updated_at), max(updated_at), count(*)
FROM crates
WHERE id in (SELECT * from upt)
;
EOF
Tue Jul 23 16:07:26 CST 2024
min | max | count
----------------------------+----------------------------+-------
2015-12-16 00:01:49.263868 | 2024-07-13 01:40:09.733953 | 396
(1 row)| // If the query string is not a stop word, search using `plainto_tsquery(...)`. | ||
| // Else if the it is a valid keyword, search by casting it to `tsquery` with weight B(keyword). | ||
| // Otherwise, search using `null::tsquery`. |
There was a problem hiding this comment.
what happens if the query string consists of multiple terms and only a subset of them are stopwords?
There was a problem hiding this comment.
Let's consider two terms separated by either a dash or a space.
With the following sql:
WITH data AS (
SELECT *
FROM
(
VALUES
('any-one'),
('any one'),
('one-any'),
('one any')
) q(s)
)
SELECT
s,
length(
to_tsvector('english', s)
),
plainto_tsquery('english', s),
CASE WHEN (
(
length(
to_tsvector('english', s)
) != 0
)
) THEN ('plainto_tsquery')
WHEN ('t') THEN ('cast')
ELSE 'null' END
FROM data; s | length | plainto_tsquery | case
---------+--------+-------------------+-----------------
any-one | 2 | 'any-on' & 'one' | plainto_tsquery
any one | 1 | 'one' | plainto_tsquery
one-any | 2 | 'one-ani' & 'one' | plainto_tsquery
one any | 1 | 'one' | plainto_tsquery
(4 rows)
we can see that all searches should be performed using the old method withplainto_tsquery because the length of to_tsvector is not 0. This indicates that the results are either a single non-stopword stem (if separated by a space) or multiple non-stopword stems (if separated by a dash).
There was a problem hiding this comment.
hmm, but doesn't that mean then that the PR only solves the very specific case of searching for a single stopword?
There was a problem hiding this comment.
Unfortunately, yes, I was unaware that keyword: was available in the frontend, and the old issue is still open. 😓
Oh, indeed! Perhaps we should add a help button or document somewhere to inform users about some of these search tricks?
I lean towards just using keyword:. Feel free to close it as it's not planned.
There was a problem hiding this comment.
It's worth noting that if we don't limit the scope to keyword (:B) some stopwords like a, an or others that might frequently appear in the readme of multiple crates could become noise.
There was a problem hiding this comment.
I lean towards just using
keyword:. Feel free to close it as it's not planned.
okay, sorry for the wasted effort :-/
| CREATE OR REPLACE aggregate tsvector_agg (tsvector) ( | ||
| STYPE = pg_catalog.tsvector, | ||
| SFUNC = pg_catalog.tsvector_concat, | ||
| INITCOND = '' | ||
| ); |
There was a problem hiding this comment.
what does this do? could probably use a comment :)
There was a problem hiding this comment.
This is an aggregation function that combines multiple rows of tsvector data into a single tsvector using the tsvector concat operator (||).
e.g.
WITH expected AS (
SELECT
'macro:1'::tsvector || 'any:1'::tsvector AS concat
),
data as (
SELECT *
FROM (
VALUES
('macro:1' :: tsvector),
('any:1' :: tsvector)
) k(tv)
)
SELECT
( SELECT concat FROM expected ),
( SELECT tsvector_agg(tv) FROM data ) AS agg,
( SELECT concat FROM expected ) = (
SELECT tsvector_agg(tv) FROM data
) AS is_eq; concat | agg | is_eq
-------------------+-------------------+-------
'any':2 'macro':1 | 'any':2 'macro':1 | t
(1 row)
Oh, indeed! Perhaps we should add a help button or document somewhere to inform users about some of these search tricks? |
…textsearchable_index_col`
|
Noted: I'm using the fixup commit to avoid interrupting the review process. I expect an autosquash rebase before merging this PR. |
This PR adds support for searching query strings as keywords (weight B) in the
textsearchable_index_colcolumn if it is astopword.Should resolve #1407.
This PR should not impact query performance for
non-stopwordsearches. When searching with astopword, it should maintain similar query performance tonon-stopwordsearches.32.545 msCost:
14684.48..14684.606.187 msCost:
1979.71..1979.8334.072 msCost:
14684.48..14684.6033.486 msCost:
14660.48..14660.60Current non-stopword search
EXPLAIN ANALYZEProposed non-stopword
EXPLAIN ANALYZECurrent stopword search
EXPLAIN ANALYZEProposed stopword search
EXPLAIN ANALYZE