-
Notifications
You must be signed in to change notification settings - Fork 637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for searching stopwords as keywords #9117
Conversation
let qs = sql::<TsQuery>("plainto_tsquery('english', ") | ||
.bind::<Text, _>(q_string) | ||
.sql(")"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: This could also be refactored to plainto_tsquery_with_search_config
, which I've implemented upstream in diesel-rs/diesel_full_text_search#41. I incorrectly used to_tsquery_with_search_config
in the #7941 , which is not equivalent to plainto_tsquery
and caused an issue #8052 . And since carol has concerns about it so I haven't modified it in this PR.
length(to_tsvector_with_search_config::<Text, _, _>( | ||
TsConfigurationByName("english"), | ||
q_string, | ||
)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also construct the sql ourselves if we have concerns about it.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #9117 +/- ##
==========================================
+ Coverage 89.20% 89.24% +0.04%
==========================================
Files 282 282
Lines 28513 28610 +97
==========================================
+ Hits 25435 25534 +99
+ Misses 3078 3076 -2 ☔ View full report in Codecov by Sentry. |
Additionally, while implementing this, I discovered a potential opportunity to improve the query speed of the search query. However, it would require leveraging a |
-- UPDATE crates | ||
-- SET updated_at = updated_at | ||
-- FROM keywords_with_stopwords | ||
-- WHERE id = crate_id AND NOT (keyword || ':B')::tsquery @@ textsearchable_index_col |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since our keyword doesn't contain spaces, it's safe to directly cast it to a tsquery
.
In addition to searching the query string as a |
that is actually already supported, though only on the frontend, which transforms the |
-- WHERE length(to_tsvector('english', keyword)) = 0 | ||
-- ) | ||
-- UPDATE crates | ||
-- SET updated_at = updated_at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we not have to set any other columns? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because we don't want to modify the updated_at
value. We only need to trigger trigger_crates_tsvector_update
to update the textsearchable_index_col
column, and I think this is sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to be sure, I assume you checked that this actually does not update the updated_at
column? 😅
(I'm a bit confused by how the automatic update is currently implemented... 🫣)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I'm a bit confused by how the automatic update is currently implemented... 🫣)
Yes, this is quite opaque as it relies on a trigger behind the scenes.
We could inspect the triggers with the following sql:
SELECT event_object_table, action_order, trigger_name, event_manipulation
FROM information_schema.triggers
WHERE event_object_table = 'crates'
;
event_object_table | action_order | trigger_name | event_manipulation
--------------------+--------------+----------------------------------------+--------------------
crates | 1 | insert_crate_downloads_row | INSERT
crates | 1 | trigger_crates_tsvector_update | INSERT
crates | 2 | trigger_ensure_crate_name_not_reserved | INSERT
crates | 1 | trigger_crates_set_updated_at | UPDATE
crates | 2 | trigger_crates_tsvector_update | UPDATE
crates | 3 | trigger_ensure_crate_name_not_reserved | UPDATE
(6 rows)
just to be sure, I assume you checked that this actually does not update the
updated_at
column? 😅
And yes, the results of the following, which updates regardless of whether the record exists or not, indicate that no updated_at
value is currently current_datetime
:
date && psql cargo_registry <<EOF
WITH keywords_with_stopwords as (
SELECT crate_id, keyword
FROM keywords JOIN crates_keywords ON id = keyword_id
WHERE length(to_tsvector('english', keyword)) = 0
), upt as (
UPDATE crates
SET updated_at = updated_at
FROM keywords_with_stopwords
WHERE id = crate_id
-- AND NOT (keyword || ':B')::tsquery @@ textsearchable_index_col
returning id
)
SELECT min(updated_at), max(updated_at), count(*)
FROM crates
WHERE id in (SELECT * from upt)
;
EOF
Tue Jul 23 16:07:26 CST 2024
min | max | count
----------------------------+----------------------------+-------
2015-12-16 00:01:49.263868 | 2024-07-13 01:40:09.733953 | 396
(1 row)
// If the query string is not a stop word, search using `plainto_tsquery(...)`. | ||
// Else if the it is a valid keyword, search by casting it to `tsquery` with weight B(keyword). | ||
// Otherwise, search using `null::tsquery`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if the query string consists of multiple terms and only a subset of them are stopwords?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's consider two terms separated by either a dash or a space.
With the following sql:
WITH data AS (
SELECT *
FROM
(
VALUES
('any-one'),
('any one'),
('one-any'),
('one any')
) q(s)
)
SELECT
s,
length(
to_tsvector('english', s)
),
plainto_tsquery('english', s),
CASE WHEN (
(
length(
to_tsvector('english', s)
) != 0
)
) THEN ('plainto_tsquery')
WHEN ('t') THEN ('cast')
ELSE 'null' END
FROM data;
s | length | plainto_tsquery | case
---------+--------+-------------------+-----------------
any-one | 2 | 'any-on' & 'one' | plainto_tsquery
any one | 1 | 'one' | plainto_tsquery
one-any | 2 | 'one-ani' & 'one' | plainto_tsquery
one any | 1 | 'one' | plainto_tsquery
(4 rows)
we can see that all searches should be performed using the old method withplainto_tsquery
because the length
of to_tsvector
is not 0
. This indicates that the results are either a single non-stopword stem (if separated by a space) or multiple non-stopword stems (if separated by a dash).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, but doesn't that mean then that the PR only solves the very specific case of searching for a single stopword?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, yes, I was unaware that keyword:
was available in the frontend, and the old issue is still open. 😓
Oh, indeed! Perhaps we should add a help button or document somewhere to inform users about some of these search tricks?
I lean towards just using keyword:
. Feel free to close it as it's not planned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth noting that if we don't limit the scope to keyword (:B
) some stopwords like a
, an
or others that might frequently appear in the readme
of multiple crates could become noise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I lean towards just using
keyword:
. Feel free to close it as it's not planned.
okay, sorry for the wasted effort :-/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries 🙂
CREATE OR REPLACE aggregate tsvector_agg (tsvector) ( | ||
STYPE = pg_catalog.tsvector, | ||
SFUNC = pg_catalog.tsvector_concat, | ||
INITCOND = '' | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does this do? could probably use a comment :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an aggregation function that combines multiple rows of tsvector
data into a single tsvector
using the tsvector
concat
operator (||
).
e.g.
WITH expected AS (
SELECT
'macro:1'::tsvector || 'any:1'::tsvector AS concat
),
data as (
SELECT *
FROM (
VALUES
('macro:1' :: tsvector),
('any:1' :: tsvector)
) k(tv)
)
SELECT
( SELECT concat FROM expected ),
( SELECT tsvector_agg(tv) FROM data ) AS agg,
( SELECT concat FROM expected ) = (
SELECT tsvector_agg(tv) FROM data
) AS is_eq;
concat | agg | is_eq
-------------------+-------------------+-------
'any':2 'macro':1 | 'any':2 'macro':1 | t
(1 row)
Oh, indeed! Perhaps we should add a help button or document somewhere to inform users about some of these search tricks? |
…textsearchable_index_col`
Noted: I'm using the fixup commit to avoid interrupting the review process. I expect an autosquash rebase before merging this PR. |
This PR adds support for searching query strings as keywords (weight B) in the
textsearchable_index_col
column if it is astopword
.Should resolve #1407.
This PR should not impact query performance for
non-stopword
searches. When searching with astopword
, it should maintain similar query performance tonon-stopword
searches.32.545 ms
Cost:
14684.48..14684.60
6.187 ms
Cost:
1979.71..1979.83
34.072 ms
Cost:
14684.48..14684.60
33.486 ms
Cost:
14660.48..14660.60
Current non-stopword search
EXPLAIN ANALYZE
Proposed non-stopword
EXPLAIN ANALYZE
Current stopword search
EXPLAIN ANALYZE
Proposed stopword search
EXPLAIN ANALYZE