Add support for searching stopwords as keywords #9117

Closed
wants to merge 3 commits into from

Conversation

eth3lbert
Contributor

This PR adds support for matching the query string against keywords (weight B) in the textsearchable_index_col column when the query string is a stopword.

Should resolve #1407.


This PR should not impact query performance for non-stopword searches. When searching with a stopword, it should maintain similar query performance to non-stopword searches.

| | non-stopword | stopword |
| --- | --- | --- |
| current | Execution Time: 32.545 ms; Cost: 14684.48..14684.60 | Execution Time: 6.187 ms; Cost: 1979.71..1979.83 |
| proposed | Execution Time: 34.072 ms; Cost: 14684.48..14684.60 | Execution Time: 33.486 ms; Cost: 14660.48..14660.60 |

Current non-stopword search EXPLAIN ANALYZE
Limit  (cost=14684.48..14684.60 rows=10 width=227) (actual time=32.429..32.435 rows=10 loops=1)
  Output: t.id, t.name, t.updated_at, t.created_at, t.description, t.homepage, t.documentation, t.repository, t.max_upload_size, t.max_features, t."?column?", t.downloads, t.downloads_1, t.ts_rank_cd, ($0)
  Buffers: shared hit=3833
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=4589.17..4589.18 rows=1 width=8) (actual time=1.948..1.949 rows=1 loops=1)
          Output: count(*)
          Buffers: shared hit=761
          ->  Bitmap Heap Scan on public.crates crates_1  (cost=2058.13..4587.26 rows=766 width=0) (actual time=1.726..1.935 rows=192 loops=1)
                Recheck Cond: (('''flex'''::tsquery @@ crates_1.textsearchable_index_col) OR (replace(lower((crates_1.name)::text), '-'::text, '_'::text) ~~ '%flex%'::text))
                Heap Blocks: exact=189
                Buffers: shared hit=761
                ->  BitmapOr  (cost=2058.13..2058.13 rows=767 width=0) (actual time=1.712..1.713 rows=0 loops=1)
                      Buffers: shared hit=513
                      ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..1241.64 rows=752 width=0) (actual time=0.546..0.546 rows=134 loops=1)
                            Index Cond: (crates_1.textsearchable_index_col @@ '''flex'''::tsquery)
                            Buffers: shared hit=309
                      ->  Bitmap Index Scan on index_crates_name_tgrm  (cost=0.00..816.11 rows=15 width=0) (actual time=1.165..1.165 rows=84 loops=1)
                            Index Cond: (replace(lower((crates_1.name)::text), '-'::text, '_'::text) ~~ '%flex%'::text)
                            Buffers: shared hit=204
  ->  Subquery Scan on t  (cost=10095.29..10104.60 rows=745 width=227) (actual time=32.428..32.432 rows=10 loops=1)
        Output: t.id, t.name, t.updated_at, t.created_at, t.description, t.homepage, t.documentation, t.repository, t.max_upload_size, t.max_features, t."?column?", t.downloads, t.downloads_1, t.ts_rank_cd, $0
        Buffers: shared hit=3833
        ->  Sort  (cost=10095.29..10097.15 rows=745 width=219) (actual time=30.475..30.477 rows=10 loops=1)
              Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'flex'::text)), crate_downloads.downloads, recent_crate_downloads.downloads, (ts_rank_cd(crates.textsearchable_index_col, '''flex'''::tsquery))
              Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'flex'::text)) DESC, (ts_rank_cd(crates.textsearchable_index_col, '''flex'''::tsquery)) DESC, crates.name
              Sort Method: top-N heapsort  Memory: 30kB
              Buffers: shared hit=3072
              ->  Hash Right Join  (cost=7243.36..10059.75 rows=745 width=219) (actual time=16.221..30.360 rows=192 loops=1)
                    Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, (replace(lower((crates.name)::text), '-'::text, '_'::text) = 'flex'::text), crate_downloads.downloads, recent_crate_downloads.downloads, ts_rank_cd(crates.textsearchable_index_col, '''flex'''::tsquery)
                    Hash Cond: (recent_crate_downloads.crate_id = crates.id)
                    Buffers: shared hit=3061
                    ->  Seq Scan on public.recent_crate_downloads  (cost=0.00..2253.32 rows=146232 width=12) (actual time=0.005..5.630 rows=146232 loops=1)
                          Output: recent_crate_downloads.crate_id, recent_crate_downloads.downloads
                          Buffers: shared hit=791
                    ->  Hash  (cost=7234.05..7234.05 rows=745 width=413) (actual time=16.149..16.150 rows=192 loops=1)
                          Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, crates.textsearchable_index_col, crate_downloads.downloads
                          Buckets: 1024  Batches: 1  Memory Usage: 75kB
                          Buffers: shared hit=1677
                          ->  Hash Join  (cost=4596.83..7234.05 rows=745 width=413) (actual time=3.198..16.079 rows=192 loops=1)
                                Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, crates.textsearchable_index_col, crate_downloads.downloads
                                Inner Unique: true
                                Hash Cond: (crate_downloads.crate_id = crates.id)
                                Buffers: shared hit=1677
                                ->  Seq Scan on public.crate_downloads  (cost=0.00..2253.34 rows=146234 width=12) (actual time=0.003..5.462 rows=146234 loops=1)
                                      Output: crate_downloads.crate_id, crate_downloads.downloads
                                      Buffers: shared hit=791
                                ->  Hash  (cost=4587.26..4587.26 rows=766 width=405) (actual time=3.134..3.135 rows=192 loops=1)
                                      Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, crates.textsearchable_index_col
                                      Buckets: 1024  Batches: 1  Memory Usage: 73kB
                                      Buffers: shared hit=886
                                      ->  Bitmap Heap Scan on public.crates  (cost=2058.13..4587.26 rows=766 width=405) (actual time=2.466..3.072 rows=192 loops=1)
                                            Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, crates.textsearchable_index_col
                                            Recheck Cond: (('''flex'''::tsquery @@ crates.textsearchable_index_col) OR (replace(lower((crates.name)::text), '-'::text, '_'::text) ~~ '%flex%'::text))
                                            Heap Blocks: exact=189
                                            Buffers: shared hit=886
                                            ->  BitmapOr  (cost=2058.13..2058.13 rows=767 width=0) (actual time=2.447..2.447 rows=0 loops=1)
                                                  Buffers: shared hit=513
                                                  ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..1241.64 rows=752 width=0) (actual time=0.995..0.995 rows=134 loops=1)
                                                        Index Cond: (crates.textsearchable_index_col @@ '''flex'''::tsquery)
                                                        Buffers: shared hit=309
                                                  ->  Bitmap Index Scan on index_crates_name_tgrm  (cost=0.00..816.11 rows=15 width=0) (actual time=1.452..1.452 rows=84 loops=1)
                                                        Index Cond: (replace(lower((crates.name)::text), '-'::text, '_'::text) ~~ '%flex%'::text)
                                                        Buffers: shared hit=204
Planning:
  Buffers: shared hit=553
Planning Time: 2.856 ms
Execution Time: 32.545 ms
Proposed non-stopword search EXPLAIN ANALYZE
Limit  (cost=14684.48..14684.60 rows=10 width=227) (actual time=33.943..33.949 rows=10 loops=1)
  Output: t.id, t.name, t.updated_at, t.created_at, t.description, t.homepage, t.documentation, t.repository, t.max_upload_size, t.max_features, t."?column?", t.downloads, t.downloads_1, t.ts_rank_cd, ($0)
  Buffers: shared hit=3833
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=4589.17..4589.18 rows=1 width=8) (actual time=2.301..2.302 rows=1 loops=1)
          Output: count(*)
          Buffers: shared hit=761
          ->  Bitmap Heap Scan on public.crates crates_1  (cost=2058.13..4587.26 rows=766 width=0) (actual time=2.055..2.293 rows=192 loops=1)
                Recheck Cond: (('''flex'''::tsquery @@ crates_1.textsearchable_index_col) OR (replace(lower((crates_1.name)::text), '-'::text, '_'::text) ~~ '%flex%'::text))
                Heap Blocks: exact=189
                Buffers: shared hit=761
                ->  BitmapOr  (cost=2058.13..2058.13 rows=767 width=0) (actual time=2.038..2.038 rows=0 loops=1)
                      Buffers: shared hit=513
                      ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..1241.64 rows=752 width=0) (actual time=0.628..0.628 rows=134 loops=1)
                            Index Cond: (crates_1.textsearchable_index_col @@ '''flex'''::tsquery)
                            Buffers: shared hit=309
                      ->  Bitmap Index Scan on index_crates_name_tgrm  (cost=0.00..816.11 rows=15 width=0) (actual time=1.409..1.410 rows=84 loops=1)
                            Index Cond: (replace(lower((crates_1.name)::text), '-'::text, '_'::text) ~~ '%flex%'::text)
                            Buffers: shared hit=204
  ->  Subquery Scan on t  (cost=10095.29..10104.60 rows=745 width=227) (actual time=33.942..33.946 rows=10 loops=1)
        Output: t.id, t.name, t.updated_at, t.created_at, t.description, t.homepage, t.documentation, t.repository, t.max_upload_size, t.max_features, t."?column?", t.downloads, t.downloads_1, t.ts_rank_cd, $0
        Buffers: shared hit=3833
        ->  Sort  (cost=10095.29..10097.15 rows=745 width=219) (actual time=31.638..31.639 rows=10 loops=1)
              Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'flex'::text)), crate_downloads.downloads, recent_crate_downloads.downloads, (ts_rank_cd(crates.textsearchable_index_col, '''flex'''::tsquery))
              Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'flex'::text)) DESC, (ts_rank_cd(crates.textsearchable_index_col, '''flex'''::tsquery)) DESC, crates.name
              Sort Method: top-N heapsort  Memory: 30kB
              Buffers: shared hit=3072
              ->  Hash Right Join  (cost=7243.36..10059.75 rows=745 width=219) (actual time=16.820..31.514 rows=192 loops=1)
                    Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, (replace(lower((crates.name)::text), '-'::text, '_'::text) = 'flex'::text), crate_downloads.downloads, recent_crate_downloads.downloads, ts_rank_cd(crates.textsearchable_index_col, '''flex'''::tsquery)
                    Hash Cond: (recent_crate_downloads.crate_id = crates.id)
                    Buffers: shared hit=3061
                    ->  Seq Scan on public.recent_crate_downloads  (cost=0.00..2253.32 rows=146232 width=12) (actual time=0.005..5.974 rows=146232 loops=1)
                          Output: recent_crate_downloads.crate_id, recent_crate_downloads.downloads
                          Buffers: shared hit=791
                    ->  Hash  (cost=7234.05..7234.05 rows=745 width=413) (actual time=16.738..16.739 rows=192 loops=1)
                          Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, crates.textsearchable_index_col, crate_downloads.downloads
                          Buckets: 1024  Batches: 1  Memory Usage: 75kB
                          Buffers: shared hit=1677
                          ->  Hash Join  (cost=4596.83..7234.05 rows=745 width=413) (actual time=3.101..16.655 rows=192 loops=1)
                                Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, crates.textsearchable_index_col, crate_downloads.downloads
                                Inner Unique: true
                                Hash Cond: (crate_downloads.crate_id = crates.id)
                                Buffers: shared hit=1677
                                ->  Seq Scan on public.crate_downloads  (cost=0.00..2253.34 rows=146234 width=12) (actual time=0.003..6.126 rows=146234 loops=1)
                                      Output: crate_downloads.crate_id, crate_downloads.downloads
                                      Buffers: shared hit=791
                                ->  Hash  (cost=4587.26..4587.26 rows=766 width=405) (actual time=3.036..3.037 rows=192 loops=1)
                                      Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, crates.textsearchable_index_col
                                      Buckets: 1024  Batches: 1  Memory Usage: 73kB
                                      Buffers: shared hit=886
                                      ->  Bitmap Heap Scan on public.crates  (cost=2058.13..4587.26 rows=766 width=405) (actual time=2.367..2.972 rows=192 loops=1)
                                            Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, crates.textsearchable_index_col
                                            Recheck Cond: (('''flex'''::tsquery @@ crates.textsearchable_index_col) OR (replace(lower((crates.name)::text), '-'::text, '_'::text) ~~ '%flex%'::text))
                                            Heap Blocks: exact=189
                                            Buffers: shared hit=886
                                            ->  BitmapOr  (cost=2058.13..2058.13 rows=767 width=0) (actual time=2.350..2.351 rows=0 loops=1)
                                                  Buffers: shared hit=513
                                                  ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..1241.64 rows=752 width=0) (actual time=0.972..0.972 rows=134 loops=1)
                                                        Index Cond: (crates.textsearchable_index_col @@ '''flex'''::tsquery)
                                                        Buffers: shared hit=309
                                                  ->  Bitmap Index Scan on index_crates_name_tgrm  (cost=0.00..816.11 rows=15 width=0) (actual time=1.378..1.378 rows=84 loops=1)
                                                        Index Cond: (replace(lower((crates.name)::text), '-'::text, '_'::text) ~~ '%flex%'::text)
                                                        Buffers: shared hit=204
Planning:
  Buffers: shared hit=559
Planning Time: 2.408 ms
Execution Time: 34.072 ms
Current stopword search EXPLAIN ANALYZE
Limit  (cost=1979.71..1979.83 rows=10 width=227) (actual time=6.090..6.095 rows=10 loops=1)
  Output: t.id, t.name, t.updated_at, t.created_at, t.description, t.homepage, t.documentation, t.repository, t.max_upload_size, t.max_features, t."?column?", t.downloads, t.downloads_1, t.ts_rank_cd, ($0)
  Buffers: shared hit=4439
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=863.08..863.09 rows=1 width=8) (actual time=1.326..1.326 rows=1 loops=1)
          Output: count(*)
          Buffers: shared hit=1175
          ->  Bitmap Heap Scan on public.crates crates_1  (cost=804.12..863.05 rows=15 width=0) (actual time=0.891..1.316 rows=212 loops=1)
                Recheck Cond: ((''::tsquery @@ crates_1.textsearchable_index_col) OR (replace(lower((crates_1.name)::text), '-'::text, '_'::text) ~~ '%any%'::text))
                Heap Blocks: exact=615
                Buffers: shared hit=1175
                ->  BitmapOr  (cost=804.12..804.12 rows=15 width=0) (actual time=0.835..0.835 rows=0 loops=1)
                      Buffers: shared hit=201
                      ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..0.00 rows=1 width=0) (actual time=0.000..0.000 rows=0 loops=1)
                            Index Cond: (crates_1.textsearchable_index_col @@ ''::tsquery)
                      ->  Bitmap Index Scan on index_crates_name_tgrm  (cost=0.00..804.11 rows=15 width=0) (actual time=0.835..0.835 rows=973 loops=1)
                            Index Cond: (replace(lower((crates_1.name)::text), '-'::text, '_'::text) ~~ '%any%'::text)
                            Buffers: shared hit=201
  ->  Subquery Scan on t  (cost=1116.61..1116.80 rows=15 width=227) (actual time=6.090..6.092 rows=10 loops=1)
        Output: t.id, t.name, t.updated_at, t.created_at, t.description, t.homepage, t.documentation, t.repository, t.max_upload_size, t.max_features, t."?column?", t.downloads, t.downloads_1, t.ts_rank_cd, $0
        Buffers: shared hit=4439
        ->  Sort  (cost=1116.61..1116.65 rows=15 width=219) (actual time=4.762..4.763 rows=10 loops=1)
              Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'any'::text)), crate_downloads.downloads, recent_crate_downloads.downloads, (ts_rank_cd(crates.textsearchable_index_col, ''::tsquery))
              Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'any'::text)) DESC, (ts_rank_cd(crates.textsearchable_index_col, ''::tsquery)) DESC, crates.name
              Sort Method: top-N heapsort  Memory: 29kB
              Buffers: shared hit=3264
              ->  Nested Loop Left Join  (cost=804.96..1116.32 rows=15 width=219) (actual time=1.314..4.622 rows=212 loops=1)
                    Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, (replace(lower((crates.name)::text), '-'::text, '_'::text) = 'any'::text), crate_downloads.downloads, recent_crate_downloads.downloads, ts_rank_cd(crates.textsearchable_index_col, ''::tsquery)
                    Inner Unique: true
                    Buffers: shared hit=3253
                    ->  Nested Loop  (cost=804.54..989.61 rows=15 width=413) (actual time=1.302..3.627 rows=212 loops=1)
                          Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, crates.textsearchable_index_col, crate_downloads.downloads
                          Inner Unique: true
                          Buffers: shared hit=2046
                          ->  Bitmap Heap Scan on public.crates  (cost=804.12..863.05 rows=15 width=405) (actual time=1.291..2.802 rows=212 loops=1)
                                Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.readme, crates.textsearchable_index_col, crates.repository, crates.max_upload_size, crates.max_features
                                Recheck Cond: ((''::tsquery @@ crates.textsearchable_index_col) OR (replace(lower((crates.name)::text), '-'::text, '_'::text) ~~ '%any%'::text))
                                Heap Blocks: exact=615
                                Buffers: shared hit=1198
                                ->  BitmapOr  (cost=804.12..804.12 rows=15 width=0) (actual time=1.187..1.188 rows=0 loops=1)
                                      Buffers: shared hit=201
                                      ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..0.00 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
                                            Index Cond: (crates.textsearchable_index_col @@ ''::tsquery)
                                      ->  Bitmap Index Scan on index_crates_name_tgrm  (cost=0.00..804.11 rows=15 width=0) (actual time=1.185..1.185 rows=973 loops=1)
                                            Index Cond: (replace(lower((crates.name)::text), '-'::text, '_'::text) ~~ '%any%'::text)
                                            Buffers: shared hit=201
                          ->  Index Scan using crate_downloads_pk on public.crate_downloads  (cost=0.42..8.44 rows=1 width=12) (actual time=0.004..0.004 rows=1 loops=212)
                                Output: crate_downloads.crate_id, crate_downloads.downloads
                                Index Cond: (crate_downloads.crate_id = crates.id)
                                Buffers: shared hit=848
                    ->  Index Scan using recent_crate_downloads_crate_id on public.recent_crate_downloads  (cost=0.42..8.44 rows=1 width=12) (actual time=0.003..0.003 rows=1 loops=212)
                          Output: recent_crate_downloads.crate_id, recent_crate_downloads.downloads
                          Index Cond: (recent_crate_downloads.crate_id = crates.id)
                          Buffers: shared hit=848
Planning:
  Buffers: shared hit=553
Planning Time: 2.759 ms
Execution Time: 6.187 ms
Proposed stopword search EXPLAIN ANALYZE
Limit  (cost=14660.48..14660.60 rows=10 width=227) (actual time=33.360..33.368 rows=10 loops=1)
  Output: t.id, t.name, t.updated_at, t.created_at, t.description, t.homepage, t.documentation, t.repository, t.max_upload_size, t.max_features, t."?column?", t.downloads, t.downloads_1, t.ts_rank_cd, ($0)
  Buffers: shared hit=5410
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=4577.17..4577.18 rows=1 width=8) (actual time=2.354..2.355 rows=1 loops=1)
          Output: count(*)
          Buffers: shared hit=1682
          ->  Bitmap Heap Scan on public.crates crates_1  (cost=2046.13..4575.26 rows=766 width=0) (actual time=1.527..2.341 rows=240 loops=1)
                Recheck Cond: (('''any'':B'::tsquery @@ crates_1.textsearchable_index_col) OR (replace(lower((crates_1.name)::text), '-'::text, '_'::text) ~~ '%any%'::text))
                Heap Blocks: exact=742
                Buffers: shared hit=1682
                ->  BitmapOr  (cost=2046.13..2046.13 rows=767 width=0) (actual time=1.471..1.471 rows=0 loops=1)
                      Buffers: shared hit=510
                      ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..1241.64 rows=752 width=0) (actual time=0.603..0.603 rows=1079 loops=1)
                            Index Cond: (crates_1.textsearchable_index_col @@ '''any'':B'::tsquery)
                            Buffers: shared hit=309
                      ->  Bitmap Index Scan on index_crates_name_tgrm  (cost=0.00..804.11 rows=15 width=0) (actual time=0.867..0.867 rows=973 loops=1)
                            Index Cond: (replace(lower((crates_1.name)::text), '-'::text, '_'::text) ~~ '%any%'::text)
                            Buffers: shared hit=201
  ->  Subquery Scan on t  (cost=10083.29..10092.60 rows=745 width=227) (actual time=33.359..33.365 rows=10 loops=1)
        Output: t.id, t.name, t.updated_at, t.created_at, t.description, t.homepage, t.documentation, t.repository, t.max_upload_size, t.max_features, t."?column?", t.downloads, t.downloads_1, t.ts_rank_cd, $0
        Buffers: shared hit=5410
        ->  Sort  (cost=10083.29..10085.15 rows=745 width=219) (actual time=31.002..31.006 rows=10 loops=1)
              Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'any'::text)), crate_downloads.downloads, recent_crate_downloads.downloads, (ts_rank_cd(crates.textsearchable_index_col, '''any'':B'::tsquery))
              Sort Key: ((replace(lower((crates.name)::text), '-'::text, '_'::text) = 'any'::text)) DESC, (ts_rank_cd(crates.textsearchable_index_col, '''any'':B'::tsquery)) DESC, crates.name
              Sort Method: top-N heapsort  Memory: 29kB
              Buffers: shared hit=3728
              ->  Hash Right Join  (cost=7231.36..10047.75 rows=745 width=219) (actual time=16.917..30.868 rows=240 loops=1)
                    Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, (replace(lower((crates.name)::text), '-'::text, '_'::text) = 'any'::text), crate_downloads.downloads, recent_crate_downloads.downloads, ts_rank_cd(crates.textsearchable_index_col, '''any'':B'::tsquery)
                    Hash Cond: (recent_crate_downloads.crate_id = crates.id)
                    Buffers: shared hit=3717
                    ->  Seq Scan on public.recent_crate_downloads  (cost=0.00..2253.32 rows=146232 width=12) (actual time=0.010..5.591 rows=146232 loops=1)
                          Output: recent_crate_downloads.crate_id, recent_crate_downloads.downloads
                          Buffers: shared hit=791
                    ->  Hash  (cost=7222.05..7222.05 rows=745 width=413) (actual time=16.885..16.887 rows=240 loops=1)
                          Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, crates.textsearchable_index_col, crate_downloads.downloads
                          Buckets: 1024  Batches: 1  Memory Usage: 110kB
                          Buffers: shared hit=2496
                          ->  Hash Join  (cost=4584.83..7222.05 rows=745 width=413) (actual time=3.994..16.797 rows=240 loops=1)
                                Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, crates.textsearchable_index_col, crate_downloads.downloads
                                Inner Unique: true
                                Hash Cond: (crate_downloads.crate_id = crates.id)
                                Buffers: shared hit=2496
                                ->  Seq Scan on public.crate_downloads  (cost=0.00..2253.34 rows=146234 width=12) (actual time=0.003..5.398 rows=146234 loops=1)
                                      Output: crate_downloads.crate_id, crate_downloads.downloads
                                      Buffers: shared hit=791
                                ->  Hash  (cost=4575.26..4575.26 rows=766 width=405) (actual time=3.966..3.967 rows=240 loops=1)
                                      Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, crates.textsearchable_index_col
                                      Buckets: 1024  Batches: 1  Memory Usage: 107kB
                                      Buffers: shared hit=1705
                                      ->  Bitmap Heap Scan on public.crates  (cost=2046.13..4575.26 rows=766 width=405) (actual time=2.135..3.883 rows=240 loops=1)
                                            Output: crates.id, crates.name, crates.updated_at, crates.created_at, crates.description, crates.homepage, crates.documentation, crates.repository, crates.max_upload_size, crates.max_features, crates.textsearchable_index_col
                                            Recheck Cond: (('''any'':B'::tsquery @@ crates.textsearchable_index_col) OR (replace(lower((crates.name)::text), '-'::text, '_'::text) ~~ '%any%'::text))
                                            Heap Blocks: exact=742
                                            Buffers: shared hit=1705
                                            ->  BitmapOr  (cost=2046.13..2046.13 rows=767 width=0) (actual time=2.073..2.074 rows=0 loops=1)
                                                  Buffers: shared hit=510
                                                  ->  Bitmap Index Scan on index_crates_name_search  (cost=0.00..1241.64 rows=752 width=0) (actual time=1.000..1.000 rows=1079 loops=1)
                                                        Index Cond: (crates.textsearchable_index_col @@ '''any'':B'::tsquery)
                                                        Buffers: shared hit=309
                                                  ->  Bitmap Index Scan on index_crates_name_tgrm  (cost=0.00..804.11 rows=15 width=0) (actual time=1.072..1.073 rows=973 loops=1)
                                                        Index Cond: (replace(lower((crates.name)::text), '-'::text, '_'::text) ~~ '%any%'::text)
                                                        Buffers: shared hit=201
Planning:
  Buffers: shared hit=562
Planning Time: 2.619 ms
Execution Time: 33.486 ms

Comment on lines +99 to 101
let qs = sql::<TsQuery>("plainto_tsquery('english', ")
.bind::<Text, _>(q_string)
.sql(")");
Contributor Author

Note: This could also be refactored to use plainto_tsquery_with_search_config, which I've implemented upstream in diesel-rs/diesel_full_text_search#41. In #7941 I incorrectly used to_tsquery_with_search_config, which is not equivalent to plainto_tsquery and caused issue #8052. Since carol has concerns about it, I haven't changed it in this PR.

Comment on lines +106 to +109
length(to_tsvector_with_search_config::<Text, _, _>(
TsConfigurationByName("english"),
q_string,
))
Contributor Author

We could also construct the SQL ourselves if we have concerns about it.
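
For context, the length check in the snippet above relies on PostgreSQL's to_tsvector dropping stopwords, so a vector of length zero means the query string consisted only of stopwords. A rough Rust sketch of that decision follows. The STOPWORDS list and both function names are hypothetical stand-ins for the database's english configuration, not code from this PR, and stemming, punctuation handling and lexeme de-duplication are not modelled.

```rust
/// Toy stand-in for the stopword list of PostgreSQL's `english` configuration.
/// Assumption: the real list is longer and lives in the database.
const STOPWORDS: &[&str] = &["a", "an", "and", "any", "of", "the", "to"];

/// Roughly mirrors `length(to_tsvector('english', q))`: counts the
/// whitespace-separated words that survive stopword removal.
fn lexeme_count(q: &str) -> usize {
    q.split_whitespace()
        .map(|w| w.to_lowercase())
        .filter(|w| !STOPWORDS.iter().any(|s| *s == w.as_str()))
        .count()
}

/// True when every word of the query would be discarded as a stopword,
/// which is the case where a keyword (weight B) match could be tried instead.
/// An empty query also counts as stopword-only under this definition.
fn is_stopword_only(q: &str) -> bool {
    lexeme_count(q) == 0
}
```

With this toy list, "any" is treated as stopword-only while "flex" is not, matching the two queries used in the EXPLAIN ANALYZE runs above.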

@eth3lbert
Contributor Author

r? @Turbo87 @LawnGnome

codecov bot commented Jul 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.24%. Comparing base (01a8ec8) to head (5d6f310).
Report is 20 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9117      +/-   ##
==========================================
+ Coverage   89.20%   89.24%   +0.04%     
==========================================
  Files         282      282              
  Lines       28513    28610      +97     
==========================================
+ Hits        25435    25534      +99     
+ Misses       3078     3076       -2     


@eth3lbert
Contributor Author

Additionally, while implementing this, I discovered a potential opportunity to improve the query speed of the search query. However, it would require leveraging a CTE clause, which necessitates a significant rewrite of the current implementation. I might consider refactoring this at a later time.

-- UPDATE crates
-- SET updated_at = updated_at
-- FROM keywords_with_stopwords
-- WHERE id = crate_id AND NOT (keyword || ':B')::tsquery @@ textsearchable_index_col
Contributor Author

Since our keyword doesn't contain spaces, it's safe to directly cast it to a tsquery.

@Turbo87 added the C-enhancement ✨ (Category: Adding new behavior or a change to the way an existing feature works) and A-backend ⚙️ labels on Jul 19, 2024
@eth3lbert
Contributor Author

In addition to searching the query string as a keyword directly (as implemented in this PR), we could also consider other approaches. These might include a github-style syntax with explicit keywords (e.g., keyword:qs) or a stackoverflow-style syntax using brackets (e.g., [qs]), or something more suitable. All of these approaches would require additional parsing steps, however.

@Turbo87
Member

Turbo87 commented Jul 22, 2024

we could also consider other approaches. These might include a github-style syntax with explicit keywords (e.g., keyword:qs)

that is actually already supported, though only on the frontend, which transforms the keyword: prefix to a query param on the search query.
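
For illustration only, here is a rough sketch of what "transforming the keyword: prefix to a query param" could look like. It is written in Rust to match the backend snippets in this PR and is not the frontend's actual code; `ParsedSearch` and `parse_search` are hypothetical names, and the assumption is that tokens are separated by whitespace.

```rust
/// Hypothetical result of splitting a search box value into free text
/// and explicit keyword filters.
#[derive(Debug, Default, PartialEq)]
struct ParsedSearch {
    /// Remaining free-text query (would become the `q` param).
    q: String,
    /// Values taken from `keyword:<value>` tokens.
    keywords: Vec<String>,
}

/// Splits the input on whitespace and moves every `keyword:<value>` token
/// into `keywords`; everything else stays in the free-text query.
fn parse_search(input: &str) -> ParsedSearch {
    let mut parsed = ParsedSearch::default();
    let mut free_text = Vec::new();
    for token in input.split_whitespace() {
        match token.strip_prefix("keyword:") {
            // A bare `keyword:` with no value is kept as free text.
            Some(kw) if !kw.is_empty() => parsed.keywords.push(kw.to_string()),
            _ => free_text.push(token),
        }
    }
    parsed.q = free_text.join(" ");
    parsed
}
```

With this sketch, `parse_search("keyword:any macro")` yields the keyword `any` plus the free-text query `macro`, so the stopword never has to pass through the text search parser.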

-- WHERE length(to_tsvector('english', keyword)) = 0
-- )
-- UPDATE crates
-- SET updated_at = updated_at
Member

do we not have to set any other columns? 🤔

Contributor Author

No, because we don't want to modify the updated_at value. We only need to trigger trigger_crates_tsvector_update to update the textsearchable_index_col column, and I think this is sufficient.

Member

just to be sure, I assume you checked that this actually does not update the updated_at column? 😅

(I'm a bit confused by how the automatic update is currently implemented... 🫣)

Contributor Author

(I'm a bit confused by how the automatic update is currently implemented... 🫣)

Yes, this is quite opaque as it relies on a trigger behind the scenes.
We could inspect the triggers with the following sql:

SELECT event_object_table, action_order, trigger_name, event_manipulation
FROM information_schema.triggers
WHERE event_object_table = 'crates'
;
 event_object_table | action_order |              trigger_name              | event_manipulation 
--------------------+--------------+----------------------------------------+--------------------
 crates             |            1 | insert_crate_downloads_row             | INSERT
 crates             |            1 | trigger_crates_tsvector_update         | INSERT
 crates             |            2 | trigger_ensure_crate_name_not_reserved | INSERT
 crates             |            1 | trigger_crates_set_updated_at          | UPDATE
 crates             |            2 | trigger_crates_tsvector_update         | UPDATE
 crates             |            3 | trigger_ensure_crate_name_not_reserved | UPDATE
(6 rows)

just to be sure, I assume you checked that this actually does not update the updated_at column? 😅

And yes. The following query updates every matching crate regardless of whether the keyword is already present in textsearchable_index_col (the tsquery filter is commented out), and its results show that no updated_at value was set to the current time:

date && psql cargo_registry <<EOF
WITH keywords_with_stopwords as (
        SELECT crate_id, keyword
        FROM keywords JOIN crates_keywords ON id = keyword_id
        WHERE length(to_tsvector('english', keyword)) = 0
), upt as (
  UPDATE crates
  SET updated_at = updated_at
  FROM keywords_with_stopwords
  WHERE id = crate_id
  -- AND NOT (keyword || ':B')::tsquery @@ textsearchable_index_col
  returning id
)
SELECT min(updated_at), max(updated_at), count(*)
FROM crates
WHERE id in (SELECT * from upt)
;

EOF
Tue Jul 23 16:07:26 CST 2024
            min             |            max             | count 
----------------------------+----------------------------+-------
 2015-12-16 00:01:49.263868 | 2024-07-13 01:40:09.733953 |   396
(1 row)

Comment on lines +96 to +98
// If the query string is not a stop word, search using `plainto_tsquery(...)`.
// Else if it is a valid keyword, search by casting it to `tsquery` with weight B (keyword).
// Otherwise, search using `null::tsquery`.
Member

what happens if the query string consists of multiple terms and only a subset of them are stopwords?

Contributor Author

Let's consider two terms separated by either a dash or a space.
With the following sql:

WITH data AS (    
  SELECT *
  FROM                                                
    (
      VALUES
        ('any-one'),
        ('any one'),
        ('one-any'),
        ('one any')           
    ) q(s)                 
)
SELECT
  s,                              
  length(                                      
    to_tsvector('english', s)        
  ),                                 
  plainto_tsquery('english', s),
  CASE WHEN (
    (
      length(
        to_tsvector('english', s)
      ) != 0
    )
  ) THEN ('plainto_tsquery')
  WHEN ('t') THEN ('cast')
  ELSE 'null' END
FROM  data;
    s    | length |  plainto_tsquery  |      case       
---------+--------+-------------------+-----------------
 any-one |      2 | 'any-on' & 'one'  | plainto_tsquery
 any one |      1 | 'one'             | plainto_tsquery
 one-any |      2 | 'one-ani' & 'one' | plainto_tsquery
 one any |      1 | 'one'             | plainto_tsquery
(4 rows)

We can see that all four searches are performed using the old method with plainto_tsquery, because the length of to_tsvector is not 0. The resulting query is either a single non-stopword stem (if the terms are separated by a space) or multiple non-stopword stems (if they are separated by a dash).
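
To make the three-way branch easier to follow, here is a minimal, std-only Rust sketch of the decision. It is not the Diesel code from this PR: `STOPWORDS` is a tiny illustrative subset of the english stop word list, `has_searchable_term` is a crude stand-in for `length(to_tsvector('english', q)) != 0`, and `looks_like_keyword` is a hypothetical placeholder for the real keyword validation.

```rust
/// Tiny, illustrative subset of the `english` stop word list (assumption).
const STOPWORDS: &[&str] = &["a", "an", "any", "the"];

/// Which kind of tsquery the search would be built from.
#[derive(Debug, PartialEq)]
enum SearchQuery {
    /// `plainto_tsquery('english', q)`
    PlainToTsquery(String),
    /// `(q || ':B')::tsquery`, i.e. match keywords only
    KeywordCast(String),
    /// `null::tsquery`
    Null,
}

fn is_stopword(token: &str) -> bool {
    STOPWORDS.iter().any(|s| s.eq_ignore_ascii_case(token))
}

/// Crude stand-in for `length(to_tsvector('english', q)) != 0`.
/// A dashed compound such as `any-one` is treated as one token that is
/// not itself a stopword, mirroring the compound lexeme in the table above.
fn has_searchable_term(q: &str) -> bool {
    q.split_whitespace().any(|t| !is_stopword(t))
}

/// Hypothetical keyword check: non-empty and free of whitespace.
fn looks_like_keyword(q: &str) -> bool {
    !q.is_empty() && !q.chars().any(char::is_whitespace)
}

fn choose_query(q: &str) -> SearchQuery {
    if has_searchable_term(q) {
        SearchQuery::PlainToTsquery(q.to_string())
    } else if looks_like_keyword(q) {
        SearchQuery::KeywordCast(format!("{}:B", q))
    } else {
        SearchQuery::Null
    }
}
```

Under these assumptions, `any-one`, `any one`, `one-any` and `one any` all take the plainto_tsquery branch, a lone `any` becomes the `any:B` cast, and a string made only of space-separated stopwords such as `the a` falls through to null.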

Member

hmm, but doesn't that mean then that the PR only solves the very specific case of searching for a single stopword?

Contributor Author

Unfortunately, yes, I was unaware that keyword: was available in the frontend, and the old issue is still open. 😓


#9117 (comment)

Oh, indeed! Perhaps we should add a help button or some documentation to inform users about these search tricks?

I lean towards just using keyword:. Feel free to close it as it's not planned.

Contributor Author

It's worth noting that if we don't limit the scope to keywords (:B), stopwords such as a or an, which are likely to appear in the readme of many crates, could become noise.

Member

I lean towards just using keyword:. Feel free to close it as it's not planned.

okay, sorry for the wasted effort :-/

Contributor Author

No worries 🙂‍↔️

Comment on lines +1 to +5
CREATE OR REPLACE aggregate tsvector_agg (tsvector) (
STYPE = pg_catalog.tsvector,
SFUNC = pg_catalog.tsvector_concat,
INITCOND = ''
);
Member

what does this do? could probably use a comment :)

Contributor Author

This is an aggregation function that combines multiple rows of tsvector data into a single tsvector using the tsvector concat operator (||).

e.g.

WITH expected AS (
  SELECT
    'macro:1'::tsvector || 'any:1'::tsvector AS concat
),
data as (
  SELECT *
  FROM (
    VALUES
      ('macro:1' :: tsvector),
      ('any:1' :: tsvector)
  ) k(tv)
)
SELECT
  ( SELECT concat FROM expected ),
  ( SELECT tsvector_agg(tv) FROM data ) AS agg,
  ( SELECT concat FROM expected ) = (
    SELECT tsvector_agg(tv) FROM data
  ) AS is_eq;
      concat       |        agg        | is_eq
-------------------+-------------------+-------
 'any':2 'macro':1 | 'any':2 'macro':1 | t
(1 row)
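
The `any` lexeme ends up at position 2 because the concat operator shifts the positions of its right-hand operand by the largest position found in the left-hand one. As a rough model of that behavior (not PostgreSQL's implementation), the aggregate can be sketched in Rust as a fold over a simplified tsvector. The `TsVector` alias and the `concat`, `tsvector_agg` and `single` helpers are illustrative names, and weights are ignored.

```rust
use std::collections::BTreeMap;

/// Simplified tsvector: lexeme -> sorted positions (weights are ignored).
type TsVector = BTreeMap<String, Vec<u32>>;

/// Models `left || right`: positions from `right` are offset by the
/// largest position mentioned in `left`.
fn concat(left: &TsVector, right: &TsVector) -> TsVector {
    let offset = left.values().flatten().copied().max().unwrap_or(0);
    let mut out = left.clone();
    for (lexeme, positions) in right {
        let entry = out.entry(lexeme.clone()).or_default();
        entry.extend(positions.iter().map(|p| p + offset));
        entry.sort_unstable();
        entry.dedup();
    }
    out
}

/// Models the aggregate: start from the empty vector (INITCOND = '')
/// and apply the concat step (SFUNC) once per row.
fn tsvector_agg<'a>(rows: impl IntoIterator<Item = &'a TsVector>) -> TsVector {
    rows.into_iter()
        .fold(TsVector::new(), |acc, row| concat(&acc, row))
}

/// Convenience constructor for a one-lexeme vector such as 'macro:1'.
fn single(lexeme: &str, pos: u32) -> TsVector {
    TsVector::from([(lexeme.to_string(), vec![pos])])
}
```

Folding `single("macro", 1)` and `single("any", 1)` with `tsvector_agg` gives the same result as `concat` on the two vectors, which is the equality that the `is_eq` column checks in the SQL above.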

@eth3lbert
Contributor Author

that is actually already supported, though only on the frontend, which transforms the keyword: prefix to a query param on the search query.

Oh, indeed! Perhaps we should add a help button or some documentation to inform users about these search tricks?

@eth3lbert
Contributor Author

Note: I'm using fixup commits to avoid interrupting the review process. I expect an autosquash rebase before merging this PR.

@Turbo87 Turbo87 closed this Jul 24, 2024
@eth3lbert eth3lbert deleted the search-stopword branch July 24, 2024 07:47
Development

Successfully merging this pull request may close these issues.

Searching for "any" can't find "mopa" crate.
2 participants