Using only spaces as search tokenizer fails to process words with '-' character #6958

Closed
4 tasks done
galthaus opened this issue Mar 24, 2024 · 10 comments
Labels
resolved by config change: Issue can be mitigated by the reporter

Comments

@galthaus

Context

Our docs use a lot of words with hyphens. Search is very slow when it finds them, and because it tokenizes each part, the system biases the results toward the single words rather than the whole term. We then tried to use only a space-based tokenizer, and nothing matches anymore.

Bug description

Using only a space tokenizer, the system doesn't find words with hyphens in them.
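
To illustrate the difference between the two setups, here is a minimal Python sketch of how a separator regular expression splits text into tokens. The two patterns are assumptions chosen for illustration (whitespace plus hyphen versus whitespace only), and Python's `re.split` only approximates what the search plugin does in the browser.

```python
import re

# Assumed separator patterns, for illustration only.
SPACE_AND_HYPHEN = r"[\s\-]+"  # splits on whitespace and hyphens
SPACE_ONLY = r"\s+"            # splits on whitespace only


def tokenize(text: str, separator: str) -> list[str]:
    """Lowercase the text and split it on the separator pattern."""
    return [token for token in re.split(separator, text.lower()) if token]


text = "Run universal-image-deploy now"
print(tokenize(text, SPACE_AND_HYPHEN))
# ['run', 'universal', 'image', 'deploy', 'now']
print(tokenize(text, SPACE_ONLY))
# ['run', 'universal-image-deploy', 'now']
```

With the whitespace-only pattern the hyphenated name survives as a single token, which is the behavior this report was aiming for.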

Reproduction

9.5.15-search-no-hypen-fails.zip

Steps to reproduce

  1. Serve the docs
  2. Search for universal-image-deploy and find nothing.
  3. Change the search option to the "original"
  4. See that universal-image-deploy is found but not as the highest item. On our system, the performance is 20-30 seconds to actually find the pages.

Browser

Chrome, Edge

@squidfunk
Owner

squidfunk commented Mar 24, 2024

Thanks for reporting. I've run your reproduction and can confirm that when no hyphen is used in the search separator, nothing is found. This is very, very likely related to #6885 (reply in thread) (item 2.) and not fixable at the moment for the reasons stated in that comment. However, you might have noticed that we're working on #6307, which will fix this issue as well. I've also run our latest search preview (#6372) and it fixes the issue, allowing searches with or without -:

[Screenshot, 2024-03-24 11:49:40]

If I switch to what you call the "original" separator in our current implementation, as you mentioned, I can confirm that search works, and I do not observe the item being rendered as the second result:

[Screenshot, 2024-03-24 11:45:10]

Note that we're working heavily on improving search result ranking as well, which should also be better in #6372. Until then, we're considering this issue as resolvable with a configuration (separator) change. You can follow #6307 for updates on the new search implementation, which should fix many, many shortcomings of the current implementation.

@squidfunk added the "resolved by config change" label (Issue can be mitigated by the reporter) on Mar 24, 2024
@squidfunk
Owner

On another note:

On our system, the performance is 20-30 seconds to actually find the pages.

Is the performance the same if you use the search preview (#6372)? How many pages does your documentation have? How long does the build take? Searching should take 20-30ms, not 20-30s, and you can help us understand where this comes from by providing more information and, ideally, a test case. Are your docs public?

@squidfunk
Owner

Alternatively, if you could share the search/search_index.json file that is located in your site directory after building – that would be a tremendous help. It is public anyway if you deploy your site to GitHub Pages. You can just post the link here, as it would help me better understand what the problem is. If you could also provide some searches that lead to suboptimal results on that dataset, that'd be absolutely amazing and of great help ☺️

@galthaus
Author

@squidfunk - Our docs are 200+ pages. We've split the site into two, but it is still a lot. The build of both sites can take 45 minutes. The problem with the "original" search delimiter is not that things aren't found, but that the current mechanism is biased to push down the items that match the "whole" string. So, universal-image-deploy finds universal and image and deploy and image deploy before universal-image-deploy, and that is really annoying. The ordering problem becomes more apparent with lots of pages.

docs.rackn.io is our current site. Search https://docs.rackn.io/stable/ for universal-hardware: it takes about 3 seconds for the preview window to stabilize. I think the longer times are on slower links and maybe on the first search.

We are using an older version because I need to figure out how to get the latest to work. I have hacked our docs to make it work for the reproduction case. The current builds fail on our tree because the tag system now seems to not be able to consume tags with hyphens in them. I'll see if I can make a case for that.

Thanks for your feedback. I'll see if I can try the preview.

@squidfunk
Owner

squidfunk commented Mar 25, 2024

Our docs are 200+ pages. We've split the site into two, but it is still a lot. The build of both sites can take 45 minutes.

45 minutes is definitely unexpected. Material for MkDocs' own documentation has more than 90 pages and takes 4 seconds to build. The slowdown may be caused by some third-party plugin or extension you're using, so it's definitely worth debugging. A good approach is to disable plugins and extensions one by one and see which one is responsible.

The problem with the "original" search delimiter is not that things aren't found, but that the current mechanism is biased to push down the items that match the "whole" string. So, universal-image-deploy finds universal and image and deploy and image deploy before universal-image-deploy, and that is really annoying. The ordering problem becomes more apparent with lots of pages.

Yes, ranking is currently not optimal. The existing implementation is based on BM25, which is not ideal for typeahead. The search preview uses a variant of BM25 giving more weight to consecutive matches, so it might already improve the situation. We're working hard on a new ranking method that does not suffer from the problems of BM25.
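
As a rough illustration of why plain BM25 struggles here, the following Python sketch scores documents as bags of words, so two documents with the same terms in a different order get identical scores. The `consecutive_bonus` function is a hypothetical add-on that rewards query terms appearing next to each other in order. It is not the search preview's actual ranking, and the toy documents are invented for the example.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # commonly used BM25 defaults


def bm25_scores(query: list[str], docs: list[list[str]]) -> list[float]:
    """Score every tokenized document against the query with plain BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + K1 * (1 - B + B * len(doc) / avg_len)
            score += idf * tf[term] * (K1 + 1) / norm
        scores.append(score)
    return scores


def consecutive_bonus(query: list[str], doc: list[str], weight: float = 1.0) -> float:
    """Hypothetical bonus: `weight` if the query terms appear adjacent and in order."""
    n = len(query)
    hit = any(doc[i:i + n] == query for i in range(len(doc) - n + 1))
    return weight if hit else 0.0


query = ["universal", "image", "deploy"]
docs = [
    ["universal", "image", "deploy", "guide"],   # terms adjacent and in order
    ["deploy", "guide", "universal", "image"],   # same terms, different order
    ["unrelated", "page", "about", "networking"],
]
plain = bm25_scores(query, docs)
boosted = [s + consecutive_bonus(query, d) for s, d in zip(plain, docs)]
print(plain)
print(boosted)
```

In this toy corpus the first two documents tie under plain BM25, and only the bonus lifts the document that contains the terms as one consecutive run.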

docs.rackn.io is our current site. Search https://docs.rackn.io/stable/ for universal-hardware: it takes about 3 seconds for the preview window to stabilize. I think the longer times are on slower links and maybe on the first search.

The search feels reasonably snappy to me. Yes, it could be even faster (and the search preview actually should be), but I don't observe that opening the search modal or searching takes 3 seconds. I'll download your search index and check if I somehow run into pathological cases.

We are using an older version because I need to figure out how to get the latest to work. I have hacked our docs to make it work for the reproduction case. The current builds fail on our tree because the tag system now seems to not be able to consume tags with hyphens in them. I'll see if I can make a case for that.

Yup, 9.2.3 is a little old, but there have not been many changes to search, so don't expect too much when upgrading. However, as mentioned, following #6307 is a good idea, since the new implementation will improve the situation. Regardless, it's always a good idea to try to stay updated, since we're iterating fast while trying to keep things as stable as possible.

The current builds fail on our tree because the tag system now seems to not be able to consume tags with hyphens in them.

The tags plugin in Insiders got a complete makeover, as discussed in #6517. If you can narrow the problems down and create a reproduction, we'd be happy if you can create a new bug report so we can fix it ☺️

@galthaus
Author

Sorry. We limit the dev scope to 600 pages to keep build times down. The 600 pages build in about 21.68 seconds. The full scope of generated docs is 6000 pages. That takes a while to build: 1165.73 seconds. It appears that mkdocs is faster with the latest builds. Still not fun, but getting better. I'll play with plugins and get you a repro on the tags issue. Opened an insiders ticket for the tag build issue.

@galthaus
Author

Here is the slower site. It has 6000 pages. Go to https://refs.rackn.io/stable and search using the preview for universal-hardware or universal-discover. It appears to take 20 seconds to stabilize. The latest tree (but not the search rewrite) is faster, but still takes 10 seconds or so to stabilize. It flashes through sequences. My guess is that it is threaded and is processing the keystrokes and bounces. The latest tree does sort better (well, a little). It depends upon the search term.

@squidfunk
Owner

Here is the slower site. It has 6000 pages.

6,000 pages is a whole other level, so it sounds legit that this takes longer. Just as an idea to cut down on build time: you might try to enable navigation pruning, which, depending on how you structured the site, might help in cutting down the size and time of the build, because the navigation plays a large role. Also see #1887 for reference.

Thanks for opening the ticket, we'll look into it.

I'm not surprised. Your search index is 40 MB, so you pretty much reached the end of client-side search, as you're shipping this index to every user. We haven't announced this yet, but we'll likely be offering the ability to provide server-side search and fully integrate it with the search interface in the near future. Additionally, we'll be exploring alternative methods of breaking down the index in order to ship smaller chunks to the user, and not the entire thing. A site of this size is just not suited anymore for full client-side search.
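
One way to picture "shipping smaller chunks" is to group index entries by site section and measure each group. The Python sketch below assumes the index is a dictionary with a `docs` list whose entries carry `location`, `title` and `text` fields. Both that layout and the idea of chunking by the first path segment are assumptions for illustration, not the approach planned for the new search.

```python
import json
from collections import defaultdict


def chunk_by_section(index: dict) -> dict[str, list[dict]]:
    """Group index entries by the first path segment of their location."""
    chunks: dict[str, list[dict]] = defaultdict(list)
    for doc in index["docs"]:
        path = doc["location"].split("#", 1)[0]      # drop the anchor, if any
        section = path.split("/", 1)[0] or "(root)"  # top-level pages go to "(root)"
        chunks[section].append(doc)
    return dict(chunks)


def chunk_sizes(chunks: dict[str, list[dict]]) -> dict[str, int]:
    """Return the serialized size in bytes of every chunk."""
    return {name: len(json.dumps(docs).encode("utf-8")) for name, docs in chunks.items()}


# Invented sample entries; a real index would be loaded with json.load().
index = {
    "docs": [
        {"location": "", "title": "Home", "text": "Welcome"},
        {"location": "#welcome", "title": "Welcome", "text": "Start here"},
        {"location": "reference/params/", "title": "Params", "text": "All parameters"},
        {"location": "reference/params/#universal-hardware", "title": "universal-hardware", "text": "Hardware profile"},
        {"location": "guides/install/", "title": "Install", "text": "How to install"},
    ]
}
chunks = chunk_by_section(index)
print({name: len(docs) for name, docs in chunks.items()})
print(chunk_sizes(chunks))
```

A client could then fetch only the chunk for the section a reader is browsing, at the cost of not searching the other sections until their chunks are loaded.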

To sum up: we are very aware of the problem that search degrades as a site grows, and we will actively address this after we've shipped the first iterations of the new search interface. Our vision is to provide an awesome experience from 1 to 10,000 pages. Please note that this is a pretty big fish to fry, but we're working hard on it.

[Screenshot, 2024-03-26 08:44:35]

Based on this search index, could you share some searches + the results you would expect and how they are sorted? That would allow us to better test it.

@squidfunk
Owner

squidfunk commented Mar 26, 2024

Thanks again for sharing your site. It helps a lot in gravitating towards a better search implementation ☺️

When I run my current prototype on the 40 MB search index of your site, indexing takes around 2-3 seconds and searching takes less than 100ms on average, which includes searching, ranking (please ignore score = 0 in the video below), ordering, highlighting and pagination. It looks very promising and feels quite snappy, given that there are 6,000 documents, each with multiple sections, leading to a total of 16,000 items in the search index.

[Video attachment: Ohne.Titel.mp4]

When entering a few characters, many, many results are returned, which might bury what you actually search for among many similar results. In this case, a scoped search might be a better idea, in order to prune the number of potential results prior to searching by a categorical system like tags or site subsections (Blog, Reference, etc.).
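
A scoped search of this kind can be sketched in a few lines of Python: entries are first filtered by a scope, and only the remaining ones are matched against the query. The entry fields (`location`, `title`, `text`), the use of a location prefix as the scope, and the simple substring matching are all assumptions for illustration. A real implementation might scope by tags instead.

```python
from typing import Optional


def scoped_search(docs: list[dict], query: str, scope: Optional[str] = None) -> list[str]:
    """Return locations of entries containing every query term.

    If `scope` is given, only entries whose location starts with it are searched.
    """
    terms = query.lower().split()
    hits = []
    for doc in docs:
        # Prune by scope before doing any text matching.
        if scope is not None and not doc["location"].startswith(scope):
            continue
        text = f'{doc["title"]} {doc["text"]}'.lower()
        if all(term in text for term in terms):
            hits.append(doc["location"])
    return hits


# Invented sample entries.
docs = [
    {"location": "blog/release/", "title": "Release notes", "text": "universal workflows updated"},
    {"location": "reference/universal-hardware/", "title": "universal-hardware", "text": "Hardware profile parameters"},
    {"location": "reference/universal-discover/", "title": "universal-discover", "text": "Discovery workflow"},
]
print(scoped_search(docs, "universal"))
print(scoped_search(docs, "universal", scope="reference/"))
```

Without a scope the query matches all three entries. Restricting it to `reference/` removes the blog post before any matching happens.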

All of this is currently in movement, and I'll be regularly testing your search index. Please note that a search index with 16,000 items is far, far beyond what we've yet observed in a site, so it might take some time to get this right, but I can assure you that it is on our agenda.

Edit: prior comment said 26,000, but it's 16,000 items and 26,000 distinct terms. Sorry for that. It's, however, still the biggest search index we've seen so far.

@galthaus
Author

Glad it could help. I'll look at navigation pruning.
