lucene-monitor: make abstract DocumentBatch public #13993

Open
wants to merge 3 commits into
base: main

Conversation

cpoerschke
Contributor

@cpoerschke cpoerschke commented Nov 13, 2024

Description

The static DocumentBatch.of methods are already public. Making the class itself public as well would allow applications (e.g. @kotman12's apache/solr#2382) to refer to the class, for example in a visitor.
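
To illustrate the visibility gap described above, here is a minimal single-file sketch. It is not Lucene code: `BatchLibrarySketch`, `Batch`, and `BatchClientSketch` are hypothetical names, and a private nested class is used only to mimic, within one file, a type that callers can obtain from a factory but cannot name.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the library side (not the real monitor classes).
final class BatchLibrarySketch {
  private BatchLibrarySketch() {}

  // Plays the role of the non-public abstract class: callers cannot name it.
  private abstract static class Batch {
    abstract int size();
  }

  // Plays the role of the static factory methods that callers can already use.
  static Batch of(String... docs) {
    final List<String> copy = Arrays.asList(docs.clone());
    return new Batch() {
      @Override
      int size() {
        return copy.size();
      }

      @Override
      public String toString() {
        return "Batch" + copy;
      }
    };
  }
}

// Hypothetical stand-in for application code.
final class BatchClientSketch {
  private BatchClientSketch() {}

  static String describe() {
    // Allowed: the factory is callable and the result can be held as Object.
    Object batch = BatchLibrarySketch.of("doc1", "doc2");

    // Not allowed (does not compile): the type cannot be named here, so it
    // cannot appear in a field, a parameter, or a visitor method signature.
    // BatchLibrarySketch.Batch typed = BatchLibrarySketch.of("doc1");

    return batch.toString();
  }
}
```

In this sketch the client can create a batch but can only treat it as an `Object`, which is the kind of limitation that widening the class visibility would remove.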

@cpoerschke cpoerschke marked this pull request as ready for review November 13, 2024 19:26
@romseygeek
Contributor

DocumentBatch is really an implementation detail of the Monitor, so I'm not sure why client code would need to refer to it. The linked Solr PR is quite big, so it's difficult to see where this would be helpful. Could you explain in a bit more detail why it's necessary?

@kotman12
Contributor

kotman12 commented Nov 14, 2024

Author of the linked Solr PR here. The current monitor API makes it really hard to integrate into anything more "custom" like Solr, because it tightly seals relevant implementation details such as the cache and the index and makes opinionated choices about them: the cache always caches everything and is a naive hash map, the index is not exposed and can't be passed in, and the indexing and search flows both go through coarse synchronization that causes blocking.

IMO the simple CRUD wrapper that the monitor exposes today isn't amenable to integration into something like Solr (maybe that's also part of the reason why Elastic doesn't use it for the percolator?). TL;DR: the PR takes the most useful parts of the monitor and uses them directly, but relies on the visitor pattern. We'd like to avoid that. Is there a better way of changing the API to achieve the same thing? I welcome other thoughts and suggestions.

@romseygeek
Contributor

I'm definitely up for making this more extensible. We already have two separate QueryIndex implementations, so maybe that's the best place to start? Caching of parsed queries is pretty tightly coupled to how the query index is implemented, so separating them out might be trickier, but allowing clients to use their own QueryIndex implementations would allow some experimentation there.

@kotman12
Contributor

> Caching of parsed queries is pretty tightly coupled to how the query index is implemented so separating them out might be trickier

The current "Solr monitor" PR avoids this problem entirely by not using the monitor and its internal caches at all. Really, the best part of the monitor module is the presearcher with its various optimizations. One of the problems is that to get to it you have to go through the coarse synchronization, which exists to update the naive cache and the index at the same time. The linked PR does no such synchronization and manages its own Caffeine cache, which is updated optimistically. Versioning is used to make sure the index matches the cache; if it does not, the cache is updated (assuming the cache version is lower).
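
As a rough illustration of the versioned, optimistically updated cache described above, here is a minimal sketch. It is not the code from the linked PR: a plain `ConcurrentHashMap` stands in for the Caffeine cache (so there is no eviction), and `VersionedQueryCacheSketch`, `get`, `cachedVersion`, and the per-query `indexVersion` parameter are all hypothetical names and assumptions.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: a cache of parsed queries keyed by query id, where each
// entry remembers the index version it was parsed from.
final class VersionedQueryCacheSketch<Q> {

  private static final class Entry<Q> {
    final long version;
    final Q query;

    Entry(long version, Q query) {
      this.version = version;
      this.query = query;
    }
  }

  // Stand-in for the Caffeine cache mentioned in the comment above.
  private final ConcurrentHashMap<String, Entry<Q>> entries = new ConcurrentHashMap<>();

  // Returns the parsed query that corresponds to indexVersion. No lock is held
  // across parsing; only the final per-key merge is atomic.
  Q get(String queryId, long indexVersion, Supplier<Q> parseFromIndex) {
    Entry<Q> cached = entries.get(queryId);
    if (cached != null && cached.version == indexVersion) {
      return cached.query; // cache and index agree
    }
    Q parsed = parseFromIndex.get();
    Entry<Q> fresh = new Entry<>(indexVersion, parsed);
    // Optimistic update: replace the cached entry only if its version is lower.
    entries.merge(
        queryId, fresh, (old, candidate) -> candidate.version > old.version ? candidate : old);
    return parsed;
  }

  // Returns the cached version for a query id, or -1 if nothing is cached.
  long cachedVersion(String queryId) {
    Entry<Q> cached = entries.get(queryId);
    return cached == null ? -1L : cached.version;
  }
}
```

In this sketch a reader that sees an older index version than the cache still gets a correctly parsed query for its own version, but it does not overwrite the newer cache entry.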

@cpoerschke
Contributor Author

Opened #13995 as an alternative, i.e. changing the visibility of the DocumentBatch.of methods to match that of the class. WDYT?

@kotman12
Contributor

Following up briefly: I just don't see a lot of use for a QueryIndex or a Monitor in a more "advanced" set-up such as Solr. The QueryIndex API is quite large and most of its operations aren't really necessary. The useful bits are the QueryTermFilter (and perhaps DataValues). For the common use case that doesn't have to worry about scale or replication, the Monitor API works fine. But when adapting it to something more advanced like Solr, trying to shoehorn in QueryIndex seems like a large undertaking for not much gain (at least I don't perceive much). You'd have to bypass or amend SolrIndexSearcher for a workflow that really isn't all that different from regular search. The main place where it deviates is the expensive post-matching step, for which Solr already has abstractions.
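
For readers unfamiliar with the workflow being discussed, here is a minimal sketch of the two phases: a cheap term-based filter that narrows down candidate queries, followed by the expensive post-matching step. It is a simplification under stated assumptions, not how the monitor module or Solr implements it: candidates are found by a linear scan rather than an index search, and `TwoPhaseMatchSketch`, `StoredQuery`, `filterTerms`, and `fullMatch` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch of "filter cheaply, then verify expensively".
final class TwoPhaseMatchSketch {
  private TwoPhaseMatchSketch() {}

  static final class StoredQuery {
    final String id;
    final Set<String> filterTerms; // terms used only for the cheap first phase
    final Predicate<String> fullMatch; // stands in for running the real query

    StoredQuery(String id, Set<String> filterTerms, Predicate<String> fullMatch) {
      this.id = id;
      this.filterTerms = filterTerms;
      this.fullMatch = fullMatch;
    }
  }

  // Returns the ids of the stored queries that match the document.
  static List<String> match(String document, List<StoredQuery> queries) {
    Set<String> docTerms =
        new HashSet<>(Arrays.asList(document.toLowerCase(Locale.ROOT).split("\\s+")));
    List<String> matches = new ArrayList<>();
    for (StoredQuery query : queries) {
      // Phase 1: skip queries that share no term with the document.
      if (Collections.disjoint(query.filterTerms, docTerms)) {
        continue;
      }
      // Phase 2: the expensive check runs only for the surviving candidates.
      if (query.fullMatch.test(document)) {
        matches.add(query.id);
      }
    }
    return matches;
  }
}
```

The first phase resembles an ordinary search over stored queries, which is why only the second phase needs anything beyond a regular search workflow in this sketch.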


3 participants