
SOLR-4587: integrate lucene-monitor into solr #2382

Draft · wants to merge 62 commits into main
Conversation


@kotman12 kotman12 commented Apr 1, 2024

https://issues.apache.org/jira/browse/SOLR-4587

Description

This module hopes to simplify distributing and scaling query indexes for monitoring and alerting workflows (also known as reverse search) by providing a bridge between a Solr-managed search index and lucene-monitor's efficient reverse-search algorithms.

Here is some evidence that the community might find this useful.

  1. Blog-post that partly inspired the current approach
  2. Users asking about a percolator-like feature on stackoverflow.
  3. Someone contributed this extension, but it doesn't really provide percolator-like functionality, and because it wasn't upstreamed it fell out of maintenance.
  4. Plug for my own question on the issue!

Solution

This is still a WiP but I am opening it up as a PR to get community feedback. The current approach is to ingest queries as Solr documents, decompose them for performance, and then use the child-document feature to index the decomposed subqueries under one atomic parent document block. On the search side, the latest approach is to use a dedicated component that creates hooks into lucene-monitor's Presearcher, QueryTermFilter and CandidateMatcher.
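To illustrate the decomposition, a disjunctive monitor query might be indexed as an atomic parent/child block along these lines (a hypothetical sketch: `_childDocuments_` is Solr's standard JSON child-document key, but the monitor field names here are invented and the actual schema is still in flux):

```json
{
  "id": "alert-42",
  "_monitor_query_": "title:(solr OR lucene)",
  "_childDocuments_": [
    { "id": "alert-42.0", "_monitor_disjunct_": "title:solr" },
    { "id": "alert-42.1", "_monitor_disjunct_": "title:lucene" }
  ]
}
```

Deleting the parent then removes the whole block, which is what testNoDanglingDecomposition exercises below.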

The current optional cache implementation uses Caffeine instead of lucene-monitor's simpler ConcurrentHashMap. It's worth noting that this cache should likely be quite a bit larger than your average query or document cache, since query parsing involves a non-trivial amount of compute and disk I/O (especially for large results and/or queries). It's also worth noting that lucene-monitor keeps all the indexed queries cached in memory within its default configuration. A unique solr-monitor feature is the addition of a bespoke cache warmer that tries to populate the cache with approximately all the queries updated since the last commit. This approach was added to have a baseline when comparing with lucene-monitor performance. The goal was to make it possible to effectively cache all queries in memory (since that is what lucene-monitor enables by default) but not necessarily require it.
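The warming idea can be sketched with plain JDK collections. This is a toy stand-in only: the real implementation uses Caffeine and parses actual Lucene queries, and every name below is illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy sketch of the warming idea: a bounded, access-ordered LRU map of parsed
 * queries keyed by query id. The warmer replays the queries updated since the
 * last commit so the hottest entries are resident before searches arrive.
 */
public class QueryCacheSketch {
  private final int maxSize;
  private final Map<String, String> cache; // queryId -> parsed query (String stands in for Query)

  public QueryCacheSketch(int maxSize) {
    this.maxSize = maxSize;
    this.cache =
        new LinkedHashMap<>(16, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
            // Evict the least-recently-used entry once the cap is exceeded.
            return size() > QueryCacheSketch.this.maxSize;
          }
        };
  }

  /** Warm with queries updated since the last commit; newest entries survive eviction. */
  public void warm(Map<String, String> updatedSinceLastCommit) {
    updatedSinceLastCommit.forEach(cache::put);
  }

  public String get(String queryId) {
    return cache.get(queryId);
  }

  public int size() {
    return cache.size();
  }
}
```

In the real module the cap would be sized well above a typical query cache, for the reasons above.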

Currently the PR has some visitor classes in the org.apache.lucene.monitor package that expose certain lucene-monitor internals. If this approach gets accepted then the lucene project will likely need to be updated to expose what is necessary.

Tests

  1. testMonitorQuery: basic functionality before and after an update
  2. testNoDocListInResponse: The current API allows for two types of responses: a special monitorDocuments response that can relay lucene-monitor's response structure and unique features such as "reverse" highlights, and a regular solr document list with each "response" document really referring to a query that matches the "real" document being matched. This test ensures you can disable the solr document list from coming back in the response.
  3. testDefaultParser: validate that solr-monitor routes to default parser when none is selected.
  4. testDisjunctionQuery: validate that subqueries of a disjunction get indexed separately.
  5. testNoDanglingDecomposition: validate that deleting a top-level query also removes all the child disjuncts.
  6. testNotQuery
  7. testWildCardQuery
  8. testDefaultQueryMatchTypeIsNone: If no match type is selected with the monitorMatchType field then only a solr document list is returned (same behavior as "forward" search).
  9. testMultiDocHighlightMatchType: Test highlight matcher on a multi-document batch and ensure it returns the character offsets and positions of all individual matches. It is worth noting that percolator returns the actual matching text snippet. This is something we could consider supporting within solr or adding to lucene-monitor.
  10. testHighlightMatchType: Single-doc highlight test. Slightly different from the one above in that the highlighted field does not need to be storeOffsetsWithPositions="true", which is pretty convenient. I am not sure if I am relying on a MemoryIndex implementation detail, but it is a bit tedious for users to update their schemas to have storeOffsetsWithPositions="true" just to get character offsets back from the highlight matcher. I also don't know if there is a better way to handle the multi-doc case .. maybe break each doc into its own MemoryIndex reader so that we get offsets by default without specifying storeOffsetsWithPositions="true"?
  11. manySegmentsQuery: The cache warmer has reader-leaf-dependent logic so this was included to verify everything works on a multi-segment index.

All of the above are also tested with the custom configurations below:

  1. Parallel matcher - lucene-monitor allows running the final, most-expensive matching step in a multi-threaded environment. The current solr-monitor implementation allows for this with some restrictions. For instance, it is difficult to populate a document response list from a fully asynchronous matching component because it would require awkwardly opening and closing leaf collectors on-demand. The more idiomatic Solr approach would be to just run this on many shards and gain parallelism as recommended here. Still, during testing I found that a fully async postfilter in a single shard had better performance than an equally-parallel multi-sharded, synchronous postfilter, so I've decided to keep it in the initial proposal. On top of that, it helps achieve greater feature parity with lucene-monitor (which obviously has no concept of sharding and so can only parallelize with a special matcher).
  2. Stored monitor query - allow storing queries with stored="true" instead of using the recommended docValues. docValues have stricter single-value size limits, so this is mainly to accommodate humongous queries.
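The fan-out/collect shape of the parallel matching trade-off described above can be sketched with a plain thread pool. This is a stdlib illustration, not solr-monitor's actual ParallelSolrMatcherSink; all names and signatures are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

/**
 * Toy sketch of the parallel matcher idea: fan the expensive per-query match
 * step out to a thread pool within a single shard, instead of relying on
 * multi-shard parallelism.
 */
public class ParallelMatcherSketch {
  private final ExecutorService pool;

  public ParallelMatcherSketch(int threads) {
    this.pool = Executors.newFixedThreadPool(threads);
  }

  /** Runs matcher.test(id) for each candidate in parallel, preserving input order. */
  public List<String> match(List<String> candidateQueryIds, Predicate<String> matcher) {
    List<Future<String>> futures = new ArrayList<>();
    for (String id : candidateQueryIds) {
      futures.add(pool.submit(() -> matcher.test(id) ? id : null));
    }
    List<String> matches = new ArrayList<>();
    for (Future<String> future : futures) {
      try {
        String id = future.get();
        if (id != null) {
          matches.add(id);
        }
      } catch (InterruptedException | ExecutionException e) {
        throw new RuntimeException(e);
      }
    }
    return matches;
  }

  public void shutdown() {
    pool.shutdown();
  }
}
```

The awkward part called out above is not the fan-out itself but feeding results back into a doc-list response, which needs leaf collectors that this sketch deliberately omits.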

I'll report here that I also have some local performance tests which are difficult to port but that helped guide some of the decisions so far. I've also "manually" tested the custom tlog deserialization of the derived query field but this verification should probably go somewhere in a TlogReplay test. I haven't gone down that rabbit hole yet as I wanted to poll for some feedback first. The reason I skip TLog for the derived query fields is because these fields wrap a tokenstream which in itself is difficult to serialize without a custom analyzer. The goal was to let users leverage their existing document schema as often as possible instead of having to create something custom for the query-monitoring use-case.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check. TODO some apparently unrelated test failures
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide


<config>
<luceneMatchVersion>9.4</luceneMatchVersion>
Contributor

minor/maybe

Suggested change
<luceneMatchVersion>9.4</luceneMatchVersion>
<luceneMatchVersion>${tests.luceneMatchVersion:LATEST}</luceneMatchVersion>

Author

thanks for catching this! .. I had to change it in a few other places so I made a separate commit


apply plugin: 'java-library'

description = 'Apache Solr Monitor'
Contributor

That is so puzzling to anyone who isn't intimately familiar with Lucene Monitor. I don't even think we should be calling this "Solr Monitor"; it looks like an infrastructure monitoring thing. Possibly "Solr-Lucene-Monitor" but still... a puzzling name.

Author

This is a great point .. The library used to be called luwak which I find to be a much better name... I'll try to think of a better name (maybe solr-reverse-search or solr-query-alerting). I'll reply in more detail to your mailing list message also touching on solr.cool and the sandbox.

Contributor

Saved Searches is a common name, and I assume it is possible to list a user's saved searches too. Or Alerting, but then most people will expect there to be some functionality to ship alerts somewhere...

Author

You're right, if anything this might be a part of some larger alerting system, but "saved search" is more accurate.

Contributor

Saved searches is a pretty indicative name. Percolator is also a known name for this kind of functionality.

Author

Interesting, I thought ES invented "percolator" as more of a metaphor... I wasn't aware that this is a more generic name. I was worried that "percolator" might clash too much with ES.

Contributor

@cpoerschke cpoerschke left a comment

Hi @kotman12 - thanks for working on this!

I started "just browsing" this PR this morning, so the inline comments may seem a bit random or general, but I'm sharing them anyhow in case they're useful. I haven't considered any naming or solr-versus-solr-sandbox-versus-elsewhere aspects at this point, i.e. I was just browsing.

Comment on lines 34 to 35
private String queryFieldNameOverride;
private String payloadFieldNameOverride;
Contributor

subjective: could maybe initialise to the defaults here, overriding in init if applicable, and then avoid the null-or-not checks in getInstance

Suggested change
private String queryFieldNameOverride;
private String payloadFieldNameOverride;
private String queryFieldName = MonitorFields.MONITOR_QUERY;
private String payloadFieldName = MonitorFields.PAYLOAD;

Author

yea .. this makes sense to me. Is there really any value to these overrides though? I don't have a good reason why I chose to make these two fields overridable but not the other reserved fields. Is it safe to assume that a field prefixed by _ won't be in the user space anyway? If that is the case then this override business is overkill. Otherwise, we probably should make everything overridable.


public class MonitorUpdateProcessorFactory extends UpdateRequestProcessorFactory {

private Presearcher presearcher = PresearcherFactory.build();
Contributor

noting that init has PresearcherFactory.build(presearcherType) also.

Suggested change
private Presearcher presearcher = PresearcherFactory.build();
private Presearcher presearcher;

Author

With the latest change the Presearcher only gets initialized in the ReverseQueryParserPlugin and I share that core-level-singleton by making MonitorUpdateProcessorFactory a SolrCoreAware type. Not sure if there is a better pattern for this? This would be admittedly nicer with some kind of DI mechanism.

*/

/** This package contains Solr's lucene monitor integration. */
package org.apache.solr.monitor;
Contributor

minor: surprised that package-info.java seems to be not needed for the lucene/monitor sub-directory, or maybe the checking logic just isn't checking for it.

Author

I was hoping to actually make some changes to lucene in order to avoid the need for that package. I wanted to gauge the viability of "solr-monitor" before suggesting changes to the lucene upstream. The way I see it, lucene-monitor has very nice optimizations for making saved search fast, but the interface is tightly sealed and makes very opinionated choices about stuff like caching, which makes it hard to integrate into something like solr. Not to mention, lucene-monitor's index isn't "pluggable" or exposed in any way. It just seemed easier to expose the relevant algorithms within lucene-monitor rather than trying to hack the whole kitchen sink into solr. Sorry about the tangent 😃

Comment on lines 72 to 75
this.queryFieldName = queryFieldName;
this.payloadFieldName = payloadFieldName;
this.core = core;
this.indexSchema = core.getLatestSchema();
Contributor

Wondering if there's an assumption somehow w.r.t. queryFieldName and payloadFieldName being or not being within indexSchema -- and if there's an assumption to check it somewhere, maybe at initialisation time rather than when the first document(s) arrive to make use of the fields etc.

Likewise w.r.t. the MonitorFields.RESERVED_MONITOR_FIELDS referenced later on.

Author

I've added a narrower MonitorFields.REQUIRED_MONITOR_FIELDS set which gets cross-validated against the schema in MonitorUpdateProcessorFactory::inform. In the same place I've also added some more specific schema-validations which get invoked for more specific configurations, i.e. which type of presearcher you are using.

Comment on lines 44 to 52
@Override
public void close() throws IOException {
  super.close();
}

@Override
public void init(NamedList<?> args) {
  super.init(args);
}
Contributor

Suggested change
@Override
public void close() throws IOException {
super.close();
}
@Override
public void init(NamedList<?> args) {
super.init(args);
}

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

class ParallelSolrMatcherSink<T extends QueryMatch> implements SolrMatcherSink {
Author

@kotman12 kotman12 Apr 20, 2024

@cpoerschke I wonder if this whole class would be obviated by #2248 .. I found that because there can be significant overhead in pre-processing documents for reverse search (mainly analysis), parallelizing by throwing more solr cores at the problem wasn't quite as fast as simply parallelizing the expensive post filter. But it seems that if that PR (or something similar) were merged we might already run post filters in parallel for each segment?

Contributor

good question, i don't know.

kotman12#2 proposes to defer parallel matching logic i.e. to not have it in the initial integration: that reduces scope and complexity initially code wise etc. and also from users' perspectives i.e. when setting a system up, should more shards or replica or more threads be used etc.

defer as opposed to remove i.e. it could come back as a future optimisation or enhancement, after an initial integration.


package org.apache.solr.monitor.search;

public enum QueryMatchType {
Contributor

kotman12#3 proposes to defer monitorMatchType support (always use simple matching) i.e. to not have it in the initial integration: that reduces scope and complexity initially code wise etc. and also from users' initial onboarding perspective, perhaps.

defer as opposed to remove i.e. looking at https://lucene.apache.org/core/9_11_0/monitor/org/apache/lucene/monitor/QueryMatch.html documentation there's not only HighlightsMatch but also ExplainingMatch and ScoringMatch.

Author

@cpoerschke I responded on the actual PR but thinking I should have replied here instead

Contributor

No worries at all. Perhaps 'main' points, like your kotman12#3 (comment) comment, could be "here" and 'minor' points, like my file-level https://github.com/kotman12/solr/pull/3/files#r1651440706 remark about the replacement of the queryMatchType.needsScores evaluation, could be "there" or something like that.

Picking up from part of the kotman12#3 (comment) points:

... But I think the doc list response is the correct approach because users are already familiar with it. It seamlessly integrates with other solr features ...

Yes, I think so too.

... I'm still thinking about how features like highlights or explaining match could fit into this (in a later iteration) ...

I don't yet understand the highlights and explaining matches enough to comment specifically on that, but conceptually document transformers might be a fit: https://solr.apache.org/guide/solr/latest/query-guide/document-transformers.html

Author

I can see that working. Either way, it would be best to utilize lucene-monitor's highlight matches since they come with the character offsets for free. If we pipe that data to a DocTransformer (perhaps via setContext?), then we can bind it to a doc-level field... It shouldn't need any index access since the match offsets and document text will suffice. The document text will likely be in the reverse search request itself (and the reverse search index likely won't have the underlying document text anyway unless the same collection is being used). There might be some edge-cases like pre-analyzed document fields which can theoretically be matched in the monitor without text but that is for sophisticated users who probably understand the various limitations already.

As an aside, do you know why the solr highlighters don't seem to use a DocTransformer (not directly at least)? Prior to you linking this feature I assumed reverse search highlights would adhere to a similar API to existing highlights with that separate top-level highlighting field. But I can also see how this is a bit awkward and perhaps "vestigial" ... also there is an interesting 16-year old // TODO in that section, perhaps referring to the same thing 😃

Contributor

As an aside, do you know why the solr highlighters don't seem to use a DocTransformer (not directly at least)?

Good question, I don't know but perhaps @dsmiley might.

... perhaps "vestigial" ...

I re-learnt an interesting word there :-)

also there is an interesting 16-year old // TODO in that section, perhaps referring to the same thing 😃

Once there's an existing way I guess it's a little tricky to move away from that, with backwards compatibility (adding a new alternative way, deprecating-but-retaining-the-existing way, then removing the existing way) but maybe there's also other reasons/benefits of using a top-level highlighting fields, I haven't thought about it much as yet.

Contributor

https://issues.apache.org/jira/browse/SOLR-3479 PR welcome :-). DocTransformer came out after SearchComponent based Highlighter existed.
It would probably be not too hard to do one using the UnifiedHighlighter. It would also enable massive streaming response use-cases that can't easily be done today.

Author

... But I think the doc list response is the correct approach because users are already familiar with it. It seamlessly integrates with other solr features ...

Yes, I think so too.

@cpoerschke the tests reminded me that another motivation for the top-level monitor response object is to handle the "batch" document response, i.e.:

{
    "monitorDocuments": {
        "0": { "queries": [1] },
        "1": { "queries": [1, 2, 5] }
    }
}

where a user matches many documents and expects the query matches to be bucketed per requested document. This is currently part of the original monitor API although it's arguable whether it is in scope for this initial solr integration.

The reason I bring it up is because your PR inspired me to try to remove the query match type altogether and that is one thing I ran into. I especially wanted to get rid of the noisy custom codec and merging logic. I was secretly hoping there would be a minimal way to add back the highlighting in a more integrated form (perhaps via some clever ReturnFields/DocTransformer rigging in the ReverseSearchComponent) after the de-noising but it might be challenging to do it minimally.

Contributor

... matches many documents and expects the query matches to be bucketed per requested document. This is currently part of the original monitor API although it's arguable whether it is in scope for this initial solr integration. ...

Hmm, yes, that's an interesting question. I think my thoughts on this would be something along these lines:

  • Is matching multiple documents at the same time a common use case?
  • If it's not in scope for the initial integration then it could be added later potentially.
  • Single document makes for a simpler request and response interface from the user's perspective and from the code perspective too e.g. no need to check if there's zero documents in the batch or duplicates in the batch.
  • Multiple documents, or the potential for multiple documents, reduces familiarity with i.e. similarity to existing functionality i.e. it would be somewhat akin to supporting more than one q parameter in a search. And if (say) two documents are supplied, does it (from the user's point of view) mean to match against either of them or does it mean to match against both of them?
  • Nested documents -- I don't know much about them yet actually -- might they provide a way (beyond the initial integration here) to support multiple-documents logic via a single document?

Co-authored-by: Christine Poerschke <[email protected]>
Comment on lines 608 to 619
void validateDocList(QueryResponse response, int doc, Object expectedValue) {
  SolrDocumentList actualValues = (SolrDocumentList) response.getResponse().get("response");
  // minimal checks only so far; TODO: make this more comprehensive
  if (expectedValue instanceof List) {
    List<Object> expectedValues = (List) expectedValue;
    if (expectedValues.size() == 1 && actualValues.size() <= 2) {
      assertEquals(
          expectedValues.get(0),
          actualValues.get(actualValues.size() - 1 - doc).getFieldValue(MonitorFields.QUERY_ID));
    }
  }
}
Contributor

kotman12#4 tries to make the MonitorSolrQueryTest.validate's expectedValue argument type more specific (String or List instead of Object) but even with that I haven't yet managed to create a validateDocList implementation for the existing test expectations. Not yet anyhow.

…her.ANYTOKEN_FIELD (which is now accessible in Lucene 9.11.1)
@zzBrunoBrito

I'm really excited about this PR because that looks like a great feature to use. Are there any updates on this?

@kotman12
Author

kotman12 commented Nov 7, 2024

I'm really excited about this PR because that looks like a great feature to use. Are there any updates on this?

Hey @zzBrunoBrito , thanks for the comment! We are currently testing out a fork of this change internally in my company. Would you be willing to test this yourself as a beta user? Your feedback could drive decisions about the direction of this change, i.e. what features to include/exclude. Let me know what you need to test this module. If you can check out the branch and build it you should be able to pick up the solr-monitor-10.X.X-SNAPSHOT.jar from the solr/modules/monitor/build/libs/ directory and then add it to your solr process's classpath. I'll try to think of an easier way of distributing this change if that sounds too complicated.

The reason I ask you to test it is that one of the blockers for merging this was the limited scope. "Solr-monitor" is a somewhat niche feature, so the community is a little hesitant to commit to maintaining yet another module unless there is at least some interest in it (which is why your comment is much appreciated).

Maybe we can communicate about this on Solr's slack channel since there might be a bit of back-and-forth getting this to work for you at the start. I will eventually document this change but was putting that part off since I wasn't sure anyone would read it 😄

defer monitorMatchType support (always use simple matching)
implementation project(":solr:solrj")
implementation "org.apache.lucene:lucene-core"
implementation "org.apache.lucene:lucene-monitor"
implementation 'com.github.ben-manes.caffeine:caffeine'
Contributor

the CI says

Execution failed for task ':solr:modules:monitor:analyzeClassesDependencies'.
> Dependency analysis found issues.
  usedUndeclaredArtifacts
   - io.dropwizard.metrics:metrics-core:4.2.26@jar

and a remedy might be (something like)

Suggested change
implementation 'com.github.ben-manes.caffeine:caffeine'
implementation 'com.github.ben-manes.caffeine:caffeine'
implementation 'io.dropwizard.metrics:metrics-core'

however with that and ./gradlew clean ; ./gradlew :solr:modules:monitor:analyzeClassesDependencies locally it then instead says

Execution failed for task ':solr:modules:monitor:analyzeClassesDependencies'.
> Dependency analysis found issues.
  unusedDeclaredArtifacts
   - io.dropwizard.metrics:metrics-core:4.2.25@jar

instead i.e. the opposite.


private MonitorConstants() {}

public static final String MONITOR_DOCUMENTS_KEY = "monitorDocuments";
Contributor

minor: could this be relocated to its (only) place of use in the ReverseSearchComponent?

Comment on lines 122 to 123
docAsMap.putIfAbsent(
    req.getSchema().getUniqueKeyField().getName(), UUID.randomUUID().toString());
Contributor

Wondering if it would be simpler to require the caller to supply the key field.

Author

I added this based on user feedback. Alerts are typically not triggered on the id field, yet the id field is the only guaranteed required field. It seemed unnecessary to force users to supply an id field just to satisfy the validation of a different use-case (i.e. they aren't indexing this document but rather "alerting" on its contents, so they shouldn't be required to have any fields). Obviously this doesn't help with required non-id fields, but that's a more specific use-case. Because the fix was simple enough, I figured I'd add it. Is there a downside to this I'm not considering?

Author

@kotman12 kotman12 Nov 13, 2024

I guess I should correct myself.. the id/uniqueKey field isn't strictly required, but I think most schemas will have it (and most likely should have it), so I wanted to accommodate it...

Edit
I suppose the id could be a non-string field? But that should hopefully be rare plus there is an easy workaround..

Contributor

My "would be simpler" thoughts were something along these lines:

  • "the key field must always be supplied" is clear to the user (and for the code)
  • "the key field must never be supplied" is clear to the user (and for the code)
  • "if the key field is not (fully) supplied it will be auto-generated" raises a bunch of questions:
    • will the auto-generation influence the results, hopefully not the results directly but perhaps results ordering?
    • what if the auto-generation (for missing key fields) clashes with supplied key field values?
    • can the auto-generation support my particular type of key field type e.g. composite routing -- https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#document-routing
    • in a sharded setup, could different shards auto-generate the same key (or could a key auto-generated on one shard clash with a supplied key on another shard), unlikely in practice though?

Perhaps a middle ground could be "the key field must be supplied for all documents or for none of the documents, and if it's not supplied then ... will be used to auto-generate and that will only work for ... fields, if you use another key field type then you must supply the keys" or something like that.
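That "all or none" middle ground could be sketched roughly like this (illustrative stdlib code, not the PR's actual implementation; the method and class names are invented):

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;

/**
 * Sketch of an "all or none" policy for the unique key field: either every
 * document in the batch supplies it, or none does and a random UUID is
 * generated. Mixed batches are rejected up front.
 */
public class KeyFieldSketch {
  public static void assignKeys(List<Map<String, Object>> docs, String keyField) {
    long withKey = docs.stream().filter(d -> d.containsKey(keyField)).count();
    if (withKey != 0 && withKey != docs.size()) {
      throw new IllegalArgumentException(
          "key field '" + keyField + "' must be supplied for all documents or for none");
    }
    if (withKey == 0) {
      // No keys supplied: auto-generate; only works for string-typed key fields.
      for (Map<String, Object> doc : docs) {
        doc.put(keyField, UUID.randomUUID().toString());
      }
    }
  }
}
```

Rejecting mixed batches sidesteps the clash-between-generated-and-supplied-keys question entirely; the composite-routing and non-string key-field cases would still need the "you must supply the keys" escape hatch.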

Comment on lines 119 to 124
ReverseSearchDebugInfo rsdebug =
    (ReverseSearchDebugInfo) (rb.req.getContext().get(ReverseSearchDebugInfo.KEY));
if (rsdebug != null) {
  info.add("reverse-search-debug", rsdebug.getReverseSearchDebugInfo());
}

Contributor

If DebugComponent.java had a method something like this

protected void addCustomInfo(ResponseBuilder rb,  NamedList<Object> info) {
  FacetDebugInfo fdebug = (FacetDebugInfo) (rb.req.getContext().get("FacetDebugInfo"));
  if (fdebug != null) {
    info.add("facet-trace", fdebug.getFacetDebugInfo());
  }
  fdebug = (FacetDebugInfo) (rb.req.getContext().get("FacetDebugInfo-nonJson"));
  if (fdebug != null) {
    info.add("facet-debug", fdebug.getFacetDebugInfo());
  }
}

then a solr/modules/monitor (or indeed any other custom plugin) could extend the class and override the method.

Contributor:

Added 44eea79 commit to explore that. WDYT?

Author:

I like it. It would be nice if you didn't have to use inheritance, because inheritance makes "layering" debug components challenging (ideally the debug component could iterate over other customInfoProviders to collect custom debug metrics). But that's a hypothetical problem since nothing that requires this exists yet, and the DI mechanism in Solr is painful enough right now that composition would probably be messier.

One question I have is why not keep the ReverseSearchDebugComponent in the default component list for the reverseSearchHandler? The searchHandler includes DebugComponent in its default components so I would think we'd want to keep that consistent.

Author:

I suppose an alternative would be to introduce a normalized List of CustomDebugInfoProviders in the request context. The interface could look like:

interface CustomDebugInfoProvider {
    String name();

    SimpleOrderedMap<Object> asMap();
}

and you could fetch it via:

List<CustomDebugInfoProvider> customInfos =
    (List<CustomDebugInfoProvider>) request.getContext().get("customDebugInfoProvider");
for (var customInfo : customInfos) {
    info.add(customInfo.name(), customInfo.asMap());
}

Let me know what you think. I can also make a commit to clarify.
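The provider-registry idea sketched above can be shown end to end without any Solr types. This is a self-contained illustration; the names (DebugInfoRegistry, CustomDebugInfoProvider) are hypothetical, and a plain Map stands in for Solr's SimpleOrderedMap:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "list of providers in the request context" idea:
// the debug component iterates registered providers instead of hard-coding
// knowledge of each module's debug payload.
class DebugInfoRegistry {
    interface CustomDebugInfoProvider {
        String name();
        Map<String, Object> asMap();
    }

    private final List<CustomDebugInfoProvider> providers = new ArrayList<>();

    void register(CustomDebugInfoProvider provider) {
        providers.add(provider);
    }

    // What the debug component would do: fold every provider's entries into
    // the debug section without knowing their concrete types.
    Map<String, Object> collect() {
        Map<String, Object> info = new LinkedHashMap<>();
        for (CustomDebugInfoProvider provider : providers) {
            info.put(provider.name(), provider.asMap());
        }
        return info;
    }
}
```

This keeps solr/core free of module-specific footprint: a module only needs to register a provider, and layering several debug contributors becomes a loop rather than an inheritance chain.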

Contributor:

... an alternative way ...

Yeah, I had no particular way or implementation in mind as such, only that as a concept solr/modules/foobar components leaving 'footprint' in solr/core DebugComponent.java is perhaps undesirable and that 'solving' it would also benefit custom components in general, something along those lines.

Contributor:

One question I have is why not keep the ReverseSearchDebugComponent in the default component list for the reverseSearchHandler? The searchHandler includes DebugComponent in its default components so I would think we'd want to keep that consistent.

The QueryComponent and DebugComponent are implicit components, so that makes it easier to include them in the defaults I guess. No strong views on including or not including ReverseSearchDebugComponent in the default components but I wanted to also explore/demonstrate that the code works even if the debug component is not configured.

The 3e7a53c commit (somewhat subjectively of course) removes the sink factory since there's only one type of sink now; this (perhaps) makes the code easier to understand, and it also allowed rb.isDebug() qualification for the collection of the debug info.

@Override
protected List<String> getDefaultComponents() {
  ArrayList<String> names = new ArrayList<>(2);
  names.add(QueryComponent.COMPONENT_NAME);
Contributor:

Added bf0daf7 commit to explore not configuring the query component.

Suggested change
- names.add(QueryComponent.COMPONENT_NAME);

QueryMatch.SIMPLE_MATCHER::createMatcher,
new IndexSearcher(documentBatch.get()),
matchingQueries -> {
if (rb.isDebug()) {
Contributor:

Just noting that the 3e7a53c commit is a refactor except for the addition of the rb.isDebug() qualification here. WDYT?

And I wonder: if the solr/core DebugComponent had a 'custom info provider' concept, could ReverseSearchComponent potentially implement that, i.e. would there then be no need for a custom reverse debug component? Though maybe I don't yet understand in enough detail the interaction between the reverse query component and the base query component, and consequently the interaction with the debug component after that, and the resulting implementation nuances.

Comment on lines +51 to +59
var originalMatchQuery = entry.getMatchQuery();

var matchQuery = new ConstantScoreQuery(originalMatchQuery);

boolean isMatch = matcherSink.matchQuery(queryId, matchQuery, entry.getMetadata());
if (isMatch && !queryId.equals(lastQueryId)) {
  lastQueryId = queryId;
  super.collect(doc);
}
Contributor:

Seeking to better understand the lastQueryId logic here:

  • could this method be called from multiple threads concurrently?
  • could the equality check happen earlier? what would happen if there was no check and the doc was always collected if there's a match?
Suggested change
- var originalMatchQuery = entry.getMatchQuery();
- var matchQuery = new ConstantScoreQuery(originalMatchQuery);
- boolean isMatch = matcherSink.matchQuery(queryId, matchQuery, entry.getMetadata());
- if (isMatch && !queryId.equals(lastQueryId)) {
-   lastQueryId = queryId;
-   super.collect(doc);
- }
+ if (!queryId.equals(lastQueryId)) {
+   var originalMatchQuery = entry.getMatchQuery();
+   var matchQuery = new ConstantScoreQuery(originalMatchQuery);
+   boolean isMatch = matcherSink.matchQuery(queryId, matchQuery, entry.getMetadata());
+   if (isMatch) {
+     lastQueryId = queryId;
+     super.collect(doc);
+   }
+ }

vs.

Suggested change
- var originalMatchQuery = entry.getMatchQuery();
- var matchQuery = new ConstantScoreQuery(originalMatchQuery);
- boolean isMatch = matcherSink.matchQuery(queryId, matchQuery, entry.getMetadata());
- if (isMatch && !queryId.equals(lastQueryId)) {
-   lastQueryId = queryId;
-   super.collect(doc);
- }
+ var originalMatchQuery = entry.getMatchQuery();
+ var matchQuery = new ConstantScoreQuery(originalMatchQuery);
+ boolean isMatch = matcherSink.matchQuery(queryId, matchQuery, entry.getMetadata());
+ if (isMatch) {
+   super.collect(doc);
+ }

Author:

I can revert; this was also driven by user feedback. Lucene-monitor decomposes top-level disjunctions, i.e. "this OR that" becomes two separate queries, "this" and "that", which roll up to the same queryId but are indexed as separate documents for performance reasons. Basically I wanted to deduplicate, and I was under the impression that query/leaf collectors aren't thread-safe and are always called by a single thread at a time. But perhaps that's not true, or it's too much of an implementation detail to rely on. I haven't yet considered where else this deduplication could go...

Author (@kotman12), Nov 13, 2024:

I index disjunctions as nested documents, which I assume are always collected in the correct order since they occupy the same block.

Edit

Ideally I wanted to avoid grouping, i.e. keeping large in-memory sets, to achieve the deduplication... hence the reliance on ordering. But maybe this isn't the right place and/or the right approach.
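The lastQueryId trick described in this thread can be isolated into a tiny self-contained sketch. It assumes, as above, that decomposed disjuncts sharing a queryId arrive consecutively (same document block), so a single remembered id suffices instead of a grow-forever set (QueryIdDeduper and onMatch are illustrative names, not code from the PR):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the collector's lastQueryId deduplication: decomposed disjuncts
// share a queryId and are assumed to be collected consecutively because they
// sit in the same document block, so only the first match per id is kept.
class QueryIdDeduper {
    private String lastQueryId;
    private final List<String> collected = new ArrayList<>();

    void onMatch(String queryId, boolean isMatch) {
        // Mirrors the PR's logic: collect only on a match whose id differs
        // from the previously collected one.
        if (isMatch && !queryId.equals(lastQueryId)) {
            lastQueryId = queryId;
            collected.add(queryId); // stands in for super.collect(doc)
        }
    }

    List<String> collected() {
        return collected;
    }
}
```

Note the O(1) memory: correctness hinges entirely on the consecutive-ordering assumption, which is exactly what the reviewer's questions about threading and early equality checks are probing.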

var searcher = req.getSearcher();
MonitorQueryCache solrMonitorCache =
(SharedMonitorCache) searcher.getCache(this.solrMonitorCacheName);
SolrMonitorQueryDecoder queryDecoder = new SolrMonitorQueryDecoder(req.getCore());
Contributor (@cpoerschke), Nov 13, 2024:

From code reading

Suggested change
- SolrMonitorQueryDecoder queryDecoder = new SolrMonitorQueryDecoder(req.getCore());
+ SolrMonitorQueryDecoder queryDecoder = new SolrMonitorQueryDecoder(this.queryDecomposer);

looks possible here, will attempt to do that.

edit: never mind; it looked tempting to remove the SolrCore (and thus Solr) dependency from the class, but the core is needed for the SharedMonitorCacheLatestRegenerator code path.

Comment on lines 87 to 116
var req = rb.req;
var documentBatch = documentBatch(req);
var matcherSink =
    new SyncSolrMatcherSink<>(
        QueryMatch.SIMPLE_MATCHER::createMatcher,
        new IndexSearcher(documentBatch.get()),
        matchingQueries -> {
          if (rb.isDebug()) {
            rb.req
                .getContext()
                .put(
                    ReverseSearchDebugComponent.ReverseSearchDebugInfo.KEY,
                    new ReverseSearchDebugComponent.ReverseSearchDebugInfo(
                        matchingQueries.getQueriesRun()));
          }
        });
Query preFilterQuery = presearcher.buildQuery(documentBatch.get(), getTermAcceptor(rb.req));
List<Query> mutableFilters =
    Optional.ofNullable(rb.getFilters()).map(ArrayList::new).orElseGet(ArrayList::new);
rb.setQuery(new MatchAllDocsQuery());
mutableFilters.add(preFilterQuery);
var searcher = req.getSearcher();
MonitorQueryCache solrMonitorCache =
    (SharedMonitorCache) searcher.getCache(this.solrMonitorCacheName);
SolrMonitorQueryDecoder queryDecoder = new SolrMonitorQueryDecoder(req.getCore());
mutableFilters.add(
    new MonitorPostFilter(
        new SolrMonitorQueryCollector.CollectorContext(
            solrMonitorCache, queryDecoder, matcherSink)));
rb.setFilters(mutableFilters);
Contributor:

Added 113ca8f commit with some reordering and edits here to aid code reading comprehension, also removes duplicate req.getSearcher().getCache() call as a side effect.

*
*/

package org.apache.lucene.monitor;
Contributor:

Opened apache/lucene#13993 to propose to make DocumentBatch public.

6 participants