De-duplicate keywords before search #641

@brent-hartwig

Problem Description:

Although this idea came from an extreme example, it could also help when the system is under strain, since qualifying searches would return faster.

#633 documents a search with 217 words. When not quoted, the search takes 25-26 seconds, which exceeds the 20-second timeout. MarkLogic removes 95 stop words, but that still leaves 122 words that have the keyword search pattern applied to them. Each of those 122 words is individually checked in the referenceName index (>22 million values) of all documents linked to by the lux('itemAny') predicate (>20 million docs). That's a lot of index lookups and ensuing joins.

#635's optimization idea would not have helped with this particular search because the keyword search pattern includes a values function.

Expected Behavior/Solution:
After removing stop words, de-duplicate the remaining words and phrases. Only de-duplicate criteria that are resolved against the same document set and sit in the same grouping (AND or OR). In #633's case, that would have reduced the 122 remaining words to 105 or 106, depending on case sensitivity. Speculating, that could have allowed the query to complete in about 87% of the time, which is not necessarily enough to come in under the current search timeout.
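To make the proposal concrete, here is a minimal sketch of the de-duplication step in Python. It is not the LUX backend code: the `Criterion` type, its `scope` and `group` fields, the `dedupe_criteria` function, and the tiny stop-word list are all hypothetical names chosen for illustration. In particular, the stop-word set does not reproduce MarkLogic's list.

```python
from dataclasses import dataclass

# Hypothetical stop-word list for illustration only; not MarkLogic's actual list.
STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "to"})


@dataclass(frozen=True)
class Criterion:
    """One search term plus the context it is resolved in (hypothetical shape)."""
    term: str   # a word or a quoted phrase
    scope: str  # document set the term is resolved against, e.g. "itemAny"
    group: str  # boolean grouping the term belongs to: "AND" or "OR"


def dedupe_criteria(criteria, case_sensitive=False):
    """Drop stop words, then drop repeats that share the same scope and group.

    The first occurrence of each term is kept and the original order is preserved.
    """
    seen = set()
    kept = []
    for c in criteria:
        if c.term.lower() in STOP_WORDS:
            continue  # stop words are removed before de-duplication
        key_term = c.term if case_sensitive else c.term.lower()
        key = (c.scope, c.group, key_term)
        if key in seen:
            continue  # same term, same document set, same grouping
        seen.add(key)
        kept.append(c)
    return kept


words = "The cat and the Cat sat on the mat cat".split()
criteria = [Criterion(w, "itemAny", "AND") for w in words]

print([c.term for c in dedupe_criteria(criteria)])
# ['cat', 'sat', 'on', 'mat']
print([c.term for c in dedupe_criteria(criteria, case_sensitive=True)])
# ['cat', 'Cat', 'sat', 'on', 'mat']
```

Keying on `(scope, group, term)` is one way to honor the restriction above: a word repeated in a different grouping or against a different document set is left alone. If cost scaled linearly with the number of remaining terms (an assumption, not a measurement), going from 122 to 106 terms gives 106/122 ≈ 0.87, which is in line with the 87% estimate.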

Requirements:
See above.

Needed for promotion:
If an item on the list is not needed, it should be crossed off but not removed.

  • Wireframe/Mockup - Not needed
  • Committee discussions - Sarah
  • Feasibility/Team discussion - Sarah
  • Backend requirements - See above
  • Middle tier requirements - None
  • Frontend requirements - None
  • Are new regression tests required for QA - Zach
  • Questions
  • List of questions for discussions. Answers should be documented within the issue.

UAT/LUX Examples:

Dependencies/Blocks:

  • Blocked By: Nothing
  • Blocking: Better performance for qualifying searches.

Related Github Issues:

None at the time of submission.

Related links:

None at the time of submission.

Wireframe/Mockup:
N/A
