18 search improvements #27

Draft
wants to merge 89 commits into
base: master
Changes from 60 commits
9ff8e4f
Bump luceneVersion from 5.3.0 to 8.6.0
dependabot[bot] Aug 11, 2020
48433e5
Update imports and replace deprecated functionality
patrick-austin Jan 11, 2022
60b659b
Enable basic sorted set facets #19
patrick-austin Jan 12, 2022
f3b1dff
Update pom.xml with Facets #19
patrick-austin Jan 12, 2022
fc0d2d3
Merge pull request #20 from icatproject/dependabot/maven/luceneVersio…
patrick-austin Jan 24, 2022
1229a9a
Query on datafile date property. Fixes #8
stuartpullinger Jun 5, 2020
290ad81
Update release notes for 1.1.1 release
MRichards99 Aug 24, 2021
d80515e
[maven-release-plugin] prepare release v1.1.1
MRichards99 Aug 24, 2021
7bf3459
[maven-release-plugin] prepare for next development iteration
MRichards99 Aug 24, 2021
a44db65
[maven-release-plugin] prepare release v1.1.1
stuartpullinger Aug 27, 2021
32f9fbe
Converted setup to python 3
stuartpullinger Jan 14, 2020
b902385
Update icat.utils version
MRichards99 Sep 16, 2021
4416be4
Update version and release notes
MRichards99 Sep 16, 2021
8a36fb0
Add snapshot to version
MRichards99 Sep 16, 2021
f0be663
[maven-release-plugin] prepare release v1.1.2
MRichards99 Sep 16, 2021
0ea7709
Replace travis.yml with ci-build.yml #13
patrick-austin Jan 11, 2022
df3f18a
Update CI status badge for GHA #13
patrick-austin Jan 11, 2022
093a0ff
Move strategy matrix inside build #13
patrick-austin Jan 11, 2022
e81defd
Remove redundant inclue #13
patrick-austin Jan 21, 2022
aec760f
Change OpenJDK distribution #13
patrick-austin Jan 21, 2022
3a4c301
Change Maven command to "mvn test -B" #13
patrick-austin Jan 24, 2022
b5b5d2d
Avoid index error for maxScore
patrick-austin Feb 2, 2022
3ecdaac
Add synonym injection on search #16
patrick-austin Jan 11, 2022
bcf46af
Avoid index error for maxScore
patrick-austin Feb 2, 2022
2046da5
Handle facet exceptions from server tests #19
patrick-austin Feb 10, 2022
7c12768
Add script to generate synonyms from csv #16
patrick-austin Feb 11, 2022
b32f3aa
Take equivalent labels into account #16
patrick-austin Feb 12, 2022
3b5fd8c
Change order of terms in tests #16
patrick-austin Feb 12, 2022
fea2d47
Replace searcherManager with readerManager #19
patrick-austin Mar 9, 2022
fee6356
Merge branch 'master' into dependabot/maven/luceneVersion-8.6.0
patrick-austin Mar 23, 2022
a4a822b
Enable sorting of string fields #25
patrick-austin Mar 24, 2022
8eda4ca
Add support for fields and searchAfter #25
patrick-austin Mar 26, 2022
851cedb
Implement incremental sharding #26
patrick-austin Apr 4, 2022
bb53a1c
Merge branch '18_search_improvements' into 25_enable_field_sorting
patrick-austin Apr 4, 2022
cf74dc8
Merge pull request #28 from icatproject/25_enable_field_sorting
patrick-austin Apr 4, 2022
04ae002
Merge branch '18_search_improvements' into 26_multireader_subindices
patrick-austin Apr 4, 2022
e7b47db
Merge pull request #29 from icatproject/26_multireader_subindices
patrick-austin Apr 4, 2022
9477ea8
Rename JSON keys for clarity over id #18
patrick-austin Apr 6, 2022
434b66b
Text fields and related entities #30
patrick-austin Apr 8, 2022
41daae5
Merge branch '18_search_improvements' into 19_enable_facets
patrick-austin Apr 8, 2022
f1801b0
Merge pull request #31 from icatproject/30_encode_related_ids
patrick-austin Apr 12, 2022
fbc99e6
Enable generic String and range facets #19
patrick-austin Apr 13, 2022
8907a7c
Basic unit conversion #19
patrick-austin Apr 14, 2022
8438e1f
Add unit conversion dependencies #19
patrick-austin Apr 14, 2022
45a3948
Refactor unit conversion to utils #19
patrick-austin Apr 30, 2022
8856738
Use mapping for parseSearchAfter types #19
patrick-austin May 25, 2022
008c68a
WIP sharding changes from stash #19
patrick-austin Jun 1, 2022
49373f5
Add fields needed for DGS component #19
patrick-austin Jun 8, 2022
2fc0f8e
Use .keyword for string facets #19
patrick-austin Jun 10, 2022
757da57
Filters and aborted search support #19
patrick-austin Jun 16, 2022
973d31c
Allow searchAfter for uneven shards #19
patrick-austin Jun 16, 2022
b3d4c52
Sparse string faceting fix #19
patrick-austin Jun 15, 2022
663ea42
Enable parsing of multivalued filters #19
patrick-austin Jun 17, 2022
eaafc89
Refactors and Javadoc comments #19
patrick-austin Jun 20, 2022
4913230
Support for searching on sample name #19
patrick-austin Jun 22, 2022
338dda3
SampleParameter, fileCount, value in range #19
patrick-austin Jul 22, 2022
ce51e33
Add utility to lock #19
patrick-austin Aug 2, 2022
5f59e1d
Formatting changes #19
patrick-austin Jul 24, 2022
902654b
Improved timeout and search syntax errors #19
patrick-austin Aug 5, 2022
1eac7e0
Error handling fix and range check for lock #19
patrick-austin Aug 9, 2022
182b5e5
Fix shardList not accepting new shards #19
patrick-austin Aug 17, 2022
cd37717
Merge pull request #22 from icatproject/19_enable_facets
patrick-austin Aug 17, 2022
2a24bf7
Merge branch '18_search_improvements' into 16_enable_synonyms
patrick-austin Aug 17, 2022
eabef14
Merge branch '18_search_improvements' into 16_enable_synonyms
patrick-austin Aug 17, 2022
d8d1e76
Move synonym analyzer to DocumentMapping #16
patrick-austin Aug 17, 2022
32c2f33
Add support for faceting DatasetTechnique #18
patrick-austin Sep 7, 2022
d051925
Update version #18
patrick-austin Sep 9, 2022
2e359ee
Refactor Field and large Lucene functions #18
patrick-austin Sep 29, 2022
4a7e9db
run.properties settings updates #18
patrick-austin Oct 12, 2022
deceb46
Merge branch '18_search_improvements' into 16_enable_synonyms
patrick-austin Oct 17, 2022
7e53648
parse_synonyms clean up and check for null synonyms #16
patrick-austin Oct 17, 2022
c790b5d
Remove returns from Field.java #18
patrick-austin Oct 21, 2022
8662e05
Update Lucene to 8.11.2 and remove search caching #18
patrick-austin Nov 24, 2022
885b876
Replace numRamDocs with hasUncommittedChanges #18
patrick-austin Nov 24, 2022
ee9da02
Cache state for facets #18
patrick-austin Jan 20, 2023
421020b
InvestigationFacilityCycle support
patrick-austin Jan 23, 2023
0a2f653
Merge pull request #34 from icatproject/18_memory_leaks
patrick-austin Sep 6, 2023
1e8ea2b
Merge pull request #38 from icatproject/18b_store_state
patrick-austin Sep 6, 2023
4a511a9
Merge pull request #21 from icatproject/16_enable_synonyms
patrick-austin Sep 6, 2023
65a1c44
Merge branch 'master' into 18_search_improvements
patrick-austin Sep 6, 2023
3ce34c6
Replace javax with jakarta in new files
patrick-austin Sep 6, 2023
453a725
3.0.0 release notes
patrick-austin Sep 8, 2023
d31e5b7
Index id as long instead of String #18
patrick-austin Sep 26, 2023
3dc957a
Refactor facetable fields into run.properties #18
patrick-austin Sep 28, 2023
c9f2154
Add short explanations of new properties #18
patrick-austin Oct 5, 2023
b6d3e60
Add special handling for InvestigationInstrument filters #18
patrick-austin Oct 6, 2023
61301a2
Fix for Investigation Sample filtering #18
patrick-austin Oct 10, 2023
e3f393e
Account for IcatUnits refactors
patrick-austin Mar 22, 2024
bcbe497
Add new properties to init logging
patrick-austin Apr 8, 2024
15 changes: 9 additions & 6 deletions pom.xml
@@ -3,7 +3,7 @@

<groupId>org.icatproject</groupId>
<artifactId>icat.lucene</artifactId>
-<version>1.1.2</version>
+<version>2.0.0-SNAPSHOT</version>
<packaging>war</packaging>
<name>ICAT Lucene</name>

@@ -14,7 +14,7 @@
<repoUrl>https://repo.icatproject.org/repo</repoUrl>
<project.scm.id>github</project.scm.id>
<gitUrl>https://github.com/icatproject/icat.lucene</gitUrl>
-<luceneVersion>5.3.0</luceneVersion>
+<luceneVersion>8.6.0</luceneVersion>
</properties>

<repositories>
@@ -86,6 +86,12 @@
<version>${luceneVersion}</version>
</dependency>

+<dependency>
+<groupId>org.apache.lucene</groupId>
+<artifactId>lucene-facet</artifactId>
+<version>${luceneVersion}</version>
+</dependency>
+
<dependency>
<groupId>javax</groupId>
<artifactId>javaee-api</artifactId>
@@ -95,7 +101,7 @@
<dependency>
<groupId>org.icatproject</groupId>
<artifactId>icat.utils</artifactId>
-<version>4.16.1</version>
+<version>4.17.0-SNAPSHOT</version>
</dependency>

<dependency>
@@ -327,6 +333,3 @@

<description>Exposes lucene calls to an icat server</description>
</project>



1 change: 1 addition & 0 deletions src/main/config/run.properties.example
@@ -3,4 +3,5 @@

directory = ${HOME}/data/lucene
commitSeconds = 5
+maxShardSize = 2147483648
ip = 127.0.0.1/32
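Note that the example value 2147483648 is 2^31, one more than `Integer.MAX_VALUE`, so whatever reads this property cannot parse it as an `int`. The sketch below shows one way it could be loaded with `java.util.Properties`. The class and method names are hypothetical and not taken from this PR, and the unit of `maxShardSize` is not stated in this diff.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of icat.lucene: shows that maxShardSize must be read as a long.
final class ShardSizeConfig {

    /** Parses maxShardSize from properties text, falling back to defaultValue if the key is absent. */
    static long readMaxShardSize(String propertiesText, long defaultValue) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // A StringReader should not fail, but Properties.load declares the checked exception.
            throw new UncheckedIOException(e);
        }
        String raw = props.getProperty("maxShardSize");
        if (raw == null) {
            return defaultValue;
        }
        // Long.parseLong is required here: 2147483648 overflows int, so Integer.parseInt would throw.
        return Long.parseLong(raw.trim());
    }

    public static void main(String[] args) {
        String example = "directory = ${HOME}/data/lucene\n"
                + "commitSeconds = 5\n"
                + "maxShardSize = 2147483648\n"
                + "ip = 127.0.0.1/32\n";
        long maxShardSize = readMaxShardSize(example, Long.MAX_VALUE);
        // Confirms the configured value does not fit in an int.
        System.out.println(maxShardSize > Integer.MAX_VALUE);
    }
}
```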
133 changes: 133 additions & 0 deletions src/main/java/org/icatproject/lucene/DocumentMapping.java
@@ -0,0 +1,133 @@
package org.icatproject.lucene;

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.queryparser.flexible.standard.StandardQueryParser;

public class DocumentMapping {

/**
* Represents the parent-child relationship between two ICAT entities.
*/
public static class ParentRelationship {
public String parentName;
public String joiningField;
public Set<String> fields;

/**
* @param parentName Name of the parent entity.
* @param joiningField Field that joins the child to its parent.
* @param fields Fields that should be updated by this relationship.
*/
public ParentRelationship(String parentName, String joiningField, String... fields) {
this.parentName = parentName;
this.joiningField = joiningField;
this.fields = new HashSet<>(Arrays.asList(fields));
}
}

public static final Set<String> doubleFields = new HashSet<>();
public static final Set<String> facetFields = new HashSet<>();
public static final Set<String> longFields = new HashSet<>();
public static final Set<String> sortFields = new HashSet<>();
public static final Set<String> textFields = new HashSet<>();
public static final Set<String> indexedEntities = new HashSet<>();
public static final Map<String, ParentRelationship[]> relationships = new HashMap<>();

public static final IcatAnalyzer analyzer = new IcatAnalyzer();
public static final StandardQueryParser genericParser = new StandardQueryParser();
public static final StandardQueryParser datafileParser = new StandardQueryParser();
public static final StandardQueryParser datasetParser = new StandardQueryParser();
public static final StandardQueryParser investigationParser = new StandardQueryParser();
public static final StandardQueryParser sampleParser = new StandardQueryParser();

static {
doubleFields.addAll(Arrays.asList("numericValue", "numericValueSI", "rangeTop", "rangeTopSI", "rangeBottom",
"rangeBottomSI"));
facetFields.addAll(Arrays.asList("type.name", "datafileFormat.name", "stringValue", "technique.name"));
longFields.addAll(
Arrays.asList("date", "startDate", "endDate", "dateTimeValue", "investigation.startDate", "fileSize",
"fileCount"));
sortFields.addAll(
Arrays.asList("datafile.id", "dataset.id", "investigation.id", "instrument.id", "id", "sample.id",
"sample.investigation.id", "date", "name", "stringValue", "dateTimeValue", "numericValue",
"numericValueSI", "fileSize", "fileCount"));
textFields.addAll(Arrays.asList("name", "visitId", "description", "location", "dataset.name",
"investigation.name", "instrument.name", "instrument.fullName", "datafileFormat.name", "sample.name",
"sample.type.name", "technique.name", "technique.description", "technique.pid", "title", "summary",
"facility.name", "user.fullName", "type.name", "doi"));

indexedEntities.addAll(Arrays.asList("Datafile", "Dataset", "Investigation", "DatafileParameter",
"DatasetParameter", "DatasetTechnique", "InstrumentScientist", "InvestigationInstrument",
"InvestigationParameter", "InvestigationUser", "Sample", "SampleParameter"));

relationships.put("Instrument",
new ParentRelationship[] { new ParentRelationship("InvestigationInstrument", "instrument.id",
"instrument.name", "instrument.fullName") });
relationships.put("User",
new ParentRelationship[] {
new ParentRelationship("InvestigationUser", "user.id", "user.name", "user.fullName"),
new ParentRelationship("InstrumentScientist", "user.id", "user.name", "user.fullName") });
relationships.put("Sample", new ParentRelationship[] {
new ParentRelationship("Dataset", "sample.id", "sample.name", "sample.investigation.id"),
new ParentRelationship("Datafile", "sample.id", "sample.name", "sample.investigation.id") });
relationships.put("SampleType",
new ParentRelationship[] { new ParentRelationship("Sample", "type.id", "type.name"),
new ParentRelationship("Dataset", "sample.type.id", "sample.type.name"),
new ParentRelationship("Datafile", "sample.type.id", "sample.type.name") });
relationships.put("InvestigationType",
new ParentRelationship[] { new ParentRelationship("Investigation", "type.id", "type.name") });
relationships.put("DatasetType",
new ParentRelationship[] { new ParentRelationship("Dataset", "type.id", "type.name") });
relationships.put("DatafileFormat",
new ParentRelationship[] {
new ParentRelationship("Datafile", "datafileFormat.id", "datafileFormat.name") });
relationships.put("Facility",
new ParentRelationship[] { new ParentRelationship("Investigation", "facility.id", "facility.name") });
relationships.put("ParameterType",
new ParentRelationship[] { new ParentRelationship("DatafileParameter", "type.id", "type.name"),
new ParentRelationship("DatasetParameter", "type.id", "type.name"),
new ParentRelationship("InvestigationParameter", "type.id", "type.name"),
new ParentRelationship("SampleParameter", "type.id", "type.name") });
relationships.put("Technique",
new ParentRelationship[] { new ParentRelationship("DatasetTechnique", "technique.id", "technique.name",
"technique.description", "technique.pid") });
relationships.put("Investigation",
new ParentRelationship[] {
new ParentRelationship("Dataset", "investigation.id", "investigation.name",
"investigation.title", "investigation.startDate", "visitId"),
new ParentRelationship("Datafile", "investigation.id", "investigation.name", "visitId") });
relationships.put("Dataset",
new ParentRelationship[] { new ParentRelationship("Datafile", "dataset.id", "dataset.name") });

genericParser.setAllowLeadingWildcard(true);
genericParser.setAnalyzer(analyzer);

CharSequence[] datafileFields = { "name", "description", "location", "datafileFormat.name", "visitId",
"sample.name", "sample.type.name", "doi" };
datafileParser.setAllowLeadingWildcard(true);
datafileParser.setAnalyzer(analyzer);
datafileParser.setMultiFields(datafileFields);

CharSequence[] datasetFields = { "name", "description", "sample.name", "sample.type.name", "type.name",
"visitId", "doi" };
datasetParser.setAllowLeadingWildcard(true);
datasetParser.setAnalyzer(analyzer);
datasetParser.setMultiFields(datasetFields);

CharSequence[] investigationFields = { "name", "visitId", "title", "summary", "facility.name",
"type.name", "doi" };
investigationParser.setAllowLeadingWildcard(true);
investigationParser.setAnalyzer(analyzer);
investigationParser.setMultiFields(investigationFields);

CharSequence[] sampleFields = { "sample.name", "sample.type.name" };
sampleParser.setAllowLeadingWildcard(true);
sampleParser.setAnalyzer(analyzer);
sampleParser.setMultiFields(sampleFields);
}
}
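The `relationships` map is keyed by an entity name, and each `ParentRelationship` names another indexed entity that carries copies of that entity's fields, joined on `joiningField`. How the PR consumes this map is not shown in this excerpt. The following is only a sketch of one plausible lookup, using a cut-down copy of `ParentRelationship`; the class `RelationshipLookupSketch` and the helper `describeUpdates` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a cut-down copy of the ParentRelationship idea plus a hypothetical lookup helper.
final class RelationshipLookupSketch {

    static final class ParentRelationship {
        final String parentName;
        final String joiningField;
        final Set<String> fields;

        ParentRelationship(String parentName, String joiningField, String... fields) {
            this.parentName = parentName;
            this.joiningField = joiningField;
            // LinkedHashSet keeps declaration order so the printed output is stable.
            this.fields = new LinkedHashSet<>(Arrays.asList(fields));
        }
    }

    static final Map<String, ParentRelationship[]> relationships = new HashMap<>();

    static {
        // Two entries copied from the static block in DocumentMapping.
        relationships.put("User", new ParentRelationship[] {
                new ParentRelationship("InvestigationUser", "user.id", "user.name", "user.fullName"),
                new ParentRelationship("InstrumentScientist", "user.id", "user.name", "user.fullName") });
        relationships.put("Dataset", new ParentRelationship[] {
                new ParentRelationship("Datafile", "dataset.id", "dataset.name") });
    }

    /** Hypothetical helper: lists which documents would need refreshing when the given entity changes. */
    static List<String> describeUpdates(String entityName, long id) {
        List<String> updates = new ArrayList<>();
        // Entities without an entry (for example "Datafile") have no dependent documents.
        for (ParentRelationship r : relationships.getOrDefault(entityName, new ParentRelationship[0])) {
            updates.add(r.parentName + " where " + r.joiningField + " = " + id + " -> " + r.fields);
        }
        return updates;
    }

    public static void main(String[] args) {
        describeUpdates("User", 42L).forEach(System.out::println);
    }
}
```

Because the lookup is by exact string key, entity names need to be spelled with consistent capitalisation throughout the map.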
105 changes: 105 additions & 0 deletions src/main/java/org/icatproject/lucene/FacetedDimension.java
@@ -0,0 +1,105 @@
package org.icatproject.lucene;

import java.util.ArrayList;
import java.util.List;

import javax.json.Json;
import javax.json.JsonObjectBuilder;

import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.LabelAndValue;
import org.apache.lucene.facet.range.DoubleRange;
import org.apache.lucene.facet.range.LongRange;
import org.apache.lucene.facet.range.Range;

/**
* For a single dimension (field), stores labels (the unique values or ranges of
* values for that field in the index) and their respective counts (the number
* of times that label appears in different documents).
*
* For example, a dimension might be "colour", the label "red", and the count 5.
*/
public class FacetedDimension {

private String dimension;
private List<Range> ranges;
private List<String> labels;
private List<Long> counts;

/**
* Creates an "empty" FacetedDimension. The dimension (field) is set but ranges,
* labels and counts are not.
*
* @param dimension The dimension, or field, to be faceted
*/
public FacetedDimension(String dimension) {
this.dimension = dimension;
this.ranges = new ArrayList<>();
this.labels = new ArrayList<>();
this.counts = new ArrayList<>();
}

/**
* Extracts the count for each label in the FacetResult. If the label has
* already been encountered, the count is incremented rather than being
* overridden. Essentially, this allows faceting to be performed across multiple
* shards.
*
* @param facetResult A Lucene FacetResult object corresponding to the relevant
* dimension
*/
public void addResult(FacetResult facetResult) {
for (LabelAndValue labelAndValue : facetResult.labelValues) {
String label = labelAndValue.label;
int labelIndex = labels.indexOf(label);
if (labelIndex == -1) {
labels.add(label);
counts.add(labelAndValue.value.longValue());
} else {
counts.set(labelIndex, counts.get(labelIndex) + labelAndValue.value.longValue());
}
}
}

/**
* Formats the labels and counts into JSON.
*
* @param aggregationsBuilder The JsonObjectBuilder to add the facets for this
* dimension to.
*/
public void buildResponse(JsonObjectBuilder aggregationsBuilder) {
JsonObjectBuilder bucketsBuilder = Json.createObjectBuilder();
for (int i = 0; i < labels.size(); i++) {
JsonObjectBuilder bucketBuilder = Json.createObjectBuilder();
bucketBuilder.add("doc_count", counts.get(i));
if (ranges.size() > i) {
Range range = ranges.get(i);
if (range instanceof LongRange) {
bucketBuilder.add("from", ((LongRange) range).min);
bucketBuilder.add("to", ((LongRange) range).max);
} else if (range instanceof DoubleRange) {
bucketBuilder.add("from", ((DoubleRange) range).min);
bucketBuilder.add("to", ((DoubleRange) range).max);
}
}
bucketsBuilder.add(labels.get(i), bucketBuilder);
}
aggregationsBuilder.add(dimension, Json.createObjectBuilder().add("buckets", bucketsBuilder));
}

/**
* @return The list of Lucene Range Objects for use with numerical facets.
* For String faceting, this will be empty.
*/
public List<Range> getRanges() {
return ranges;
}

/**
* @return The dimension that these labels and counts correspond to.
*/
public String getDimension() {
return dimension;
}

}
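`addResult` sums counts per label so that facet results from several shards can be combined into one dimension. The sketch below shows the same accumulation with only the standard library, using a `LinkedHashMap` and `Map.merge` in place of the parallel `labels` and `counts` lists. It is an alternative formulation for illustration, not the PR's implementation, and `FacetCountMergeSketch` and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative alternative to the parallel labels/counts lists; not the PR's code.
final class FacetCountMergeSketch {

    // Insertion-ordered, so labels appear in the order they were first seen, as in the List-based version.
    private final Map<String, Long> countsByLabel = new LinkedHashMap<>();

    /** Adds one shard's label counts, summing with any count already recorded for the same label. */
    void addShardResult(Map<String, Long> shardCounts) {
        shardCounts.forEach((label, count) -> countsByLabel.merge(label, count, Long::sum));
    }

    Map<String, Long> totals() {
        return countsByLabel;
    }

    public static void main(String[] args) {
        FacetCountMergeSketch colour = new FacetCountMergeSketch();
        Map<String, Long> shard0 = new LinkedHashMap<>();
        shard0.put("red", 5L);
        shard0.put("blue", 2L);
        Map<String, Long> shard1 = new LinkedHashMap<>();
        shard1.put("red", 3L);
        shard1.put("green", 1L);
        colour.addShardResult(shard0);
        colour.addShardResult(shard1);
        System.out.println(colour.totals()); // {red=8, blue=2, green=1}
    }
}
```

A map lookup is constant time per label, whereas `labels.indexOf(label)` scans the list, so the difference would only matter for dimensions with many distinct labels.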
21 changes: 16 additions & 5 deletions src/main/java/org/icatproject/lucene/IcatAnalyzer.java
100644 → 100755
@@ -4,22 +4,33 @@
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
-import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
+import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
-import org.apache.lucene.analysis.standard.StandardFilter;
+// import org.apache.lucene.analysis.standard.StandardAnalyzer ;
import org.apache.lucene.analysis.standard.StandardTokenizer;

+// public class IcatAnalyzer extends Analyzer {
+
+// @Override
+// protected TokenStreamComponents createComponents(String fieldName) {
+// StandardAnalyzer analyzer = new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
+// Analyzer.TokenStreamComponents stream = analyzer.createComponents(fieldName);
+// sink = new EnglishPossessiveFilter(stream.getTokenStream());
+// sink = new PorterStemFilter(sink);
+// return new TokenStreamComponents(source, sink);
+// }
+// }
+
public class IcatAnalyzer extends Analyzer {

@Override
protected TokenStreamComponents createComponents(String fieldName) {
Tokenizer source = new StandardTokenizer();
-TokenStream sink = new StandardFilter(source);
-sink = new EnglishPossessiveFilter(sink);
+TokenStream sink = new EnglishPossessiveFilter(source);
sink = new LowerCaseFilter(sink);
-sink = new StopFilter(sink, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
+sink = new StopFilter(sink, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
sink = new PorterStemFilter(sink);
sink = new PorterStemFilter(sink);
return new TokenStreamComponents(source, sink);
}
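The resulting chain applies `EnglishPossessiveFilter`, `LowerCaseFilter`, `StopFilter` and `PorterStemFilter`, in that order, to the `StandardTokenizer` output. The toy sketch below imitates the first three steps in plain Java to show why lower-casing comes before stop-word removal. It does not use Lucene: whitespace splitting stands in for the tokenizer, the stop-word list is an assumed four-word subset, and stemming is skipped. `AnalyzerOrderSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for the filter chain; it does not use Lucene and skips stemming entirely.
final class AnalyzerOrderSketch {

    // Assumed subset of English stop words, all lower case.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "of", "a", "and"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Whitespace splitting is a crude stand-in for StandardTokenizer.
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            // Stand-in for EnglishPossessiveFilter: drop a trailing 's.
            if (token.endsWith("'s") || token.endsWith("'S")) {
                token = token.substring(0, token.length() - 2);
            }
            // Stand-in for LowerCaseFilter.
            token = token.toLowerCase(Locale.ROOT);
            // Stand-in for StopFilter: matching is exact, so it relies on the lower-casing above.
            if (STOP_WORDS.contains(token)) {
                continue;
            }
            // PorterStemFilter would run here in the real chain.
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Detector's calibration of Sample")); // [detector, calibration, sample]
    }
}
```

If the stop-word check ran before lower-casing in this sketch, "The" would survive because the set only contains "the".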