diff --git a/.gitignore b/.gitignore
index 610afc00..087db308 100644
--- a/.gitignore
+++ b/.gitignore
@@ -53,3 +53,5 @@ js/ssSearch-debug.js
js/ssSearch.js
js/*.js.map
+# JetBrains IDE configuration
+.idea
\ No newline at end of file
diff --git a/build.xml b/build.xml
index 06e59768..cf320f44 100644
--- a/build.xml
+++ b/build.xml
@@ -104,7 +104,7 @@
- Only the <params> element is necessary, but, as we discuss shortly, we highly suggest taking advantage
- of the <rules> (see 8.5.3 Specifying rules (optional)) and <contexts> (8.5.4 Specifying contexts (optional)) for the best results. For examples of full configuration files, see the staticSearch GitHub repository as well as the list of projects in 12 Projects using staticSearch, which provides a link to each site’s configuration file.
-
-
+
-
+
@@ -168,32 +170,24 @@
@@ -201,6 +195,10 @@
+
+ 8.5 Creating a configu
For examples of full configuration files, see the staticSearch GitHub repository as well as the list of projects in 13 Projects using staticSearch, which provides a link to each site’s configuration file.
+The <params> element has four required elements for determining the resource collection that you - wish to index, and controlling the indexing process:
+The <params> element has only one required element, which is used for determining the resource + collection that you wish to index:
| file | +(The path (relative to the config file) to the search page.) | +
The search page is a regular HTML page which forms part of your site. The only important - characteristic it must have is a <div> element with id=staticSearch, whose contents will be rewritten by the staticSearch build process. See 8.6 Creating a search page. A stopword is a word that will not be indexed, because it is too common (the, a, you and so on). There are common stopwords files for most languages available on the - Web, but it is probably a good idea to take one of these and customize it for your - project, since there will be words in the website which are so common that it makes - no sense to index them, but they are not ordinary stopwords. For example, in a Website - dedicated to the work of John Keats, the name keats should probably be added to the stopwords file, since almost every page will include - it, and searching for it will be pointless. The project has a built-in set of common - stopwords for English, which you'll find in xsl/english_stopwords.txt. One way to find appropriate stopwords for your site is to generate your index, then - search for the largest JSON index files that are generated, to see if they might be - too common to be useful as search terms. You can also use the Word Frequency table in the generated staticSearch report (see 8.9 Generated report). The indexing process checks each word as it builds the index, and keeps a record of - all words which are not found in the configured dictionary. Though this does not have - any direct effect in the indexing process, all words not found in the dictionary are - listed in the staticSearch report (see 8.9 Generated report). This can be very useful: all words listed are either foreign (not part of the language - of the dictionary) or perhaps misspelled (in which case they may not be correctly - stemmed and index, and should be corrected). 
There is a default dictionary in xsl/english_words.txt which you might copy and adapt if you're working in English; lots of dictionaries - for other languages are available on the Web.
-The <searchFile> element is a relative URI (resolved, like all URIs specified in the config file, - against the configuration file location) that points directly to the search page that - will be the primary access point for the search. Since the search file must be at - the root of the directory that you wish to index (i.e. the directory that contains + characteristic it must have is a <div> element with id=staticSearch, whose contents will be rewritten by the staticSearch build process. See 8.6 Creating a search page.
+The <searchPage> element's file attribute specifies a relative URI (resolved, like all URIs specified in the config + file, against the configuration file location) that points directly to the search + page that will be the primary access point for the search. Since the search file must + be at the root of the directory that you wish to index (i.e. the directory that contains all of the XHTML you want the search to index), the searchFile parameter provides the necessary information for knowing what document collection to index and where to put the output JSON. In other words, in specifying the location of your search @@ -612,49 +656,107 @@
We also require the <recurse> element in the case where the document collection may be nested (as is common with - static sites generated from Jekyll or Wordpress). The <recurse> element is a boolean (true or false) that determines whether or not to recurse into - the subdirectories of the collection and index those files.
-Finally, in order to support stemming and phrasal search effectively, it is important
- to specify a <stopwordsFile> (a file containing words that will be ignored at index time) and a <dictionaryFile> (also used for indexing). Default files for English and French are supplied in the
- xsl folder, but you will probably want to create or customize the stopword list for your
- own project. You may also supply empty text files for these parameters if for example
- you donʼt want to use a stoplist at all.
The following parameters are optional, but most projects will want to specify some of them:
| recurse | +(Determines whether or not to recurse into the subdirectories of the collection and + index those files.) | +
<versionFile> enables you to specify the path to a plain-text file containing a simple version - number for the project. This might take the form of a software-release-style version - number such as 1.5, or it might be a Subversion revision number or a Git commit hash. It should not - contain any spaces or punctuation. If you provide a version file, the version string - will be used as part of the filenames for all the JSON resources created for the search. - This is useful because it allows the browser to cache such resources when users repeatedly - visit the search page, but if the project is rebuilt with a new version, those cached - files will not be used because the new version will have different filenames. The - path specified is relative to the location of the configuration file (or absolute, - if you wish).
++
| file | +(The path (relative to the config file) to a text file containing a list of words to + be ignored by the indexer (one word per line).) | +
A stopword is a word that will not be indexed, because it is too common (the, a, you and so on). There are common stopwords files for most languages available on the + Web, but it is probably a good idea to take one of these and customize it for your + project, since there will be words in the website which are so common that it makes + no sense to index them, but they are not ordinary stopwords. For example, in a website + dedicated to the work of John Keats, the name keats should probably be added to the stopwords file, since almost every page will include + it, and searching for it will be pointless. staticSearch provides a default set of common stopwords for English, which you'll + find in xsl/english_stopwords.txt. One way to find appropriate stopwords for your site is to generate your index, then + search for the largest JSON index files that are generated, to see if they might be + too common to be useful as search terms. You can also use the Word Frequency table in the generated staticSearch report (see 8.9 Generated report).
+| file | +(The relative path (from the config file) to a dictionary file (one word per line).) | +
The indexing process checks each word as it builds the index, and keeps a record of + all words which are not found in the configured dictionary. Though this does not have + any direct effect in the indexing process, all words not found in the dictionary are + listed in the staticSearch report (see 8.9 Generated report). This can be very useful: all words listed are either foreign (not part of the language + of the dictionary) or perhaps misspelled (in which case they may not be correctly + stemmed and index, and should be corrected). staticSearch provides a default dictionary in xsl/english_words.txt that can be copied and adapted if working in English; lots of dictionaries for other + languages are available on the Web.
+| minWordLength | +(Specifies the minimum length in characters of a sequence of text that will be considered + to be a word worth indexing.) | +
| name | +(Specifies the name of the scoring algorithm to use.) | +
Phrasal search functionality enables your users to search for specific phrases by
- surrounding them with quotation marks ("), as in many search engines. In order to support this kind of search, <createContexts> must also be set to true as we store contexts for all hits for each token in each
- document. Turning this on will make the index larger, because all contexts must be
- stored, but once the index is built, it has very little impact on the speed of searches,
- so we recommend turning this on. The default value is true. However, if your site is very large and your user base is unlikely to use phrasal
- searching, it may not be worth the additional build time and increased index size.
<scoringAlgorithm> is an optional element that specifies which scoring algorithm to use when calculating + the score of a term and thus the order in which the results from a search are sorted.
| dir | +(The path (relative to the config file) of the directory to use for stemming.) | +
The staticSearch project currently has only two real stemmers: an implementation of
the Porter 2 algorithm for modern English, and an implementation of the French Snowball
@@ -682,105 +784,120 @@ 8.5.2.2 Optional param
cases where there are mixed languages so a single stemmer will not do. To use this
option, specify the value stripDiacritics in your configuration file.
<scoringAlgorithm> is an optional element that specifies which scoring algorithm to use when calculating - the score of a term and thus the order in which the results from a search are sorted. - There are currently two options:
--
<createContexts> is a boolean parameter that specifies whether you want the indexer to store keyword-in-context - extracts for each of the hits in a document. This increases the size of the index, - but of course it makes for much more user-friendly search results; instead of seeing - just a score for each document found, the user will see a series of short text strings - with the search keyword(s) highlighted. Note that contexts are necessary for phrasal searching or wildcard searching.
-<minWordLength> specifies the minimum length in characters of a sequence of text that will be considered - to be a word worth indexing. The default is 3, on the basis that in most European - languages, words of one or two letters are typically not worth indexing, being articles, - prepositions and so on. If you set this to a lower limit for reasons specific to your - project, you should ensure that your stopword list excludes any very common words - that would otherwise make the indexing process lengthy and increase the index size.
-<maxKwicsToHarvest> controls the number of keyword-in-context extracts that will be harvested from the - data for each term in a document. For example, if a user searches for the word ‘elephant’, and it occurs 27 times in a document, but the <maxKwicsToHarvest> value is set to 5, then only the first five (sorted in document order) of these keyword-in-context - strings will be stored in the index. (This does not affect the score of the document - in the search results, of course.) If you set this to a low number, the size of the - JSON files will be constrained, but of course the user will only be able to see the - KWICs that have been harvested in their search results. If <phrasalSearch> is set to true, the <maxKwicsToHarvest> setting is ignored, because phrasal searches will only work properly if all contexts - are stored.
-A user may search for multiple common words, so hundreds of hits could be found in - a single document. If the keyword-in-context strings for all these hits are shown - on the results page, it would be too long and too difficult to navigate. This setting - controls how many of those hits you want to show for each document in the result set.
-| create | +Specifies whether the indexer stores keyword-in-context extracts for each hit in a + document. | +
| create | +Specifies whether the indexer stores keyword-in-context extracts for each hit in a + document. | +
| phrasalSearch | +(Whether or not to support phrasal searches. If this is true, then the maxContexts + setting will be ignored, because all contexts are required to properly support phrasal + search.) | +
| wildcardSearch | +(Whether or not to support wildcard searches.) | +
| maxKwicsToHarvest | +(Controls the number of keyword-in-context extracts that will be harvested from the + data for each term in a document.) | +
| maxKwicLength | +(Sets the maximum length (in words) of a keyword-in-context result.) | +
| maxKwicsToHarvest | +(Controls the number of keyword-in-context extracts that will be harvested from the + data for each term in a document.) | +
| kwicTruncateString | +(The string that will be used to signal ellipsis at the beginning and end of a keyword-in-context + extract. Conventionally three periods, or an ellipsis character (which is the default + value).) | +
Obviously, the longer the keyword-in-context strings are, the larger the individual - index files will be, but the more useful the KWICs will be for users looking at the - search results. Note that the phrasal searching relies on the KWICs and thus longer - KWICs allow for longer phrasal searches.
+Note that contexts are necessary for phrasal searching or wildcard searching.
| resultsPerPage | +(The maximum number of document results to be displayed per page. All results are displayed + by default; setting resultsPerPage to a positive integer creates a Show More/Show + All widget at the bottom of the batch of results.) | +
| maxKwicsToShow | +(Controls the maximum number of keyword-in-context extracts that will be shown in the + search page for each hit document returned.) | +
The only reason you might need to specify a value for this parameter is if the language - of your search page conventionally uses a different ellipsis character. Japanese, - for example, uses the 3-dot-leader character.
+
| file | +(The path (relative to the config file) to a text file containing a single version + identifier (such as 1.5, 123456, or 06ad419).) | +
For most sites, where the number of results is likely to be in the low thousands, - it's perfectly practical to show all the results at once, because the staticSearch - processor is so fast. However, if you have tens of thousands of documents, and it's - possible that users will do (for example) filter-only searches that retrieve a large - proportion of them, you can constrain the number of results which are shown initially - using this setting. All the results are still generated and output to the page, but - since most of them are hidden until the ‘Show More’ or ‘Show All’ button is clicked, - the browser will render them much more quickly.
+<version> enables you to specify the path to a plain-text file containing a simple version + number for the project. This might take the form of a software-release-style version + number such as 1.5, or it might be a Subversion revision number or a Git commit hash. It should not + contain any spaces or punctuation. If you provide a version file, the version string + will be used as part of the filenames for all the JSON resources created for the search. + This is useful because it allows the browser to cache such resources when users repeatedly + visit the search page, but if the project is rebuilt with a new version, those cached + files will not be used because the new version will have different filenames. The + path specified is relative to the location of the configuration file (or absolute, + if you wish).
| dir | +(A pointer to a local directory.) | +
When the staticSearch build process creates its output, many files need to be added to the website for which an index is being created. For convenience, all of these files are stored in a single folder. This element is used to specify the name of that folder. The default is staticSearch, but if you would prefer something else, you can specify it here. You may also use this element if you are defining two different searches within the same site, so that - their files are kept in different locations.
+ their files are kept in different locations.The value of the match attribute is transformed in a XSLT template match attribute, and thus must follow
the same rules (i.e. no complex rules like p/ancestor::div). See the W3C XSLT Specification for further details on allowable pattern rules.
Note that the indexer does not tokenize any content in the <head> of the document (but as noted in 8.1 Configuring your site: search filters, metadata can be configured into filters) and that all elements in the <body> of a document are considered tokenizable. However, common elements that you might want to exclude include:
@@ -832,9 +949,9 @@"...the size of the index.Search filtering using any metadata you like,..."⚓+
"...the size of the index.Search filtering using any metadata you like,..."⚓
"...nothing to say here,Some information on this subject can be found...⚓To tell the tokenizer that the <span> constitutes the context block for any of its tokens, use the <context> element with an match pattern: -
"...nothing to say here,Some information on this subject can be found...⚓To tell the tokenizer that the <span> constitutes the context block for any of its tokens, use the <context> element with an match pattern: +
The default context elements are:
Pages may contain different kinds of blocks, or ‘contexts’, that need to be differentiated. For example, consider a page for an online journal article, which includes the article’s title, an abstract, the body of the article, and footnotes. Users may want to search @@ -907,7 +1024,7 @@
A complex site may have two or more search pages targetting specific types of document or content, each of which may need its own particular search controls and indexes. - This can easily be achieved by specifying a different <searchFile> and <outputFolder> in the configuration file for each search.
+ This can easily be achieved by specifying a different <searchPage> and <output> in the configuration file for each search.However, it's also likely that you will want to exclude certain features or documents from a specialized search page, and this is done using the <excludes> section and its child <exclude> elements.
Note that once your file has been processed and all this content has been added, you can process it again at any time; there is no need to start every time with a clean, @@ -1017,7 +1134,7 @@
Before running the search on your own site, you can test that your system is able to do the build by doing the (very quick) build of the test materials. If you simply run the ant command, like this:
-mholmes@linuxbox:~/Documents/staticSearch$ ant⚓+
mholmes@linuxbox:~/Documents/staticSearch$ ant⚓
you should see a build process proceed using the small test collection of documents, and at the end, a results page should open up giving you a report on what was done. If this fails, then you'll need to troubleshoot the problem based on any error messages @@ -1027,7 +1144,7 @@
If the tests all work, then you're ready to build a search for your own site. Now you need to run the same command, but this time, tell the build process where to find your custom configuration file:2
-ant -DssConfigFile=/home/mholmes/mysite/config_staticSearch.xml⚓+
ant -DssConfigFile=/home/mholmes/mysite/config_staticSearch.xml⚓
The same process should run, and if it's successful, you should have a modified search.html page as well as a lot of index files in JSON format in your site HTML folder. Now you can test your own search in the same ways suggested above.
ssConfig or an absolute path using ssConfigFile). Assuming that the build file, your config file, and your staticSearch directory
are all at the root of the project, you could call the staticSearch build in ant like
so:
- -lib parameter (since the project's version of Saxon may conflict, for instance, with
the version used by staticSearch). If your build requires the use of the -lib parameter, then an alternative approach for calling staticSearch from your build
is to use the exec task like so:
- After indexing your HTML files, the staticSearch build then generates an HTML report of helpful statistics and diagnostics about your document collection, which can be - found in the directory specified by <outputFolder>. We recommend looking at this file regularly, especially if you're encountering unexpected + found in the directory specified by <output>. We recommend looking at this file regularly, especially if you're encountering unexpected behaviour by the staticSearch engine, as it contains information that can often help diagnose issues with configured filters or the HTML document collection that, if fixed, can improve staticSearch results.
By default, the report includes only basic information about the number of stem files created, the the filters used, and any problems encountered. However, if you run the build process using the additional parameter ssVerboseReport:
-ant -DssVerboseReport=true -DssConfigFile=...⚓+
ant -DssVerboseReport=true -DssConfigFile=...⚓
then the report will also include a number of tables that outline some statistics about your project. However, please note that compiling these statistics is very memory-intensive and if your site is large, it may cause the build process to run out of memory.
As of version 1.4, the word frequency table is a separate document and is no longer included as part of the verbose report. Instead, after running a build, you can then build just the word frequency table with the special concordance target:
-ant -DssConfigFile=path/to/your/config.xml concordance⚓+
ant -DssConfigFile=path/to/your/config.xml concordance⚓
While the chart itself is not necessary for the core functionality of staticSearch, it is particularly useful during the initial development of a project’s search engine; it can be used to create and fine-tune the project-specific stopword list (i.e. if @@ -1091,18 +1208,18 @@
You can add as many custom attributes as you like (although bear in mind that they increase the size of the index JSON files slightly and may add to the build time).
Those links are also provided with a search string, like this: https://example.com/egPage.html?ssMark=elephant#animals This link points to the section of the document which has id=animals, but it also says ‘the hit text is the word elephant.’ Some JavaScript that runs on the target page, egPage.html (which you control) will be able to parse the value of the query parameter ssMark in order to find the hit text, and highlight it in some way.
Obviously you can implement this any way you like (or just ignore it), but we also supply a small demonstration JavaScript library which implements this functionality, - called ssHighlight.js. This JS file is included into the staticSearch output folder (see <outputFolder>) by default, and if you include it into the header of your own pages, it will probably + called ssHighlight.js. This JS file is included into the staticSearch output folder (see <output>) by default, and if you include it into the header of your own pages, it will probably do the highlighting without further intervention. If, however, you have lots of existing JavaScript that runs when the page loads, there may be some interference between this library and your own code, so you may have to make some adjustments to the code.
@@ -1200,7 +1317,7 @@The search page created for your website is entirely driven by JavaScript. The JavaScript source code can be found in a number of .js files inside the repository js folder. At build time, these files (with the exception of ssHighlight.js and ssInitialize.js) are first concatenated into a single large file called ssSearch-debug.js. This file is then optimized using the Google Closure Compiler, to create a smaller file called ssSearch.js which should be faster for the browser to download and parse. Both of these output - files are provided in your project <outputFolder>; ssSearch.js is linked in your search page, but if you're having problems and would like to debug + files are provided in your project <output>; ssSearch.js is linked in your search page, but if you're having problems and would like to debug with more human-friendly JavaScript, you can switch that link to point to ssSearch-debug.js.
We are still experimenting with the options and affordances of the Closure compiler, in the interests of finding the best balance between file size and performance.
@@ -1227,7 +1344,7 @@ssVerbose property to true at the command line:
- ant -DssConfig=cfg.xml -DssVerbose=true ⚓Note that verbosity settings persist after creating the initial config; so, if you +
ant -DssConfig=cfg.xml -DssVerbose=true ⚓Note that verbosity settings persist after creating the initial config; so, if you are trying to debug just the tokenization process, you must make sure to run the config target beforehand: -
ant config tokenize -DssConfig=cfg.xml -DssVerbose=true ⚓+
ant config tokenize -DssConfig=cfg.xml -DssVerbose=true ⚓
export ANT_OPTS="-Xmx4g"; ant -DjavaFork=false -DssConfigFile=/absolute/path/to/your/config.xml⚓How much memory you can and should provide to Ant depends on your particular system +
export ANT_OPTS="-Xmx4g"; ant -DjavaFork=false -DssConfigFile=/absolute/path/to/your/config.xml⚓How much memory you can and should provide to Ant depends on your particular system and the size of the document collection. See Ant's documentation for some further examples and explanation. The javaFork parameter prevents calls to Java processes (such as Saxon) from forking into a new Java VM, which allows them to take advantage of the expanded memory you have assigned to Ant.
If you are a programmer, you may want to write your own code to interact with staticSearch + in some way. To make it easier to work with the JavaScript that runs on the search + page, the StaticSearch object dispatches a number of CustomEvents which you can listen + for in your own code. These are the events:
+ssInstantiated: This event is dispatched at the end of the constructor for the StaticSearch object,
+ indicating that it is instantiated and you may access its properties and methods.ssJsonRetrieved: This is dispatched when the StaticSearch object has completed its initial retrieval
+ of the core JSON required to handle searches.ssSearchStarting: This is dispatched when the user has initiated a search operation.ssFormCleared: This is dispatched when the user has cleared the search form.ssSearchCompleted: This is dispatched when a search has been completed and the results displayed on
+ the search page. This event carries additional information in its detail property: an integer which is the count of the number of hit documents found.You can use these events in your code in the following way:
+var showMessage = function(){ alert('The StaticSearch instance has been constructed!'); } window.addEventListener('ssInstantiated', showMessage); ⚓
+ In the case of the ssSearchCompleted event, you could do this:
var showMessage = function(evt){ alert(evt.detail.hits + ' documents found!'); } window.addEventListener('ssSearchCompleted', showMessage); ⚓
+ Windows support has been added courtesy of a pull request from Tony Graham, and we - will continue to test and maintain it from now on.
+staticSearch 2.0 contains breaking changes and improvements to staticSearch. In particular, the configuration file has been significantly re-organized; configuration files made for earlier versions of staticSearch will not work in 2.0.
@@ -1295,7 +1449,7 @@ssVerbose property in ant. To get debugging messages, set the ssVerbose parameter to true (other
accepted values: t, yes, y, 1)
- ant -DssConfig=cfg.xml -DssVerbose=true ⚓+
ant -DssConfig=cfg.xml -DssVerbose=true ⚓
.fidLink{ display:none; } ⚓
+ .fidLink{ display:none; } ⚓
Minor enhancement:
Bug fix:
Bug fix:
Deprecations requiring changes to existing projects:
concordance target in ant:
- ant concordance -DssConfig=cfg.xml⚓+
ant concordance -DssConfig=cfg.xml⚓
Bug fixes:
All issues and tickets related to version 1.4 can be found on GitHub.
Note that version 1.2 was withdrawn in favour of version 1.3, so the list below includes changes from the original version 1.2 and the current 1.3.
Deprecations requiring changes to existing projects:
@@ -1460,13 +1614,13 @@json.xsl, which has also improved the build performance slightly.clean step of the build process.New features and enhancements:
+<content> <sequence minOccurs="1" maxOccurs="1"> <elementRef key="params"/> @@ -1701,18 +1855,18 @@+ ⚓Appendix A.1.1 <con <elementRef key="filters" minOccurs="0"/> </sequence> </content> - ⚓
+
element config
{
attribute version { text }?,
( params, rules?, contexts?, excludes?, filters? )
-}⚓
+}⚓
+<content> <empty/> </content> - ⚓+ ⚓
+
element context
{
att.match.attributes,
att.labelled.attributes,
attribute context { text }?,
empty
-}⚓
+}⚓
| <contexts> (The set of context elements that identify contexts for keyword-in-context fragments.) | +<contexts> (The set of context that identify contexts for keyword-in-context fragments.) | |||
| Namespace | @@ -1873,19 +2027,19 @@||||
| Content model |
- +<content> - <elementRef key="context" minOccurs="1" + <elementRef key="context" minOccurs="0" maxOccurs="unbounded"/> </content> - ⚓+ ⚓ |
|||
| Schema Declaration |
-
-element contexts { context+ }⚓
+
+element contexts { context* }⚓
|
|||
<createContexts> is a boolean parameter that specifies whether you want the indexer to store keyword-in-context - extracts for each of the hits in a document. This increases the size of the index, - but of course it makes for much more user-friendly search results; instead of seeing - just a score for each document found, the user will see a series of short text strings - with the search keyword(s) highlighted.
Note that contexts are necessary for phrasal searching or wildcard searching.
+<content> - <dataRef name="boolean"/> + <empty/> </content> - ⚓+ ⚓
-element createContexts { xsd:boolean }⚓
+
+element createContexts
+{
+ (
+ ( attribute create { "false" }? )
+ | (
+ attribute create { "true" }?,
+ attribute phrasalSearch { text }?,
+ attribute wildcardSearch { text }?,
+ attribute maxKwicsToHarvest { text }?,
+ attribute maxKwicLength { text }?,
+ attribute kwicTruncateString { text }?
+ )
+ ),
+ empty
+}⚓
| <dictionaryFile> (The relative path (from the config file) to a dictionary file (one word per line) - which will be used to check tokens when indexing.) | +<dictionary> (Specifies a dictionary against which tokens may be checked during indexing.) | ||||||||||
| Namespace | @@ -1971,6 +2137,40 @@Module | ss — Schema specification and tag documentation | |||||||||
| Attributes | +
+
+
+
|
+ ||||||||||
| Contained by |
@@ -1983,11 +2183,7 @@ Appendix A.1.5 <dic | ||||||||||
| May contain | -
-
-
- XSD anyURI
- |
+ Empty element | |||||||||
| Note | @@ -1997,25 +2193,26 @@|||||||||||
| Content model |
- +<content> - <dataRef name="anyURI"/> + <empty/> </content> - ⚓+ ⚓ |
||||||||||
| Schema Declaration |
-
-element dictionaryFile { xsd:anyURI }⚓
+
+element dictionary { attribute file { text }, empty }⚓
|
||||||||||
+<content> <empty/> </content> - ⚓+ ⚓
+
element exclude
{
att.match.attributes,
attribute type { "index" | "filter" },
empty
-}⚓
+}⚓
+<content> <elementRef key="exclude" minOccurs="1" maxOccurs="unbounded"/> </content> - ⚓+ ⚓
-element excludes { exclude+ }⚓
+
+element excludes { exclude+ }⚓
+<content> <elementRef key="span" minOccurs="1" maxOccurs="unbounded"/> </content> - ⚓+ ⚓
-element filter { attribute filterName { text }, span+ }⚓
+
+element filter { attribute filterName { text }, span+ }⚓
+<content> <elementRef key="filter" minOccurs="1" maxOccurs="unbounded"/> </content> - ⚓+ ⚓
-element filters { filter+ }⚓
+
+element filters { filter+ }⚓
| <kwicTruncateString> (The string that will be used to signal ellipsis at the beginning and end of a keyword-in-context - extract. Conventionally three periods, or an ellipsis character (which is the default - value).) | +<index> (Configures options relating to indexing.) | ||||||||||
| Namespace | @@ -2340,6 +2535,44 @@Module | ss — Schema specification and tag documentation | |||||||||
| Attributes | +
+
+
+
|
+ ||||||||||
| Contained by |
@@ -2352,452 +2585,17 @@ Appendix A.1.10 <kw | ||||||||||
| May contain | -Character data only | +Empty element | |||||||||
| Note | -
- The only reason you might need to specify a value for this parameter is if the language - of your search page conventionally uses a different ellipsis character. Japanese, - for example, uses the 3-dot-leader character. - |
-
| Content model | -
- -<content> - <textNode/> -</content> - ⚓- |
-
| Schema Declaration | -
-
-element kwicTruncateString { text }⚓
- |
-
| <linkToFragmentId> (Whether to link keyword-in-context extracts to the nearest id in the document. Default - is true.) | -|
| Namespace | -http://hcmc.uvic.ca/ns/staticSearch | -
| Module | -ss — Schema specification and tag documentation | -
| Contained by | -
- —
- |
-
| May contain | -
-
-
- XSD boolean
- |
-
| Note | -
- <linkToFragmentId> is a boolean parameter that specifies whether you want the search engine to link - each keyword-in-context extract with the closest element that has an id. If the element has an ancestor with an id, then the indexer will associate that keyword-in-context extract with that id; if there are no suitable ancestor elements that have an id, then the extract is associated with first preceding element with an id. -We strongly recommend that you ensure your target documents have id attributes for - any significant divisions so that this parameter can be used effectively. With lots - of ids throughout your documents, and this parameter turned on, each keyword-in-context - in the results page will be linked directly to the section of the document in which - the hit appears, making the search results much more useful. - |
-
| Content model | -
- -<content> - <dataRef name="boolean"/> -</content> - ⚓- |
-
| Schema Declaration | -
-
-element linkToFragmentId { xsd:boolean }⚓
- |
-
| <maxKwicsToHarvest> (This controls the maximum number of keyword-in-context extracts that will be stored - for each term in a document.) | -|
| Namespace | -http://hcmc.uvic.ca/ns/staticSearch | -
| Module | -ss — Schema specification and tag documentation | -
| Contained by | -
-
-
-
-
- ss: params
- |
-
| May contain | -
-
-
- XSD nonNegativeInteger
- |
-
| Note | -
- <maxKwicsToHarvest> controls the number of keyword-in-context extracts that will be harvested from the - data for each term in a document. For example, if a user searches for the word ‘elephant’, and it occurs 27 times in a document, but the <maxKwicsToHarvest> value is set to 5, then only the first five (sorted in document order) of these keyword-in-context - strings will be stored in the index. (This does not affect the score of the document - in the search results, of course.) If you set this to a low number, the size of the - JSON files will be constrained, but of course the user will only be able to see the - KWICs that have been harvested in their search results. -If <phrasalSearch> is set to true, the <maxKwicsToHarvest> setting is ignored, because phrasal searches will only work properly if all contexts - are stored. - |
-
| Content model | -
- -<content> - <dataRef name="nonNegativeInteger"/> -</content> - ⚓- |
-
| Schema Declaration | -
-
-element maxKwicsToHarvest { xsd:nonNegativeInteger }⚓
- |
-
| <maxKwicsToShow> (This controls the maximum number of keyword-in-context extracts that will be shown - in the search page for each hit document returned.) | -|
| Namespace | -http://hcmc.uvic.ca/ns/staticSearch | -
| Module | -ss — Schema specification and tag documentation | -
| Contained by | -
-
-
-
-
- ss: params
- |
-
| May contain | -
-
-
- XSD nonNegativeInteger
- |
-
| Note | -
- A user may search for multiple common words, so hundreds of hits could be found in - a single document. If the keyword-in-context strings for all these hits are shown - on the results page, it would be too long and too difficult to navigate. This setting - controls how many of those hits you want to show for each document in the result set. - |
-
| Content model | -
- -<content> - <dataRef name="nonNegativeInteger"/> -</content> - ⚓- |
-
| Schema Declaration | -
-
-element maxKwicsToShow { xsd:nonNegativeInteger }⚓
- |
-
| <minWordLength> (The minimum length of a term to be indexed. Default is 3 characters.) | -|
| Namespace | -http://hcmc.uvic.ca/ns/staticSearch | -
| Module | -ss — Schema specification and tag documentation | -
| Contained by | -
-
-
-
-
- ss: params
- |
-
| May contain | -
-
-
- XSD nonNegativeInteger
- |
-
| Note | -
- <minWordLength> specifies the minimum length in characters of a sequence of text that will be considered - to be a word worth indexing. The default is 3, on the basis that in most European - languages, words of one or two letters are typically not worth indexing, being articles, - prepositions and so on. If you set this to a lower limit for reasons specific to your - project, you should ensure that your stopword list excludes any very common words - that would otherwise make the indexing process lengthy and increase the index size. - |
-
| Content model | -
- -<content> - <dataRef name="nonNegativeInteger"/> -</content> - ⚓- |
-
| Schema Declaration | -
-
-element minWordLength { xsd:nonNegativeInteger }⚓
- |
-
| <outputFolder> (The name of the output folder into which the index data and JavaScript will be placed - in the site search. This should conform with the XML Name specification.) | -|
| Namespace | -http://hcmc.uvic.ca/ns/staticSearch | -
| Module | -ss — Schema specification and tag documentation | -
| Contained by | -
-
-
-
-
- ss: params
- |
-
| May contain | -
-
-
- XSD NCName
- |
-
| Note | -
- When the staticSearch build process creates its output, many files need to be added - to the website for which an index is being created. For convenience, all of these - files are stored in a single folder. This element is used to specify the name of that - folder. The default is staticSearch, but if you would prefer something else, you can specify it here. You may also use - this element if you are defining two different searches within the same site, so that - their files are kept in different locations. - |
-
| Content model | -
- -<content> - <dataRef name="NCName"/> -</content> - ⚓- |
-
| Schema Declaration | -
-
-element outputFolder { xsd:NCName }⚓
- |
-
| <params> (Element containing most of the settings which enable the Generator to find the target - website content and process it appropriately.) | -|
| Namespace | -http://hcmc.uvic.ca/ns/staticSearch | -
| Module | -ss — Schema specification and tag documentation | -
| Contained by | -
-
-
-
-
- ss: config
- |
-
| May contain | -- - | -
| Content model | -
- -<content> - <elementRef key="searchFile"/> - <elementRef key="versionFile" - minOccurs="0"/> - <elementRef key="stemmerFolder" - minOccurs="0"/> - <elementRef key="recurse"/> - <elementRef key="minWordLength" - minOccurs="0"/> - <elementRef key="scoringAlgorithm" - minOccurs="0"/> - <elementRef key="phrasalSearch" - minOccurs="0"/> - <elementRef key="wildcardSearch" - minOccurs="0"/> - <elementRef key="createContexts" - minOccurs="0"/> - <elementRef key="maxKwicsToHarvest" - minOccurs="0"/> - <elementRef key="maxKwicsToShow" - minOccurs="0"/> - <elementRef key="totalKwicLength" - minOccurs="0"/> - <elementRef key="kwicTruncateString" - minOccurs="0"/> - <elementRef key="stopwordsFile" - minOccurs="1" maxOccurs="1"/> - <elementRef key="dictionaryFile" - minOccurs="1" maxOccurs="1"/> - <elementRef key="outputFolder" - minOccurs="0"/> - <elementRef key="resultsPerPage" - minOccurs="0"/> - <elementRef key="resultsLimit" - minOccurs="0"/> -</content> - ⚓- |
-
| Schema Declaration | -
-
-element params { }⚓
- |
-
| <phrasalSearch> (Whether or not to support phrasal searches. If this is true, then the maxContexts - setting will be ignored, because all contexts are required to properly support phrasal - search.) | +<output> (Sets the folder into which the index data and JavaScript will be placed.) | ||
| Namespace | @@ -2807,6 +2605,46 @@Module | ss — Schema specification and tag documentation | |
| Attributes | ++ + | +||
| Contained by |
@@ -2819,50 +2657,46 @@ Appendix A.1.17 <ph | ||
| May contain | -
-
-
- XSD boolean
- |
+ Empty element | |
| Note |
- Phrasal search functionality enables your users to search for specific phrases by
- surrounding them with quotation marks ( However, if your site is very large and your user base is unlikely to use phrasal - searching, it may not be worth the additional build time and increased index size. +When the staticSearch build process creates its output, many files need to be added + to the website for which an index is being created. For convenience, all of these + files are stored in a single folder. This element is used to specify the name of that + folder. The default is staticSearch, but if you would prefer something else, you can specify it here. You may also use + this element if you are defining two different searches within the same site, so that + their files are kept in different locations. |
||
| Content model |
- +<content> - <dataRef name="boolean"/> + <empty/> </content> - ⚓+ ⚓ |
||
| Schema Declaration |
-
-element phrasalSearch { xsd:boolean }⚓
+
+element output { attribute dir { text }, empty }⚓
|
||
| <recurse> (Whether to recurse into subdirectories of the collection directory or not.) | +<params> (Element containing most of the settings which enable the Generator to find the target + website content and process it appropriately.) | ||
| Namespace | @@ -2877,7 +2711,7 @@@@ -2886,40 +2720,48 @@ | May contain |
-
XSD boolean
+
|
| Content model |
- +<content> - <dataRef name="boolean"/> + <elementRef key="searchPage"/> + <elementRef key="index" minOccurs="0"/> + <elementRef key="stopwords" minOccurs="0"/> + <elementRef key="dictionary" minOccurs="0"/> + <elementRef key="tokenizer" minOccurs="0"/> + <elementRef key="scoringAlgorithm" + minOccurs="0"/> + <elementRef key="stemmer" minOccurs="0"/> + <elementRef key="createContexts" + minOccurs="0"/> + <elementRef key="results" minOccurs="0"/> + <elementRef key="version" minOccurs="0"/> + <elementRef key="output"/> </content> - ⚓+ ⚓ |
||
| Schema Declaration |
-
-element recurse { xsd:boolean }⚓
+
+element params { }⚓
|
||
| <resultsLimit> (The maximum number of results that can be returned for any search before returning - an error; if the number of documents in a result set exceeds this number, then staticSearch - will not render the results and will provide a message saying that the search returned - too many results. This is usually set to 2000 by default, but you may want to have - a higher or lower limit, depending on the specific structure of your project.) | +<results> (Controls the configuration of the results page.) | |||||||||||||||||||||||||||||||
| Namespace | @@ -2930,71 +2772,117 @@ss — Schema specification and tag documentation | |||||||||||||||||||||||||||||||
| Contained by | -
-
-
-
-
- ss: params
- |
- |||||||||||||||||||||||||||||||
| May contain | +Attributes |
-
- XSD nonNegativeInteger
+
+
|
||||||||||||||||||||||||||||||
| Note | -
- This configuration option is meant to prevent errors for sites where a given set of - filters or search terms can return a set of document that can cause a browser's rendering - engine to fail. For smaller collections, it's unlikely that this limit would ever - be reached, but setting a limit may be helpful for large document collections, projects - that want to constrain the number of possible results, or projects with memory-intensive - or complex rendering. - |
- |||||||||||||||||||||||||||||||
| Content model | -
- -<content> - <dataRef name="nonNegativeInteger"/> -</content> - ⚓- |
- |||||||||||||||||||||||||||||||
| Schema Declaration | -
-
-element resultsLimit { xsd:nonNegativeInteger }⚓
- |
- |||||||||||||||||||||||||||||||
| <resultsPerPage> (The maximum number of document results to be displayed per page. All results are displayed - by default; setting resultsPerPage to a positive integer creates a Show More/Show - All widget at the bottom of the batch of results.) | -||
| Namespace | -http://hcmc.uvic.ca/ns/staticSearch | -|
| Module | -ss — Schema specification and tag documentation | -|
| Contained by |
@@ -3007,47 +2895,13 @@ Appendix A.1.20 <re | |
| May contain | -
-
-
- XSD nonNegativeInteger
- |
- |
| Note | -
- For most sites, where the number of results is likely to be in the low thousands, - it's perfectly practical to show all the results at once, because the staticSearch - processor is so fast. However, if you have tens of thousands of documents, and it's - possible that users will do (for example) filter-only searches that retrieve a large - proportion of them, you can constrain the number of results which are shown initially - using this setting. All the results are still generated and output to the page, but - since most of them are hidden until the ‘Show More’ or ‘Show All’ button is clicked, - the browser will render them much more quickly. - |
- |
| Content model | -
- -<content> - <dataRef name="nonNegativeInteger"/> -</content> - ⚓- |
- |
| Schema Declaration | -
-
-element resultsPerPage { xsd:nonNegativeInteger }⚓
- |
+ Empty element |
| Content model |
- +<content> <empty/> </content> - ⚓+ ⚓ |
||
| Schema Declaration |
-
-element rule { att.match.attributes, attribute weight { text }, empty }⚓
+
+element rule { att.match.attributes, attribute weight { text }, empty }⚓
|
| <scoringAlgorithm> (The scoring algorithm to use for ranking keyword results. Default is "raw" (i.e. weighted - counts)) | +<scoringAlgorithm> (The scoring algorithm to use for ranking keyword results.) | ||||||
| Namespace | @@ -3210,6 +3063,40 @@Module | ss — Schema specification and tag documentation | |||||
| Attributes | +
+
+
+
|
+ ||||||
| Contained by |
@@ -3228,73 +3115,37 @@ Appendix A.1.23 <sc
| ||||||
| Content model |
- +<content> - <valList type="closed"> - <valItem ident="raw"> - <desc>raw score</desc> - <gloss>Default: Calculate the score based off of the weighted number of - instances of a term in a text.</gloss> - </valItem> - <valItem ident="tf-idf"> - <gloss>Calculate the score based off of the tf-idf scoring algorithm.</gloss> - </valItem> - </valList> + <empty/> </content> - ⚓Legal values are: -
|
||||||
| Schema Declaration |
-
-element scoringAlgorithm { "raw" | "tf-idf" }⚓Legal values are:
-
+element scoringAlgorithm { attribute name { "raw" | "tf-idf" }?, empty }⚓
|
||||||
| <searchFile> (The search file (aka page) that will be the primary access point for the staticSearch. - Note that this page must be at the root of the collection directory.) | +<searchPage> (The search page that will be the primary access point for staticSearch. This page + may or may not exist, but its location is used for determining the collection that + will be indexed, so it must be at the root of the collection directory.) | ||||||||
| Namespace | @@ -3304,6 +3155,36 @@Module | ss — Schema specification and tag documentation | |||||||
| Attributes | +
+
+
+
|
+ ||||||||
| Contained by |
@@ -3316,11 +3197,7 @@ Appendix A.1.24 <se | ||||||||
| May contain | -
-
-
- XSD anyURI
- |
+ Empty element | |||||||
| Note | @@ -3332,25 +3209,25 @@|||||||||
| Content model |
- +<content> - <dataRef name="anyURI"/> + <empty/> </content> - ⚓+ ⚓ |
||||||||
| Schema Declaration |
-
-element searchFile { xsd:anyURI }⚓
+
+element searchPage { attribute file { text }, empty }⚓
|
||||||||
| Content model |
- +<content> <alternate minOccurs="1" maxOccurs="unbounded"> @@ -3418,26 +3295,26 @@+ ⚓ |
||
| Schema Declaration |
-
-element span { attribute lang { text }?, ( anyElement_span_1* | text )+ }⚓
+
+element span { attribute lang { text }?, ( anyElement_span_1* | text )+ }⚓
|
| <stemmerFolder> (The name of a folder inside the staticSearch /stemmers/ folder, in which the JavaScript - and XSLT implementations of stemmers can be found. If left blank, then the staticSearch + | <stemmer> (The name of a folder inside the staticSearch /stemmers/ folder, in which the JavaScript + and XSLT implementations of stemmers can be found. If not specified, then the staticSearch default English stemmer (en) will be used.) | ||||||||||
| Module | ss — Schema specification and tag documentation | ||||||||||
| Attributes | +
+
+
+
|
+ ||||||||||
| Contained by |
@@ -3460,11 +3382,7 @@ Appendix A.1.26 <st | ||||||||||
| May contain | -
-
-
- XSD NCName
- |
+ Empty element | |||||||||
| Note | @@ -3499,30 +3417,39 @@|||||||||||
| Content model |
- +<content> - <dataRef name="NCName"/> + <empty/> </content> - ⚓+ ⚓ |
||||||||||
| Schema Declaration |
-
-element stemmerFolder { xsd:NCName }⚓
+
+element stemmer
+{
+ attribute dir
+ {
+ "stemmers/en/"
+ | "stemmers/fr/"
+ | "stemmers/identity"
+ | "stemmers/stripDiacritics"
+ },
+ empty
+}⚓
|
||||||||||
| <stopwordsFile> (The relative path (from the config file) to a text file containing a list of stopwords - (words to be ignored when indexing).) | +<stopwords> (Specifies a list of stopwords--that is, words to be ignored when indexing.) | ||||||||||
| Namespace | @@ -3532,6 +3459,41 @@Module | ss — Schema specification and tag documentation | |||||||||
| Attributes | +
+
+
+
|
+ ||||||||||
| Contained by |
@@ -3544,11 +3506,7 @@ Appendix A.1.27 <st | ||||||||||
| May contain | -
-
-
- XSD anyURI
- |
+ Empty element | |||||||||
| Note | @@ -3556,10 +3514,11 @@|||||||||||
| Content model |
- +<content> - <dataRef name="anyURI"/> + <empty/> </content> - ⚓+ ⚓ |
||||||||||
| Schema Declaration |
-
-element stopwordsFile { xsd:anyURI }⚓
+
+element stopwords { attribute file { text }, empty }⚓
|
||||||||||
| <totalKwicLength> (If createContexts is set to true, then this parameter controls the length (in words) - of the harvested keyword-in-context string.) | +<tokenizer> (Configures options for the tokenizing process.) | ||||||||||
| Namespace | @@ -3600,6 +3558,46 @@Module | ss — Schema specification and tag documentation | |||||||||
| Attributes | +
+
+
+
|
+ ||||||||||
| Contained by |
@@ -3612,49 +3610,18 @@ Appendix A.1.28 <to | ||||||||||
| May contain | -
-
-
- XSD nonNegativeInteger
- |
- ||||||||||
| Note | -
- Obviously, the longer the keyword-in-context strings are, the larger the individual - index files will be, but the more useful the KWICs will be for users looking at the - search results. Note that the phrasal searching relies on the KWICs and thus longer - KWICs allow for longer phrasal searches. - |
- ||||||||||
| Content model | -
- -<content> - <dataRef name="nonNegativeInteger"/> -</content> - ⚓- |
- ||||||||||
| Schema Declaration | -
-
-element totalKwicLength { xsd:nonNegativeInteger }⚓
- |
+ Empty element | |||||||||
| <versionFile> (The relative path to a text file containing a single version identifier (such as 1.5, - 123456, or 06ad419). This will be used to create unique filenames for JSON resources, - so that the browser will not use cached versions of older index files.) | +<version> (Specifies the unique version to append to the index, so that the browser will not + use cached versions of older index files.) | ||||||||
| Namespace | @@ -3664,6 +3631,37 @@Module | ss — Schema specification and tag documentation | |||||||
| Attributes | +
+
+
+
|
+ ||||||||
| Contained by |
@@ -3676,16 +3674,12 @@ Appendix A.1.29 <ve | ||||||||
| May contain | -
-
-
- XSD anyURI
- |
+ Empty element | |||||||
| Note |
- <versionFile> enables you to specify the path to a plain-text file containing a simple version + <version> enables you to specify the path to a plain-text file containing a simple version number for the project. This might take the form of a software-release-style version number such as 1.5, or it might be a Subversion revision number or a Git commit hash. It should not contain any spaces or punctuation. If you provide a version file, the version string @@ -3700,83 +3694,18 @@ Appendix A.1.29 <ve
Content model |
- |
- -<content> - <dataRef name="anyURI"/> -</content> - ⚓- Schema Declaration |
-
- |
-
-element versionFile { xsd:anyURI }⚓
- | ||||
| <wildcardSearch> (Whether or not to support wildcard searches. Note that wildcard searches are more - effective when phrasal searching is also turned on, because the contexts available - for phrasal searches are also used to provide wildcard results.) | -|
| Namespace | -http://hcmc.uvic.ca/ns/staticSearch | -
| Module | -ss — Schema specification and tag documentation | -
| Contained by | -
-
-
-
-
- ss: params
- |
-
| May contain | -
-
-
- XSD boolean
- |
-
| Note | -
- Wildcard searching can coexist with stemmed searching, but it is especially useful - when stemming is not available, either because there is no available stemmer for the - language of the site, or because the site contains multiple languages. Unless your - site is particularly large, we recommend turning on wildcard searching, and therefore - also phrasal searching (<phrasalSearch>). - |
-
| Content model | -
- +<content> - <dataRef name="boolean"/> + <empty/> </content> - ⚓+ ⚓ |
| Schema Declaration |
-
-element wildcardSearch { xsd:boolean }⚓
+
+element version { attribute file { text }, empty }⚓
|
When describing a <context>, the label attribute names a component of the page that can be searched within (see 8.5.5 Specifying searchable contexts (search only in)).
+When describing a <context>, the label attribute names a component of the page that can be searched within (see 8.5.6 Specifying searchable contexts (search only in)).
| ssdata.boolean (Custom boolean datatype, which restricts boolean values to true or false for ease + of processing.) | +|
| Module | +ss — Schema specification and tag documentation | +
| Used by | +
+ Element:
+
|
+
| Content model | +
+ +<content> + <valList> + <valItem ident="true"/> + <valItem ident="false"/> + </valList> +</content> + ⚓+ |
+
| Declaration | +
+ +ssdata.boolean = "true" | "false"⚓+ |
+
For examples of full configuration files, see the staticSearch GitHub repository as well as the list of projects in
The configuration element has a root
The
The
The
Note that all output files will be in a directory that is a sibling to the search page. For instance, in a document collection that looks something like: +
The
Note that all output files will be in a directory that is a sibling to the search page.
+ For instance, in a document collection that looks something like:
@@ -530,15 +538,6 @@
We also require the
Finally, in order to support stemming and phrasal search effectively, it is important
- to specify a xsl folder, but you will probably want to create or customize the
- stopword list for your own project. You may also supply empty text files for these parameters
- if for example you donʼt want to use a stoplist at all.
A complex site may have two or more search pages targeting
specific types of document or content, each of which may need
its own particular search controls and indexes. This can easily
- be achieved by specifying a different
For these searches to be different from each other,
they will also probably have different contexts and rules. For
@@ -838,7 +844,7 @@
You can customize this CSS by providing your own CSS that overrides it, using Searching
loading dialog, rely on rules included in the base staticSearch CSS;
if you do remove or disable the CSS, then some features may not work properly.
Note that once your file has been processed and all this content has been added, @@ -915,7 +921,7 @@
After indexing your HTML files, the staticSearch build then generates an HTML report of helpful statistics
- and diagnostics about your document collection, which can be found in the directory specified by
Obviously you can implement this any way you like (or just ignore it), but
we also supply a small demonstration JavaScript library which implements this
functionality, called
If you are a programmer, you may want to write your own code to interact with staticSearch in some way. + To make it easier to work with the JavaScript that runs on the search page, the StaticSearch object dispatches + a number of CustomEvents which you can listen for in your own code. These are the events:
+ssInstantiated: This event is dispatched at the end of the constructor for the
+ StaticSearch object, indicating that it is instantiated and you may access its properties and methods.ssJsonRetrieved: This is dispatched when the StaticSearch object has completed its initial
+ retrieval of the core JSON required to handle searches.ssSearchStarting: This is dispatched when the user has initiated a search operation.ssFormCleared: This is dispatched when the user has cleared the search form.ssSearchCompleted: This is dispatched when a search has been completed and the results
+ displayed on the search page. This event carries additional information in its detail property:
+ an integer which is the count of the number of hit documents found.You can use these events in your code in the following way:
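As a minimal sketch of listening for the events listed above: the helper name `watchStaticSearch` and its `target` parameter are hypothetical, and the sketch assumes that the events can be observed on an `EventTarget` such as `document`, and that `detail` on `ssSearchCompleted` is the hit-document count described above. Check your own build before relying on either assumption.

```javascript
// Illustrative only: `target` is assumed to be the object on which the
// StaticSearch events are dispatched (for example `document`).
function watchStaticSearch(target, log = console.log) {
  // Events that carry no extra information.
  const simpleEvents = [
    'ssInstantiated',
    'ssJsonRetrieved',
    'ssSearchStarting',
    'ssFormCleared',
  ];
  for (const name of simpleEvents) {
    target.addEventListener(name, () => log(`${name} fired`));
  }
  // ssSearchCompleted carries the number of hit documents in event.detail.
  target.addEventListener('ssSearchCompleted', (event) => {
    log(`Search completed: ${event.detail} document(s) found`);
  });
}
```

On a search page this might be called as `watchStaticSearch(document)`, replacing the logging with whatever your own code needs to do.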
+ +In the case of the ssSearchCompleted event, you could do this:
Windows support has been added courtesy of a pull request from Tony Graham, and we will continue - to test and maintain it from now on.
+ +staticSearch 2.0 contains breaking changes and improvements to staticSearch. In particular,
the configuration file has been significantly re-organized; configuration files made for earlier
versions of staticSearch
The rule element is used to identify nodes in the XHTML document collection which should be
- treated in a special manner when indexed; either they might be ignored (if
The search page is a regular HTML page which forms part of your site. The only
+ important characteristic it must have is a
This is useful for static sites that create nested + directory structures (such as those generated from Jekyll or WordPress).
+When the indexer is extracting keyword-in-context strings for each word, it uses a common-sense
- approach based on common element definitions, so that for example when it reaches the end of a paragraph,
- it will not continue into the next paragraph to get more context words. You may have special runs of
- text in your document collection which do not appear to be bounding contexts, but actually are; for
- example, you may have span elements with
A stopword is a word that will not be indexed, because it is too
+ common (
staticSearch provides a default set of common stopwords for English, which
+ you'll find in
The indexing process checks each word as it builds the index, and keeps a record
+ of all words which are not found in the configured dictionary. Though this does not have
+ any direct effect in the indexing process, all words not found in the dictionary are listed
+ in the staticSearch report (see
staticSearch provides a default dictionary in
When describing a
Values of 3 or above may be useful for European languages to exclude + common prepositions, articles, et cetera. If you set this to a lower + limit for reasons specific to your project, you should ensure that your + stopword list excludes any very common words that would otherwise make + the indexing process lengthy and increase the index size.
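To illustrate how a minimum word length and a stopword list interact, here is a rough sketch. It is not the staticSearch indexer (which is not written in JavaScript); the function name `indexableTokens` and its option names are hypothetical and chosen only for this example.

```javascript
// Illustrative filter: keep tokens that are long enough and are not stopwords.
function indexableTokens(tokens, { minWordLength, stopwords = [] }) {
  const stop = new Set(stopwords.map((w) => w.toLowerCase()));
  return tokens.filter((token) => {
    // Count code points rather than UTF-16 units so that non-BMP
    // characters are not counted twice.
    const length = [...token].length;
    return length >= minWordLength && !stop.has(token.toLowerCase());
  });
}
```

With a limit of 3, one- and two-letter words never reach the index; with a lower limit, the stopword list has to do that work instead.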
The search page is a regular HTML page which forms part of your site. The only
- important characteristic it must have is a
The raw score is simply the sum of all instances of a term
+ (optionally multiplied by a configured weight via the
+
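The raw score described above can be sketched as a simple weighted sum. This is an illustration under assumptions, not the staticSearch implementation: the function name `rawScore` is hypothetical, and each instance of a term is assumed to carry a weight of 1 unless a rule assigns a different one.

```javascript
// Illustrative raw score: one entry per instance of the term in a document,
// each entry being the weight that applies to that instance.
function rawScore(instanceWeights) {
  return instanceWeights.reduce((sum, weight) => sum + weight, 0);
}

// Hypothetical example: four unweighted instances plus one instance
// inside an element whose rule has weight 2.
const exampleScore = rawScore([1, 1, 1, 1, 2]);
```

In this hypothetical case the weighted instance counts double, so five instances yield a score of 6 rather than 5.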
The tf-idf algorithm (term frequency-inverse document frequency)
+ computes the mathematical relevance of a term within a document relative to the rest
+ of the document collection. The staticSearch implementation of tf-idf basically follows the textbook definition of tf-idf:
+
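For readers unfamiliar with tf-idf, the textbook calculation can be sketched as follows. This is only an illustration: the function and parameter names are hypothetical, the natural logarithm is an assumption, and the staticSearch code may differ in details such as weighting or normalization.

```javascript
// Textbook tf-idf: term frequency within the document multiplied by the
// inverse document frequency across the collection.
function tfIdf(instancesInDoc, totalTermsInDoc, totalDocs, docsWithTerm) {
  // Guard against division by zero for empty documents or unseen terms.
  if (totalTermsInDoc === 0 || docsWithTerm === 0) return 0;
  const tf = instancesInDoc / totalTermsInDoc;
  const idf = Math.log(totalDocs / docsWithTerm);
  return tf * idf;
}
```

A term that appears in every document gets an inverse document frequency of zero, so it contributes nothing to the ranking, whereas a term concentrated in a few documents scores highly in those documents.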
/stemmers/ folder,
- in which the JavaScript and XSLT implementations
- of stemmers can be found. If left blank, then the staticSearch default English
- stemmer (en) will be used.en)
+ will be used.
+ The staticSearch project currently has only two real stemmers:
an implementation of the Porter 2 algorithm for modern English, and
@@ -1850,7 +1959,6 @@
be adding more stemmers as the project develops. However, if your
document collection is not English or French, you have a couple of options, one
hard and one easy.
-
-
-
-
Setting
Phrasal search functionality enables your users to search for specific phrases
+ by surrounding them with quotation marks ("), as in many search engines. In order
+ to support this kind of search,
However, if your site is very large and your user base is unlikely to + use phrasal searching, it may not be worth the additional build time and + increased index size.
+Wildcard searches are + more effective when phrasal searching is also turned on, because the contexts + available for phrasal searches are also used to provide wildcard results.
+Wildcard searching can coexist with stemmed searching, but it is especially
+ useful when stemming is not available, either because there is no available stemmer
+ for the language of the site, or because the site contains multiple languages.
+ Unless your site is particularly large, we recommend turning on wildcard searching,
+ and therefore also phrasal searching (
For example, if a user
+ searches for the word elephant
, and it occurs 27 times in a document, but the
+
If
The longer the keyword-in-context strings are, the larger the individual index + files will be, but the more useful the KWICs will be for users looking at the search results. + Note that phrasal searching relies on the KWICs, so longer KWICs allow for longer + phrasal searches.
+This parameter is particularly useful + if the language of your search page conventionally uses a different ellipsis + character. Japanese, for example, uses the 3-dot-leader character.
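The interplay between the KWIC length and the truncation string can be sketched as below. This is a simplified illustration, not the staticSearch harvesting code: the function name `buildKwic` is hypothetical, the length is assumed to be counted in words, and the real indexer also respects context boundaries, which this sketch ignores.

```javascript
// Illustrative KWIC builder: take up to maxKwicLength words around the hit
// and mark any truncated end with kwicTruncateString.
function buildKwic(words, hitIndex, maxKwicLength, kwicTruncateString = '…') {
  const half = Math.floor((maxKwicLength - 1) / 2);
  let start = Math.max(0, hitIndex - half);
  const end = Math.min(words.length, start + maxKwicLength);
  // If the window ran off the end of the text, extend it backwards.
  start = Math.max(0, end - maxKwicLength);
  const prefix = start > 0 ? kwicTruncateString : '';
  const suffix = end < words.length ? kwicTruncateString : '';
  return prefix + words.slice(start, end).join(' ') + suffix;
}
```

A site in a language with a different ellipsis convention would pass its own string as the last argument instead of relying on the default.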
+
-
Note that contexts are necessary for phrasal searching or wildcard searching.
For most sites, where the number of results is likely to be in the low thousands,
+ it's perfectly practical to show all the results at once, because the staticSearch
+ processor is so fast. However, if you have tens of thousands of documents, and it's
+ possible that users will do (for example) filter-only searches that retrieve a
+ large proportion of them, you can constrain the number of results which are shown
+ initially using this setting. All the results are still generated and output to
+ the page, but since most of them are hidden until the
This configuration option is meant to prevent errors for sites where a given set of + filters or search terms can return a set of documents that can cause a browser's rendering + engine to fail. For smaller collections, it's unlikely + that this limit would ever be reached, but setting a limit may be helpful + for large document collections, projects that want to constrain the number + of possible results, or projects with memory-intensive or complex rendering.
+This is set to 2000 by default, but you may want to have a higher or lower limit, + depending on the specific structure of your project.
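The combined effect of the per-page setting and the results limit can be sketched as follows. The function name `planResults` and its option names are hypothetical and used only for illustration; the sketch assumes that a per-page value of 0 means "show everything", which mirrors the behaviour described above but is not taken from the staticSearch source.

```javascript
// Illustrative decision logic for rendering a result set.
function planResults(hitCount, { resultsPerPage = 0, resultsLimit = 2000 } = {}) {
  // Too many hits: render nothing and report the problem instead.
  if (hitCount > resultsLimit) {
    return { tooMany: true, visible: 0, hidden: 0 };
  }
  // A positive resultsPerPage shows the first batch and hides the rest
  // until the user asks for more.
  const visible = resultsPerPage > 0 ? Math.min(hitCount, resultsPerPage) : hitCount;
  return { tooMany: false, visible, hidden: hitCount - visible };
}
```

Note that hidden results are still part of the page in this model; only their initial visibility changes, which is why the limit remains useful as a separate safeguard.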
+This should conform with the + XML Name specification.
+We strongly recommend that you ensure your target documents have id attributes for any significant divisions - so that this parameter can be used effectively. With lots of ids throughout your documents, and this parameter - turned on, each keyword-in-context in the results page will be linked directly to the section of the - document in which the hit appears, making the search results much more useful.
+When the staticSearch build process creates its output, many files need to be
+ added to the website for which an index is being created. For convenience, all of
+ these files are stored in a single folder. This element is used to specify the
+ name of that folder. The default is
Obviously, the longer the keyword-in-context strings are, the larger the individual index - files will be, but the more useful the KWICs will be for users looking at the search results. - Note that the phrasal searching relies on the KWICs and thus longer KWICs allow for longer - phrasal searches.
-elephant
, and it occurs 27 times in a document, but the
-
If
When describing a
A user may search for multiple common words, so hundreds of hits could be found in - a single document. If the keyword-in-context strings for all these hits are shown on - the results page, it would be too long and too difficult to navigate. This setting - controls how many of those hits you want to show for each document in the result set.
-The only reason you might need to specify a value for this parameter is - if the language of your search page conventionally uses a different ellipsis - character. Japanese, for example, uses the 3-dot-leader character.
-Phrasal search functionality enables your users to search for specific phrases
- by surrounding them with quotation marks ("), as in many search engines. In order
- to support this kind of search,
However, if your site is very large and your user base is unlikely to - use phrasal searching, it may not be worth the additional build time and - increased index size.
-Wildcard searching can coexist with stemmed searching, but it is especially
- useful when stemming is not available, either because there is no available stemmer
- for the language of the site, or because the site contains multiple languages.
- Unless your site is particularly large, we recommend turning on wildcard searching,
- and therefore also phrasal searching (
For most sites, where the number of results is likely to be in the low thousands,
- it's perfectly practical to show all the results at once, because the staticSearch
- processor is so fast. However, if you have tens of thousands of documents, and it's
- possible that users will do (for example) filter-only searches that retrieve a
- large proportion of them, you can constrain the number of results which are shown
- initially using this setting. All the results are still generated and output to
- the page, but since most of them are hidden until the
The rule element is used to identify nodes in the XHTML document collection which should be
+ treated in a special manner when indexed; either they might be ignored (if
This configuration option is meant to prevent errors for sites where a given set of filters or search terms - can return a set of document that can cause a browser's rendering engine to fail. For smaller collections, it's unlikely - that this limit would ever be reached, but setting a limit may be helpful for large document collections, projects that want to constrain the number - of possible results, or projects with memory-intensive or complex rendering.
-A stopword is a word that will not be indexed, because it is too
- common (
When the indexer is extracting keyword-in-context strings for each word, it uses a common-sense
+ approach based on common element definitions, so that, for example, when it reaches the end of a paragraph,
+ it will not continue into the next paragraph to get more context words. You may have special runs of
+ text in your document collection which do not appear to be bounding contexts, but actually are; for
+ example, you may have span elements with
The indexing process checks each word as it builds the index, and keeps a record
- of all words which are not found in the configured dictionary. Though this does not have
- any direct effect in the indexing process, all words not found in the dictionary are listed
- in the staticSearch report (see
When the staticSearch build process creates its output, many files need to be
- added to the website for which an index is being created. For convenience, all of
- these files are stored in a single folder. This element is used to specify the
- name of that folder. The default is