Classification Engine

WinnowTag edited this page Sep 14, 2010 · 1 revision

The Classification Engine is the core component within the classifier. It manages an internal queue of classification jobs and a number of worker threads that process those jobs. It also maintains an index of current jobs referenced by their job id.

Typically, a job is added to the classification engine by the HTTP server front end. When this happens, the job is assigned a UUID as its id and added to the back of the job queue. When a worker thread becomes available, it takes the job at the front of the queue. Using the tag URL specified in the job, the worker fetches the tagger from the Tagger Cache and uses it to classify each item in the Item Cache. Once all the items have been classified, the worker filters out the items whose classification probability is below the probability threshold (default of 0.9) and sends the remaining items and their classification probabilities to the call-back URL provided by the tagger. The worker thread is then finished with the job and can move on to the next job in the queue.
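The job lifecycle described above can be sketched in a few lines of Python. This is only an illustration of the queue, worker and job-index arrangement, not the classifier's actual code: the `Tagger` class, the `fetch_tagger` and `send` callables, and the plain-dict Item Cache are all hypothetical stand-ins, and error reporting is left out.

```python
import queue
import threading
import uuid
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tagger:
    """Hypothetical stand-in for an entry in the Tagger Cache."""
    classify: Callable[[object], float]  # item content -> probability
    callback_url: str                    # where the results are sent


class ClassificationEngine:
    def __init__(self, items, fetch_tagger, send, threshold=0.9, workers=1):
        self.items = items                # stand-in for the Item Cache: {item_id: content}
        self.fetch_tagger = fetch_tagger  # assumed callable: tag URL -> Tagger
        self.send = send                  # assumed callable: (callback_url, results)
        self.threshold = threshold        # 0.9 is the default quoted in the text
        self.jobs = {}                    # index of current jobs, keyed by job id
        self.lock = threading.Lock()      # guards self.jobs
        self.pending = queue.Queue()      # FIFO job queue
        for _ in range(workers):
            threading.Thread(target=self._work, daemon=True).start()

    def add_job(self, tag_url):
        job_id = str(uuid.uuid4())        # each job gets a UUID as its id
        with self.lock:
            self.jobs[job_id] = tag_url
        self.pending.put(job_id)          # added to the back of the queue
        return job_id

    def _work(self):
        while True:
            job_id = self.pending.get()   # front of the queue; blocks when empty
            try:
                with self.lock:
                    tag_url = self.jobs[job_id]
                tagger = self.fetch_tagger(tag_url)
                scores = {i: tagger.classify(c) for i, c in self.items.items()}
                # Drop items classified below the threshold.
                kept = {i: p for i, p in scores.items() if p >= self.threshold}
                self.send(tagger.callback_url, kept)
            except Exception:
                pass                      # error reporting is omitted in this sketch
            finally:
                with self.lock:
                    self.jobs.pop(job_id, None)  # the job is no longer current
                self.pending.task_done()
```

A caller would construct the engine once, call `add_job(tag_url)` for each request, and use the returned id to look the job up in `jobs` while it is still queued or running.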

Different Job Types

There are two different job types within the classification engine:

  • All items job: This type of job classifies all the items in the in-memory item cache. It is the default job type.
  • New items job: This type of job classifies only the items that have been added to the in-memory cache since the last time the given tag was classified. This is the type of job that is created in response to cache-updating operations.
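The difference between the two job types comes down to which items are selected for classification. The sketch below assumes, purely for illustration, that each cached item records when it was added and that the engine remembers when each tag was last classified. The names `JobType` and `select_items` are hypothetical, and falling back to all items for a tag that has never been classified is an assumption rather than documented behaviour.

```python
from enum import Enum


class JobType(Enum):
    ALL_ITEMS = "all"   # the default job type
    NEW_ITEMS = "new"   # created in response to cache updates


def select_items(items, job_type, last_classified):
    """Return the ids of the items a job should classify.

    items: list of (item_id, added_at) pairs from the in-memory cache.
    last_classified: when the tag was last classified, or None if never.
    """
    if job_type is JobType.ALL_ITEMS or last_classified is None:
        # Assumption: a tag with no previous run is classified against everything.
        return [item_id for item_id, _ in items]
    return [item_id for item_id, added_at in items if added_at > last_classified]
```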

Error Reporting

Errors are reported using error constants and error messages. The following error constants are supported:

  • NO_SUCH_TAG: The tag could not be found or the tag document could not be retrieved.
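One way to pair an error constant with a human-readable message is sketched below. Only the constant name NO_SUCH_TAG comes from this page. The `ErrorCode` and `JobError` types, the `fetch_tagger` helper and the dict-based tagger cache are hypothetical, and the message wording is invented for the example.

```python
from enum import Enum


class ErrorCode(Enum):
    NO_SUCH_TAG = "NO_SUCH_TAG"


class JobError(Exception):
    """Carries an error constant together with its message (hypothetical type)."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code
        self.message = message


def fetch_tagger(tag_url, tagger_cache):
    # tagger_cache is assumed to be a plain dict keyed by tag URL.
    try:
        return tagger_cache[tag_url]
    except KeyError:
        raise JobError(ErrorCode.NO_SUCH_TAG, f"tag not found: {tag_url}") from None
```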

Parameters

The following command-line parameters affect the behaviour of the classification engine:


-n, --worker-threads N
                    number of threads for processing jobs
                    Default: 1

-t, --positive-threshold N
                    probability threshold for considering a tag to be
                    applied to an item
                    Default: 0

    --performance-log FILE
                    location of the file in which to write job timings

    --tag-index URL
                    URL which provides an index of the tags to classify

    --missing-item-timeout N
                    Number of seconds to wait for missing items to be added
                    before a job depending on them is canceled and an error
                    is returned instead
                    Default: 60 seconds
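For readers who want to mirror these options in a wrapper script or test harness, the listing above maps onto Python's `argparse` roughly as follows. This is an illustrative sketch rather than the classifier's own option parsing: the `build_parser` name and the Python value types are assumptions, while the option names and defaults are copied from the listing.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Classification engine options (illustrative sketch)")
    parser.add_argument("-n", "--worker-threads", type=int, default=1, metavar="N",
                        help="number of threads for processing jobs")
    # Default of 0 is taken from the listing above.
    parser.add_argument("-t", "--positive-threshold", type=float, default=0.0,
                        metavar="N",
                        help="probability threshold for considering a tag to be "
                             "applied to an item")
    parser.add_argument("--performance-log", metavar="FILE",
                        help="location of the file in which to write job timings")
    parser.add_argument("--tag-index", metavar="URL",
                        help="URL which provides an index of the tags to classify")
    parser.add_argument("--missing-item-timeout", type=int, default=60, metavar="N",
                        help="seconds to wait for missing items before a job "
                             "depending on them is canceled")
    return parser
```

Calling `build_parser().parse_args(["-n", "4", "-t", "0.9"])` yields a namespace with `worker_threads`, `positive_threshold`, `performance_log`, `tag_index` and `missing_item_timeout` attributes.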
