Antinous producers
In addition to providing a general way to plug Jython code into PFA applications, Antinous produces models. Only k-means has been implemented.
Antinous producers adhere to the following suite of abstract interfaces in com.opendatagroup.antinous.producer.
- A `Dataset` is a source of training data, filled in Jython and used by the producer to make a model. It has at least these methods:
  - `revert(): Unit` empties the `Dataset`
- A `Model` is what a producer makes, something that can be converted into PFA. It has at least these methods:
  - `pfa: AnyRef` makes a PFA cell or pool item representing the model using `JsonObject`, `JsonArray`, and primitive types
  - `pfa(options: java.util.Map[String, AnyRef]): AnyRef` makes PFA with options (probably coming from Jython)
  - `avroType: AvroType` declares the Avro type of the PFA cell or pool item
- A `ModelRecord` extends `Model` and Scala's `Product` so that it can be a case class
- A `Producer[D <: Dataset, M <: Model]` uses a `Dataset` to produce a `Model`. It has at least these methods:
  - `dataset: D` the dataset
  - `optimize(): Unit` updates the state of the producer in-place to improve the model (possibly many times)
  - `model: M` get the current state of the model
- A `JsonObject[X]` is a `java.util.Map[String, X]` for representing `Model` data as PFA
- A `JsonArray[X]` is a `java.util.List[X]` for representing `Model` data as PFA
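To make the division of labor between these interfaces concrete, here is a minimal pure-Python sketch of the same pattern. It does not use the Antinous classes: `ListDataset`, `MeanModel`, and `MeanProducer` are hypothetical names invented for illustration, and the "model" is just a running estimate of the mean. Plain dicts stand in for `JsonObject`.

```python
class ListDataset(object):
    """Hypothetical stand-in for a Dataset: a container of training values."""
    def __init__(self):
        self.values = []

    def add(self, x):
        self.values.append(float(x))

    def revert(self):
        # Plays the role of revert(): empties the dataset.
        self.values = []


class MeanModel(object):
    """Hypothetical stand-in for a Model: something convertible to a PFA-like fragment."""
    def __init__(self, mean, count):
        self.mean = mean
        self.count = count

    def pfa(self, options=None):
        # Build the fragment from dicts and primitives only.
        out = {"mean": self.mean}
        if options and options.get("count"):
            out["count"] = self.count
        return out


class MeanProducer(object):
    """Hypothetical stand-in for a Producer: refines its state on each optimize()."""
    def __init__(self, dataset):
        self.dataset = dataset
        self._estimate = 0.0

    def optimize(self):
        # One in-place refinement step: move halfway toward the sample mean.
        target = sum(self.dataset.values) / len(self.dataset.values)
        self._estimate += 0.5 * (target - self._estimate)

    def model(self):
        # Snapshot of the current state.
        return MeanModel(self._estimate, len(self.dataset.values))


dataset = ListDataset()
for x in [1.0, 2.0, 3.0]:
    dataset.add(x)

producer = MeanProducer(dataset)
for _ in range(20):          # optimize() may be called many times
    producer.optimize()

fragment = producer.model().pfa({"count": True})
```

The point of the split is that the dataset can be filled incrementally, the producer can be stepped as often as needed, and the model is only serialized at the end.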
The package also has a random number seed, which is used to randomize all producer algorithms. It can be set via `setRandomSeed(x: Long)`.
The usual procedure is to create a concrete Dataset in the global Jython namespace and fill it in the action phase, then create a Producer from that Dataset, run optimize() to make a Model and emit PFA in the end phase.
Here is an example that builds a k-means clustering model for one key in a Hadoop reducer (one segment of the whole model).

    from antinous import *
    from com.opendatagroup.antinous.producer.kmeans import VectorSet, KMeans

    input = record(key = string, value = array(double))
    output = record(segment = string,
                    clusters = array(record(center = array(double),
                                            weight = double)))

    segment = None
    vectorSet = VectorSet()

    # action phase: collect one point per input record
    def action(input):
        global segment, vectorSet
        segment = input.key
        vectorSet.add(input.value)

    # end phase: build the model from the collected points and emit it as PFA
    def end():
        if segment is not None:
            kmeans = KMeans(3, vectorSet)
            kmeans.optimize()
            emit({"segment": segment, "clusters": kmeans.model().pfa()})

In package com.opendatagroup.antinous.producer.kmeans,
- `VectorSet` is a `Dataset` with an `add(pos: java.lang.Iterable[Double], weight: Double)` method for adding points with optional weights.
- `ClusterSet(clusters: java.util.List[Cluster])` is a `Model`
- `Cluster(center: java.util.List[Double], weight: Double, covariance: java.util.List[java.util.List[Double]])` is a `ModelRecord` that takes options
  - `weight`: if true, show the weight
  - `covariance`: if true, show the covariance
  - `totalVariance`: if true, show the total variance
  - `determinant`: if true, show the determinant
  - `limitDimensions`: if a list of integers, only present the dimensions specified in `covariance`, `totalVariance`, and `determinant`
- `KMeans(numberOfClusters: Int, dataset: VectorSet)` is a `Producer[VectorSet, ClusterSet]` with the following methods:
  - `model: ClusterSet`
  - `metric: Metric` and `setMetric(m: Metric)`
  - `stoppingCondition: StoppingCondition` and `setStoppingCondition(s: StoppingCondition)`
  - `randomClusters()`: pick random initial clusters (done automatically by constructor)
  - `optimize()` and `optimize(subsampleSize: Int)` to perform k-means on a random subset, using the `metric` and stopping when `stoppingCondition` is met.
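For readers unfamiliar with what a k-means producer does between `randomClusters()` and the final `model`, the following is a minimal sketch of the standard Lloyd iteration in plain Python. It is not the Antinous implementation: the function name `kmeans`, its parameters, and the default tolerance are assumptions made for illustration, and the result is a list of dicts shaped like the `center`/`weight` records in the example above.

```python
import random


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(points, k, metric=squared_euclidean,
           max_iterations=100, tolerance=1e-12, seed=12345):
    """Hypothetical helper: textbook Lloyd iteration over a list of points."""
    rng = random.Random(seed)
    # Random initial clusters: k distinct points drawn from the data.
    centers = [list(p) for p in rng.sample(points, k)]
    groups = [[] for _ in range(k)]

    for iteration in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: metric(p, centers[i]))
            groups[nearest].append(p)

        # Update step: move each center to the mean of its points,
        # recording how far each coordinate moved.
        changes = []
        for i, group in enumerate(groups):
            if not group:
                changes.append([0.0] * len(centers[i]))
                continue
            new_center = [sum(col) / float(len(group)) for col in zip(*group)]
            changes.append([abs(a - b) for a, b in zip(new_center, centers[i])])
            centers[i] = new_center

        # Stopping condition: every coordinate moved less than the tolerance.
        if all(c < tolerance for row in changes for c in row):
            break

    # Weight of a cluster here is simply the number of points assigned to it.
    return [{"center": c, "weight": float(len(g))}
            for c, g in zip(centers, groups)]


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
clusters = kmeans(points, 2)
```

The metric and the stopping test are the two pluggable parts of the loop, which is why `KMeans` exposes them through `setMetric` and `setStoppingCondition`.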
Metrics adhere to interface Metric and can be constructed with:
- `Euclidean`
- `SquaredEuclidean`
- `Chebyshev`
- `Taxicab`
- `Minkowski(p: Double)`
- `M(f: PyFunction)` where `f` is any Jython function that takes two Python lists of numbers
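The named metrics correspond to standard distance functions. As a rough illustration, the sketch below gives their textbook definitions in plain Python. It is not the Antinous code, the lowercase function names are hypothetical, and whether the built-in metrics treat edge cases the same way is not stated here. A function of this shape (two lists of numbers in, presumably a single number out) is the kind of thing `M(f)` is described as wrapping.

```python
def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def squared_euclidean(x, y):
    # Same ordering as euclidean, without the square root.
    return sum((a - b) ** 2 for a, b in zip(x, y))


def chebyshev(x, y):
    # Largest single-coordinate difference.
    return max(abs(a - b) for a, b in zip(x, y))


def taxicab(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(p):
    # Returns a metric for the given exponent p (p = 1 is taxicab, p = 2 is Euclidean).
    def metric(x, y):
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
    return metric
```

Since squared Euclidean distance preserves the ordering of Euclidean distance, either one picks the same nearest center; the squared form just skips the square root.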
Stopping conditions adhere to interface StoppingCondition and can be constructed with:
- `MaxIterations(max: Int)` triggers when the iteration number reaches or exceeds a given maximum
- `Moving` triggers when all changes are below a threshold of 1e-15
- `BelowThreshold(threshold: Double)`
- `HalfBelowThreshold(threshold: Double)` triggers when half the clusters' changes are below a given threshold
- `WhenAll(conditions: java.lang.Iterable[StoppingCondition])` triggers when all subconditions are met
- `WhenAny(conditions: java.lang.Iterable[StoppingCondition])` triggers when any subconditions are met
- `PrintValue(numberFormat: String = "%g")` does not actually stop iteration, but prints out the current values
- `PrintValue(numberFormat: String = "%g")` does not actually stop iteration, but prints out the last changes
- `S(f: PyFunction)` where `f` is a Python function that takes
  - iteration number (`int`)
  - model (`ClusterSet`)
  - changes (`list` of `list`s of numbers)
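A custom stopping condition is just a function of the three arguments listed above. The sketch below shows, in plain Python, how such functions can be written and combined. It assumes that the function returns a true value when iteration should stop and that `changes` holds one list of numbers per cluster; neither assumption is confirmed here, and the helper names `max_iterations`, `all_below`, and `when_any` are hypothetical.

```python
def max_iterations(maximum):
    # Stop once the iteration number reaches or exceeds the maximum.
    def condition(iteration, model, changes):
        return iteration >= maximum
    return condition


def all_below(threshold):
    # Stop once every coordinate of every cluster moved less than the threshold.
    def condition(iteration, model, changes):
        return all(abs(c) < threshold for row in changes for c in row)
    return condition


def when_any(conditions):
    # Stop as soon as any one of the given conditions says so.
    conditions = list(conditions)
    def condition(iteration, model, changes):
        return any(c(iteration, model, changes) for c in conditions)
    return condition


# Stop after 10 iterations or when all changes are below 0.01,
# whichever comes first. The model argument is unused by these conditions.
stop = when_any([max_iterations(10), all_below(0.01)])
```

Combining an iteration cap with a convergence test is a common safeguard, since a threshold alone may never trigger on awkward data.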
Licensed under the Hadrian Personal Use and Evaluation License (PUEL).