Plugin with additional Metafacture commands used in linked-swissbib workflows
To use the plugins with a standalone instance of Metafacture, you have to build a "Fat Jar". To do so, run the following commands in the root directory:
# Clone standalone instance of Metafacture
git clone https://github.com/linked-swissbib/mfWorkflows
# Clone this repository
git clone https://github.com/linked-swissbib/swissbib-metafacture-commands
cd swissbib-metafacture-commands
# Build fat jar
./gradlew clean shadow # For *nix-OSes, otherwise use gradlew.bat
# Move fat jar to plugins folder of mfWorkflows
mv build/libs/swissbibMF-plugins-1.1-all.jar ../mfWorkflows/plugins
There is an experimental Docker image available which provides a standalone Metafacture instance including the linked-swissbib plugins.
docker pull sschuepbach/mfrunner-sb-5
For further instructions see here
There are only a few unit tests available (hopefully there will be more in the near future...). To run them type
./gradlew clean check
The commands are divided into several categories:
- Decoders:
- decode-json: Parses JSON files
- decode-ntriples: Parses Ntriples files
- handle-marcxml-sb: Directly transforms MARC-XML fields to CSV rows like record-id,field,indicator1,indicator2,subfield,value
- handle-marcxml-sru: Handles MARC-XML files received from the SRU interface of Swissbib
- Pipe:
- encode-esbulk: Encodes data as JSON-LD or in a special format suitable for bulk indexing in Elasticsearch
- encode-neo4j: Encodes data as csv files suitable for batch uploads to a Neo4j database
- encode-ntriples: Encodes data as Ntriples
- ext-filter: Extends the default filter command in Flux by providing a parameter to implement a "filter not" mechanism
- itemerase-es: Deletes items which belong to a certain bibliographicResource
- lookup-es: Filters out records whose identifier already exists in an Elasticsearch index
- split-entities: Splits entities into individual records.
- update-es-id: Identifies partially modified documents by comparing them to an Elasticsearch index.
- Writers:
- index-esbulk: Uses the bulk mechanisms of Elasticsearch to index records
- index-neo4j: Indexes nodes and relationships in Neo4j
- write-csv: Serialises data as CSV file with optional header.
- write-esbulk: Writes records as JSON files which comply with the requirements of the Bulk API of Elasticsearch.
- write-kafka: Acts as a producer in a Kafka cluster.
- write-neo4j: Writes csv files for batch uploading to a new Neo4j database.
- write-rdf-1line: Writes RDF-XML files, one line per record.
- write-socket: Sets up a socket server.
- Source:
- read-kafka: Acts as a Kafka Consumer for Metafacture
- open-multi-http: Opens HTTP resources in a "paging" manner, e.g. to fetch data from a database in chunks
- Record Splitters:
- read-json-object: Reads in a JSON file and splits it at the end of the root object / array.
- Morph Functions:
- AuthorHash: Creates a hash value for authors based on different MARC fields.
- ItemHash: Creates a hash value for items based on different MARC fields.
Creates a hash value for authors based on different MARC fields.
- Implementation: org.swissbib.linked.mf.morph.functions.AuthorHash
Resources:
- Morph definition for an author name in field 100
- Morph definition for an author name in field 700
- Morph definition for an organisation name in field 710
- Morph definition for an organisation name in field 711
Parses JSON. Preferably used in conjunction with read-json-object
- Implementation: org.swissbib.linked.mf.decode.JsonDecoder
- In: java.io.Reader
- Out: org.culturegraph.mf.framework.StreamReceiver
- Option:
- nullValues: Set if null values should not be returned as empty strings.
Parses Ntriples-encoded records.
- Implementation: org.swissbib.linked.mf.decoder.NtriplesDecoder
- In: java.io.Reader
- Out: org.culturegraph.mf.framework.StreamReceiver
- Options:
- unicodeEscapeSeq: "true", "false"
- keepLanguageTags: "true", "false". If set to true (the default), the tag is separated from the literal by two #.
- keepTypeAnnotations: "true", "false". If set to true (the default), the annotation is separated from the literal by two #.
Example: linked-swissbib "EnrichedLine"
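To make the keepLanguageTags and keepTypeAnnotations options more concrete, here is a minimal Python sketch. It is an illustration only, not the plugin's Java code. It assumes that the literal comes first and the tag or annotation follows the two # characters; the function name and the simplified pattern (no blank nodes) are hypothetical.

```python
import re

# Simplified pattern for one N-Triples statement: <subject> <predicate> object .
# It handles IRIs and plain, language-tagged or typed literals, not blank nodes.
TRIPLE = re.compile(
    r'^<(?P<s>[^>]*)>\s+<(?P<p>[^>]*)>\s+'
    r'(?:<(?P<iri>[^>]*)>|"(?P<lit>(?:[^"\\]|\\.)*)"'
    r'(?:@(?P<lang>[A-Za-z0-9-]+)|\^\^<(?P<dtype>[^>]*)>)?)\s*\.\s*$'
)


def decode_line(line, keep_language_tags=True, keep_type_annotations=True):
    """Return (subject, predicate, value) for one N-Triples line."""
    m = TRIPLE.match(line)
    if m is None:
        raise ValueError("not a supported N-Triples line: " + line)
    if m.group("iri") is not None:
        return m.group("s"), m.group("p"), m.group("iri")
    value = m.group("lit")
    # Assumption: the literal comes first, then "##", then the tag or type.
    if m.group("lang") and keep_language_tags:
        value += "##" + m.group("lang")
    if m.group("dtype") and keep_type_annotations:
        value += "##" + m.group("dtype")
    return m.group("s"), m.group("p"), value
```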
Encodes records for bulk uploading to Elasticsearch.
- Implementation: org.swissbib.linked.mf.pipe.ESBulkEncoder
- In: org.culturegraph.mf.framework.StreamReceiver
- Out: java.lang.String
- Options:
- avoidMergers: If set to true, fields with same keys are modelled as separate inner objects instead of having their values merged (Boolean; default: false)
- header: Should header for ES bulk be written (Boolean; default: true)? Warning: Setting this parameter to false will result in an invalid Bulk format!
- escapeChars: Escapes prohibited characters in JSON strings (Boolean; default: true)
- index: Index name of records
- type: Type name of records
Example: linked-swissbib "Baseline"
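The Bulk API of Elasticsearch expects newline-delimited JSON in which each document is preceded by an action line. The following Python sketch illustrates that layout and why switching off the header yields an invalid bulk body. It is not the encoder's actual code; the function name and the choice of the "index" action are assumptions.

```python
import json


def encode_bulk(record_id, record, index, doc_type, header=True):
    """Serialise one record as Elasticsearch bulk lines (action line + source line)."""
    lines = []
    if header:
        # Action/metadata line that tells Elasticsearch where to index the document.
        action = {"index": {"_index": index, "_type": doc_type, "_id": record_id}}
        lines.append(json.dumps(action))
    # json.dumps escapes characters that are not allowed inside JSON strings.
    lines.append(json.dumps(record, ensure_ascii=False))
    # The Bulk API expects every line, including the last one, to end with a newline.
    return "\n".join(lines) + "\n"
```

With header=False only the source line is produced, which Elasticsearch cannot interpret as a bulk request on its own.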
Encodes records as csv files for batch uploading them to a new Neo4j-database. As the headers of the csv files are hardcoded, it is not ready to be used in a broader context.
- Implementation: org.swissbib.linked.mf.pipe.NeoEncoder
- In: org.culturegraph.mf.framework.StreamReceiver
- Out: java.lang.String
Example: Graph visualisation of the GND
Encodes data as Ntriples
- Implementation: org.swissbib.linked.mf.pipe.NtriplesEncoder
- In: org.culturegraph.mf.framework.StreamReceiver
- Out: java.lang.String
Example: Libadmin entries as Ntriples
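As a rough illustration of what encoding data as Ntriples involves, the following Python sketch serialises one statement per line and escapes the characters that may not appear raw inside a literal. The function names are hypothetical and the sketch does not mirror the plugin's Java implementation.

```python
def escape_literal(text):
    """Escape the characters that must not appear raw in an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def encode_triple(subject, predicate, obj, obj_is_iri=False):
    """Serialise one statement as a single N-Triples line."""
    if obj_is_iri:
        rendered = "<" + obj + ">"
    else:
        rendered = '"' + escape_literal(obj) + '"'
    return "<{}> <{}> {} .".format(subject, predicate, rendered)
```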
Extends the default filter command in Flux by providing a parameter to implement a "filter not" mechanism
- Implementation: org.swissbib.linked.mf.pipe.ExtFilter
- In: org.culturegraph.mf.framework.StreamReceiver
- Out: org.culturegraph.mf.framework.StreamReceiver
- Option:
- filterNot: If set to true, returns only records which don't match a given criterion (Boolean; default: false)
Example: Show record ids which don't have a title (MARC field 245$a)
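The "filter not" idea can be sketched in a few lines of Python. This is only an illustration of the semantics, with records modelled as hypothetical dictionaries; in Metafacture the criterion would be expressed differently.

```python
def ext_filter(records, matches, filter_not=False):
    """Yield records that match the predicate, or those that don't if filter_not is set."""
    for record in records:
        if matches(record) != filter_not:
            yield record


# Hypothetical records: MARC field 245$a is modelled as the key "245a".
records = [
    {"id": "r1", "245a": "Faust"},
    {"id": "r2"},
]
has_title = lambda record: "245a" in record
without_title = [r["id"] for r in ext_filter(records, has_title, filter_not=True)]
```

Here without_title contains only the ids of records that lack a title, while the same call without filter_not keeps the records that have one.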
Directly transforms MARC-XML fields to CSV rows like record-id,field,indicator1,indicator2,subfield,value
- Implementation: org.swissbib.linked.mf.decoder.MarcXmlSbHandler
- In: org.culturegraph.mf.framework.XmlReceiver
- Out: java.lang.String
Example: 1:1 transformation of MARC-XML to CSV
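The row layout record-id,field,indicator1,indicator2,subfield,value can be illustrated with a small Python sketch based on the standard library. It is not the handler's Java code. It assumes that the record id comes from control field 001 and that only data fields are emitted; the function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def marcxml_to_rows(xml_text):
    """Yield (record-id, field, indicator1, indicator2, subfield, value) tuples."""
    root = ET.fromstring(xml_text)
    for record in root.iter("{http://www.loc.gov/MARC21/slim}record"):
        # Assumption: the record id is taken from control field 001.
        record_id = record.findtext("marc:controlfield[@tag='001']", "", MARC_NS)
        for field in record.findall("marc:datafield", MARC_NS):
            for sub in field.findall("marc:subfield", MARC_NS):
                yield (record_id, field.get("tag"), field.get("ind1"),
                       field.get("ind2"), sub.get("code"), sub.text or "")


def rows_to_csv(rows):
    """Serialise the tuples as CSV text, quoting values where necessary."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```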
Handles MARC-XML files received from the SRU interface of Swissbib
- Implementation: org.swissbib.linked.mf.decoder.MarcXmlSruHandler
- In: org.culturegraph.mf.framework.XmlReceiver
- Out: org.culturegraph.mf.framework.StreamReceiver
Indexes records in Elasticsearch.
- Implementation: org.swissbib.linked.mf.pipe.ESBulkIndexer
- In: java.lang.Object
- Out: java.lang.Void
- Options:
- esClustername: Elasticsearch cluster name
- recordsPerUpload: Number of records per single bulk upload
- esNodes: Elasticsearch nodes. Nodes are separated by #
Example: linked-swissbib "Baseline"
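The following Python sketch shows how a #-separated esNodes value and the recordsPerUpload setting could be interpreted. It is an illustration under assumptions: the host:port notation for a single node and both function names are hypothetical and not taken from the plugin.

```python
def parse_nodes(es_nodes):
    """Split the #-separated esNodes value into (host, port) pairs."""
    nodes = []
    for entry in es_nodes.split("#"):
        # Assumption: each node is written as host:port.
        host, _, port = entry.partition(":")
        nodes.append((host, int(port)))
    return nodes


def batches(records, records_per_upload):
    """Group records into lists of at most records_per_upload items."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == records_per_upload:
            yield batch
            batch = []
    # Flush the remaining records that did not fill a complete batch.
    if batch:
        yield batch
```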
Indexes fields in Neo4j. Because the selection of the fields which are to be indexed is hardcoded, the benefit of this command outside our admittedly narrow scope is somewhat limited.
- Implementation: org.swissbib.linked.mf.writer.NeoIndexer
- In: org.culturegraph.mf.framework.StreamReceiver
- Out: org.culturegraph.mf.framework.StreamReceiver
- Options:
- batchSize: Size of batch upload for Neo4j
- dbDir: Path to Neo4j database
Creates a hash value for items based on different MARC fields.
- Implementation: org.swissbib.linked.mf.morph.functions.ItemHash
Resource: Morph definition which uses the item hash generator
Deletes items which belong to a certain bibliographicResource. Recommended for internal use only. Intended for use with the tracking framework of linked-swissbib.
- Implementation: org.swissbib.linked.mf.pipe.ESItemErase
- In: org.culturegraph.mf.framework.StreamReceiver
- Out: org.culturegraph.mf.framework.StreamReceiver
- Options:
- esClustername: Elasticsearch cluster name
- esNodes: Elasticsearch nodes. Nodes are separated by #
- esIndex: Elasticsearch index
- esType: Document type for Elasticsearch
Filters out records whose identifier already exists in an Elasticsearch index. Intended for use with the tracking framework of linked-swissbib.
- Implementation: org.swissbib.linked.mf.pipe.ESLookup
- In: org.culturegraph.mf.framework.StreamReceiver
- Out: org.culturegraph.mf.framework.StreamReceiver
- Options:
- esClustername: Elasticsearch cluster name
- esNodes: Elasticsearch nodes. Nodes are separated by #
- esIndex: Elasticsearch index
- esType: Document type for Elasticsearch
Opens HTTP resources in a "paging" manner, e.g. to fetch data from a database in chunks. You have to define two variable parts in the URL: ${cs}, which sets the chunk size, and ${pa}, which sets the offset.
- Implementation: org.swissbib.linked.mf.source.MultiHttpOpener
- In: java.lang.String
- Out: java.io.Reader
- Options:
- accept: The accept header in the form type/subtype, e.g. text/plain.
- encoding: The encoding is used to encode the output and is passed as Accept-Charset to the http connection.
- lowerBound: Initial offset
- upperBound: Limit
- chunkSize: Number of documents to be downloaded in a single retrieval
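How the ${cs} and ${pa} placeholders interact with lowerBound, upperBound and chunkSize can be sketched as follows in Python. The sketch assumes that upperBound is exclusive and that the offset advances by chunkSize per request; the function name and the example URL are hypothetical.

```python
def paged_urls(template, lower_bound, upper_bound, chunk_size):
    """Yield one URL per chunk by filling in the ${cs} and ${pa} placeholders."""
    offset = lower_bound
    # Assumption: upperBound is exclusive and the offset advances by chunkSize.
    while offset < upper_bound:
        yield (template.replace("${cs}", str(chunk_size))
                       .replace("${pa}", str(offset)))
        offset += chunk_size
```

For a template such as http://example.org/docs?size=${cs}&from=${pa} with bounds 0 and 250 and a chunk size of 100, this produces three URLs with offsets 0, 100 and 200.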
Acts as a Kafka consumer for Metafacture
- Implementation: org.swissbib.linked.mf.source.MfKafkaConsumer
- In: java.lang.String
- Out: java.io.Reader
- Options:
- topics: Kafka topics (separated by #)
- groupId: Kafka group identifier
Reads in a JSON file and splits it at the end of the root object / array. Preferably used in conjunction with decode-json
- Implementation: org.swissbib.linked.mf.io.JsonObjectReader
- In: java.io.Reader
- Out: java.lang.String
Example: libadmin entries as Ntriples
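Splitting a stream at the end of each root object or array can be illustrated with Python's standard JSON decoder. This is a sketch of the idea, assuming the input is a sequence of root values separated by optional whitespace; it is not the reader's actual implementation and the function name is hypothetical.

```python
import json


def split_json_objects(text):
    """Yield each top-level JSON object or array found in text as a string."""
    decoder = json.JSONDecoder()
    position = 0
    while position < len(text):
        # Skip whitespace between root values; raw_decode does not accept it.
        if text[position].isspace():
            position += 1
            continue
        # raw_decode returns the parsed value and the index where it ended.
        _, end = decoder.raw_decode(text, position)
        yield text[position:end]
        position = end
```

Each yielded string is a complete JSON document that could then be handed to a decoder such as decode-json.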
Splits entities into individual records.
- Implementation: org.swissbib.linked.mf.pipe.EntitySplitter
- In: org.culturegraph.mf.framework.StreamReceiver
- Out: org.culturegraph.mf.framework.StreamReceiver
- Option:
- entityBoundary: Node depth for entity splitting
Example: linked-swissbib "Baseline"
Identifies partially modified documents by comparing them to an Elasticsearch index. It is tailored to the so-called baseline workflow of linked-swissbib, so it is probably of little use for other purposes.
- Implementation: org.swissbib.linked.mf.pipe.ESIdUpdate
- In: org.culturegraph.mf.framework.StreamReceiver
- Out: org.culturegraph.mf.framework.StreamReceiver
- Options:
- esClustername: Elasticsearch cluster name
- esNodes: Elasticsearch nodes. Nodes are separated by #
- esIndex: Elasticsearch index
- esType: Document type for Elasticsearch
- matchingFields: Fields which should be matched. # is delimiter.
- sThreshold: Matching threshold
- refPath: Name of person / organisation field in bibliographicResource
- uriPrefix: Prefix for identifier (e.g. http://data.swissbib.ch/person/)
- graphDbDir: Path to Neo4j database
Serialises data as CSV file with optional header
- Implementation: org.swissbib.linked.mf.writer.ContinuousCsvWriter
- In: java.lang.String
- Out: java.lang.Void
- Options:
- compression: Sets the compression mode
- continuousFile: Boolean. If set to true, the header is only written to the first file.
- encoding: Sets the encoding used by the underlying writer
- filenamePostfix: By default the filename consists of a zero-filled sequential number with six digits. Sets a postfix for this number.
- filenamePrefix: By default the filename consists of a zero-filled sequential number with six digits. Sets a prefix for this number.
- filetype: File ending
- footer: Sets the footer which is output after the last object
- header: Sets the header which is output before the first object
- linesPerFile: Number of lines written to one file
- path: Path to directory with CSV files
- separator: Sets the separator which is output between objects
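The interplay of filenamePrefix, filenamePostfix, filetype, linesPerFile and continuousFile can be sketched in Python as follows. The sketch rests on assumptions: the file ending is appended after a dot, and the header does not count towards linesPerFile. Both function names are hypothetical.

```python
def csv_filename(number, prefix="", postfix="", filetype="csv"):
    """Build a file name around a zero-filled, six-digit sequential number."""
    # Assumption: the file ending is appended after a dot.
    return "{}{:06d}{}.{}".format(prefix, number, postfix, filetype)


def split_into_files(lines, lines_per_file, header=None, continuous_file=False):
    """Return a list of files, each file being the list of lines written to it."""
    files = []
    for start in range(0, len(lines), lines_per_file):
        content = list(lines[start:start + lines_per_file])
        # With continuous_file set, only the first file carries the header.
        if header is not None and (not continuous_file or start == 0):
            content.insert(0, header)
        files.append(content)
    return files
```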
Writes records as JSON files which comply with the requirements of the Bulk API of Elasticsearch.
- Implementation: org.swissbib.linked.mf.writer.ESBulkWriter
- In: java.lang.Object
- Out: java.lang.Void
- Options:
- compress: Should files be .gz-compressed? (Default is true)
- filePrefix: Prefix for file names
- fileSize: Number of records in one file
- jsonCompliant: Should files be JSON compliant (Boolean; default: false)? Warning: Setting this parameter to true will result in an invalid Bulk format!
- outDir: Root directory for output
- subdirSize: Number of files in one subdirectory (Default: 300)
- type: Type name of records (will only be attached to filename)
Example: linked-swissbib "Baseline"
Acts as a producer in a Kafka cluster.
- Implementation: org.swissbib.linked.mf.writer.KafkaWriter
- In: java.lang.Object
- Out: java.lang.Void
- Options:
- host: Hostname of Kafka cluster (required)
- port: Port of Kafka cluster (required)
- topic: Name of Kafka topic (required)
Example: A very small example of using the Kafka consumer
Writes csv files for batch uploading to a new Neo4j database. Intended to be used in conjunction with index-neo4j.
- Implementation: org.swissbib.linked.mf.writer.NeoWriter
- In: java.lang.Object
- Out: java.lang.Void
- Options:
- csvDir: Path to the output directory
- csvFileLength: Number of records in one dedicated CSV file
- batchWriteSize: Maximal number of records of the same category
Example: Graph visualisation of the GND
Writes RDF-XML files, one line per record.
- Implementation: org.swissbib.linked.mf.writer.SingleLineWriterRDFXml
- In: java.lang.Object
- Out: java.lang.Void
- Options:
- usecontributor: "true", "false"
- rootTag: XML root tag
- extension: File extension for output files
- compress: Should output files be compressed? ("true", "false")
- baseOutDir: Base directory for output files
- outFilePrefix: Prefix for output files
- fileSize: Number of records in one file
- subDirSize: Number of records in one subdirectory
- type: Concept / type name
Sets up a socket server
- Implementation: org.swissbib.linked.mf.writer.SocketWriter
- In: java.lang.Object
- Out: java.lang.Void
- Option:
- port: Port of socket server (host is always 127.0.0.1)
Example: Stream MARC-XML to socket