feed - ArangoDB random data and load generator

This project is about putting data into and load onto an ArangoDB instance. Its purpose is to cover:

  • large data generation
  • quick data import
  • high parallel load of write and read operations
  • multiple different collection types and graphs (smart, hybrid, satellite)
  • different data scenarios (large/small documents, few/many indexes, search yes/no)
  • different write scenarios (insert/replace/update/remove/truncate)
  • different read scenarios (bulk read/random read/index read/search read)
  • graph traversals
  • everything is driven by a domain-specific language that expresses parallel and sequential loads
  • all operations automatically measure and report throughput and latency

Example

[
normal create database=xyz collection=c numberOfShards=3 replicationFactor=3 drop=true
normal insert database=xyz collection=c parallelism=10
]

This program first creates a normal collection and then inserts some data into it with 10 parallel go-routines (threads).

Command line options

Here is the usage page of the tool. Endpoints can be of type http or https; they can be separated by commas, the option can be given multiple times, or both. The endpoints should be different coordinator endpoints of the same cluster.

The 'feed' tool feeds ArangoDB with generated data, quickly.

Usage:
   [flags]

Flags:
      --endpoints strings      Endpoint of server where data should be written. (default [http://localhost:8529])
      --execute string         Filename of program to execute. (default "doit.feed")
  -h, --help                   help for this command
      --jsonOutputFile string  File name to which a JSON performance report is written.
      --jwt string             JWT token for authenticating with the server.
      --metricsPort int        Metrics port (0 for no metrics) (default 8888)
      --password string        Password for database access
      --protocol string        Protocol (http1, http2, vst) (default "vst")
      --username string        User name for database access. (default "root")
  -v, --verbose                Verbose output
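
For example, assuming the compiled binary is simply called feed (the binary name, host names, password, and program file name below are placeholders), a run against two coordinators of the same cluster could look like this:

feed --endpoints https://coord1:8529,https://coord2:8529 \
     --username root --password secret \
     --execute myload.feed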

Misc cases

  • wait: a no-op program that simply waits for the given number of seconds.

Data cases

This is only an overview; see below for the lists of subcommands and, later on, for detailed instructions.

  • normal: This is a normal (vertex) collection.
  • graph: This is for graphs.
  • replayAQL: This replays a previously recorded list of AQL queries against a new deployment.

Operation cases for normal

This is only an overview; see below for detailed instructions for each subcommand. A combined example program follows the list.

  • create: create collection or graph
  • drop: drop a collection
  • truncate: truncate a collection
  • insert: bulk insert data
  • createIdx: create an index
  • dropIdx: drop an index
  • randomRead: read documents randomly in parallel
  • randomUpdate: perform updates in documents randomly in parallel
  • randomReplace: perform replacements of documents in parallel
  • dropDatabase: drop a database
  • queryOnIdx: run an AQL query using an index (including primary index)
  • createView: create a view (and potentially some analyzers)
  • dropView: drop a view
  • nastyViewData: create data in a collection for a particularly nasty search view
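
As a sketch of how these subcommands can be combined into one program (the database, collection, and index names are made up for illustration; all parameters are explained in the sections below):

[
normal create database=xyz collection=c numberOfShards=3 replicationFactor=2 drop=true
normal insert database=xyz collection=c parallelism=10 size=1G documentSize=200
normal createIdx database=xyz collection=c numberFields=2 idxName=myIdx
{
  normal randomRead database=xyz collection=c parallelism=8 loadPerThread=100
  normal queryOnIdx database=xyz collection=c parallelism=4 idxName=myIdx
}
normal dropDatabase database=xyz
]

This creates the collection, loads it, builds an index, runs random reads and index queries in parallel, and finally drops the database again.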

Operation cases for graph

This is only an overview; see below for detailed instructions for each subcommand.

  • insertvertices: for graph cases, insert vertex data
  • insertedges: for graph cases, insert edge data
  • randomTraversal: for graph cases, run a traversal from a random starting point

feedlang reference

Rules:

  • Input is line based.
  • Curly braces allow grouping for parallel execution.
  • Square brackets allow grouping for sequential execution.
  • Braces and brackets must each be on a line by themselves.
  • White space at the beginning and end of a line is ignored.
  • Lines in which the first non-white space character is # are comments.
  • Empty lines are ignored.

Example:

# Comment

[
  {
    wait 1
    wait 2
    wait 3
  }
  {
    wait 4
    wait 3
  }
]

This executes the first three wait statements concurrently, and when all three are done (i.e. after 3 seconds), moves to the last two, which are also executed concurrently (i.e. they end after another 4 seconds).

Subcommand create (for normal)

Creates a collection.

Example of usage:

normal create database=xyz collection=c numberOfShards=3 replicationFactor=2

Possible parameters for usage:

  • database: name of the database where the collection will be created (default: _system)
  • collection: name of the collection to be created (default: batchimport)
  • numberOfShards: number of shards of the collection (default: 3)
  • replicationFactor: replication factor (number of replicas for each shard) (default: 3)
  • waitForSync: boolean flag; if set to true, all write operations on this collection will wait until the data is persisted to disk (default: false)
  • drop: boolean flag; if set to true, drops the collection if it exists before recreating it (default: false)
  • replicationVersion: replication version to use when creating the database, either 1 (default) or 2

Subcommand drop (for normal)

Drops a collection.

Example of usage:

normal drop database=xyz collection=c

Possible parameters for usage:

  • database: name of the database where the collection will be dropped (default: _system)
  • collection: name of the collection which will be dropped (default: batchimport)

Subcommand truncate (for normal)

Truncates a collection.

Example of usage:

normal truncate database=xyz collection=c

Possible parameters for usage:

  • database: name of the database where the collection is truncated (default: _system)
  • collection: name of the collection which is truncated (default: batchimport)

Subcommand dropDatabase (for normal)

Drops a database.

Example of usage:

normal dropDatabase database=xyz

Possible parameters for usage:

  • database: name of the database which will be dropped

Subcommand createIdx (for normal)

Creates an index.

Example of usage:

normal createIdx database=xyz collection=c withGeo=false numberFields=4 idxName="myIdx"

Possible parameters for usage:

  • database: name of the database where the index will be created (default: _system)
  • collection: name of the collection where the index will be created (default: batchimport)
  • withGeo: whether or not it's a geo index (default: false)
  • numberFields: number of fields the index will cover (default: 1)
  • idxName: user-defined name for the index. Leading and trailing quotes are ignored. (default: "idx" + a random number, e.g. idx123)
  • replicationVersion: replication version to use when creating the database, either 1 (default) or 2

Subcommand dropIdx (for normal)

Drops an index.

Example of usage:

normal dropIdx database=xyz collection=c idxName=myIdx

Possible parameters for usage:

  • database: name of the database where the index will be dropped (default: _system)
  • collection: name of the collection where the index will be dropped (default: batchimport)
  • idxName: user-defined name for the index. Leading and trailing quotes are ignored.

Subcommand insert (for normal)

Inserts data into a collection with batches.

Example of usage:

normal insert database=xyz collection=c parallelism=10 size=5G documentSize=300 withGeo=false withWords=5 numberFields=5 batchSize=5000

Possible parameters for usage:

  • database: name of the database where the data will be inserted (default: _system)
  • collection: name of the collection into which the data will be inserted (default: batchimport)
  • parallelism: number of threads to execute the inserts concurrently (default: 16)
  • batchSize: size of the batches (default: 1000)
  • startDelay: delay in milliseconds between the starts of the different threads (go routines), used to stagger their startup (default: 5)
  • timeout: timeout in seconds for each batch insert; the default is very high but can be lowered for special experiments (default: 3600)
  • retries: number of retries after an error; errors that succeed on retry are still counted as errors but do not abort the operation; by default, no retries are done (default: 0)
  • size: total size of the data to be inserted; the suffixes K (kilobytes, 1024), M (megabytes, 1024^2), G (gigabytes, 1024^3) and T (terabytes, 1024^4) can be used (default: 16G); see the sizing sketch at the end of this section
  • documentSize: approximate size in bytes of an individual document; if you use withGeo and/or withWords, the actual size will be slightly larger (default: 128)
  • withGeo: if set to true, a special field called geo will be generated with a random polygon in GEOJson format (default: false)
  • withWords: if set to true, a special field called words will be generated with a list of some random words from a finite pool (default: false)
  • keySize: size in bytes of the generated key; the key is always the SHA256 value (hex-encoded) of a stringified integer in the range from 0 to N-1, where N is determined by the total size of the data; keySize determines the length of the prefix of that hex string that is actually used (default: 32)
  • numberFields: number of payload fields to generate, the randomly generated string data is distributed across that many fields called payload0, payload1 and so on (default: 1)
  • useAql: If set to true, any CRUD operation (create, replace, update, delete) will be executed through AQL queries instead of using the regular document HTTP API. Can only be used if addFromTo is not set to true. (default: false)
  • oneShard: if set to true, the database will be created as a OneShard database (default: false)
  • replicationVersion: replication version to use when creating the database, either 1 (default) or 2

The following are for edge collections to produce random _from and _to values. Do not use them in actual graph insertion commands:

  • addFromTo: if set to true, add random values for the _from and _to attributes to allow insertion into an edge collection (default: false)
  • smart: if set to true, add what is necessary for a smart edge collection; this is only relevant if addFromTo is set to true; it changes the format of the _from and _to values by adding a smart graph attribute value, and the generated _key is adjusted accordingly (default: false)
  • vertexCollName: name of the vertex collection to use in the _from and _to attributes; only relevant if addFromTo is set to true (default: V)
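
As a rough sizing sketch (the numbers are only an illustration): the number of inserted documents is approximately size divided by documentSize, so

normal insert database=xyz collection=c parallelism=16 size=5G documentSize=500

should produce roughly 5 * 1024^3 / 500, i.e. about 10.7 million documents of approximately 500 bytes each.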

Subcommand queryOnIdx (for normal)

Runs, in parallel, queries that use a specific index: each query performs an operation on an attribute covered by the index, so that the index is actively used during query execution.

Example of query:

FOR doc IN c 
  SORT doc.attrCoveredByIdx
  FILTER doc.attrCoveredByIdx >= 0
  LIMIT 1
  RETURN doc

Example of usage:

normal queryOnIdx database=xyz collection=c limit=1 parallelism=1 idxName=primary

Possible parameters for usage:

  • database: name of the database where the index queries will be executed (default: _system)
  • collection: name of the collection where the index queries will be executed (default: batchimport)
  • idxName: index name. If set to primary, the primary index will be used in the query. If set to another string, the queries will be performed on an attribute of the index with that name. If not present, the query will be performed on the first non-primary index found among the collection's indexes.
  • parallelism: number of threads to execute the queries concurrently (default: 16)
  • loadPerThread: number of times each thread executes the query on the same index (default: 50)
  • queryLimit: number of documents returned from the query (default: 1), this controls whether it is a single document or a batch request.

Subcommand randomRead (for normal)

Reads single documents randomly in parallel.

Example of usage:

normal randomRead database=xyz collection=c parallelism=10 loadPerThread=50

Possible parameters for usage:

  • database: name of the database where random reads will be executed (default: _system)
  • collection: name of the collection where the random reads will be executed (default: batchimport)
  • parallelism: number of threads to execute the reads concurrently (default: 16)
  • loadPerThread: number of times each thread executes random reads (default: 50)

Subcommand randomUpdate (for normal)

Executes updates on random documents.

Example of usage:

normal randomUpdate database=xyz collection=c parallelism=10 loadPerThread=50

Possible parameters for usage:

  • database: name of the database where random updates will be executed (default: _system)
  • collection: name of the collection where the random updates will be executed (default: batchimport)
  • parallelism: number of threads to execute the updates concurrently (default: 16)
  • loadPerThread: number of times each thread executes random updates (default: 50)
  • batchSize: size of the batches (default: 1000)

For the documents written by the updates, the same parameters can be used as in the insert case above.

Subcommand randomReplace (for normal)

Executes replacements of random documents.

Example of usage:

normal randomReplace database=xyz collection=c parallelism=10 loadPerThread=50

Possible parameters for usage:

  • database: name of the database where random replaces will be executed (default: _system)
  • collection: name of the collection where the random replaces will be executed (default: batchimport)
  • parallelism: number of threads to execute the replaces concurrently (default: 16)
  • loadPerThread: number of times each thread executes random replaces (default: 50)
  • batchSize: size of the batches (default: 1000)

For the documents written as replacements, the same parameters can be used as in the insert case above.
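
For instance (a sketch with arbitrary values), replacements with larger documents and several payload fields could be requested like this:

normal randomReplace database=xyz collection=c parallelism=10 loadPerThread=50 batchSize=2000 documentSize=300 numberFields=3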

Subcommand createView (for normal)

Creates a view (with proper links) and potentially analyzers.

Example of usage:

normal createView database=xyz viewDefFile=view.json analyzersDefFile=analyzers.json view=v drop=true

Possible parameters for usage:

  • database: name of the database where the view will be created (default: _system)
  • replicationVersion: replication version to use when creating the database, either 1 (default) or 2
  • view: name of the view to create (default: v)
  • drop: boolean flag; if set to true, drops the view if it exists before recreating it (default: false)
  • viewDefFile: name of a file with a JSON description of the view (see below for an example); in particular, this contains the collections for which links are generated
  • analyzersDefFile: name of a file with a JSON description of the analyzers to be created (see below for an example); analyzers are created before the view so that they can immediately be used in the view definition

Here is an example of a JSON file to create some analyzers; note in particular that it needs to be an array of objects, one for each analyzer:

[
  {
    "name": "segmentation",
    "type": "segmentation",
    "properties": {
      "case": "lower",
      "break": "alpha"
    },
    "features": [
      "frequency",
      "position",
      "norm"
    ]
  },
  {
    "name": "identity",
    "type": "identity",
    "properties": {},
    "features": [
      "frequency",
      "norm"
    ]
  },
  {
    "name": "text_en",
    "type": "text",
    "properties": {
      "locale": "en",
      "case": "lower",
      "stopwords": [],
      "accent": false,
      "stemming": true
    },
    "features": [
      "frequency",
      "position",
      "norm"
    ]
  }
]

Here is an example for a view definition file:

{
  "links": {
    "c": {
      "analyzers": [
        "identity"
      ],
      "collectionName": "transactions",
      "fields": {
        "words" : {
          "analyzers" : [
            "text_en",
            "identity"
          ],
          "cache" : true
        }
      },
      "inBackground": true,
      "includeAllFields": false,
      "name": "idx_1765991063252631552",
      "primaryKeyCache": true,
      "primarySort": [],
      "primarySortCompression": "lz4",
      "storeValues": "none",
      "trackListPositions": false,
      "type": "arangosearch",
      "version": 1
    }
  },
  "storedValues": [
    {
      "fields": [
        "_key",
        "words"
      ],
      "compression": "lz4",
      "cache": true
    }
  ]
}

Subcommand dropView (for normal)

Drops a view.

Example of usage:

normal dropView database=xyz view=v

Possible parameters for usage:

  • database: name of the database where the view will be dropped (default: _system)
  • view: name of the view to drop; note that associated links are dropped as well, but analyzers are not, since we do not know which analyzers were created alongside this view and which existed before (default: v)

Subcommand nastyViewData (for normal)

Inserts data into a collection with batches, producing data which is particularly nasty for a search view.

Example of usage:

normal nastyViewData database=xyz collection=c parallelism=10 size=5G batchSize=5000

Possible parameters for usage:

  • database: name of the database where the data will be inserted (default: _system)
  • collection: name of the collection into which the data will be inserted (default: batchimport)
  • parallelism: number of threads to execute the inserts concurrently (default: 16)
  • batchSize: size of the batches (default: 1000)
  • startDelay: delay in milliseconds between the starts of the different threads (go routines), used to stagger their startup (default: 5)
  • timeout: timeout in seconds for each batch insert; the default is very high but can be lowered for special experiments (default: 3600)
  • retries: number of retries after an error; errors that succeed on retry are still counted as errors but do not abort the operation; by default, no retries are done (default: 0)
  • size: total size of the data to be inserted; the suffixes K (kilobytes, 1024), M (megabytes, 1024^2), G (gigabytes, 1024^3) and T (terabytes, 1024^4) can be used (default: 16G)
  • keySize: size in bytes of the generated key; the key is always the SHA256 value (hex-encoded) of a stringified integer in the range from 0 to N-1, where N is determined by the total size of the data; keySize determines the length of the prefix of that hex string that is actually used (default: 32)
  • numberFields: number of stored fields to generate ("data0", "data1", ...) (default: 10)
  • oneShard: if set to true, the database will be created as a OneShard database (default: false)
  • replicationVersion: replication version to use when creating the database, either 1 (default) or 2

Subcommand create (for graph)

Creates a graph.

Examples of usage:

graph create database=xyz name=G vertexColl=V edgeColl=E type=cyclic graphSize=2000 

Possible parameters for usage:

  • database: name of the database where the graph will be created (default: _system)
  • name: name of the graph to be created (default: G)
  • vertexColl: name of the vertex collection of the graph (default: V)
  • edgeColl: name of the edge collection of the graph (default: E)
  • type: type of graph to be created, currently we have cyclic and tree (default: cyclic)
  • graphSize: number of vertices of the graph (default: 2000)
  • graphDepth: depth of the created tree (default: 10)
  • graphBranching: branching factor of the tree (default: 2)
  • graphDirection: direction of edges in the tree, can be downwards or upwards or bidirected (default: downwards)

For the different types of graphs, the following options are relevant:

  • cyclic: only needs graphSize
  • tree: needs graphDepth, graphBranching and graphDirection (see the example below).
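
For example, a tree graph could be created like this (a sketch using only the parameters listed above):

graph create database=xyz name=T vertexColl=TV edgeColl=TE type=tree graphDepth=5 graphBranching=3 graphDirection=downwards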

Subcommand insertvertices (for graph)

Batch inserts the vertices of a graph.

Examples of usage:

graph insertvertices database=xyz name=G vertexColl=V type=cyclic graphSize=2000 

Possible parameters for usage are the same as for create, plus:

  • batchSize: size of the batches for the insert (default: 1000)
  • parallelism: number of threads (go-routines) to use client-side (default: 16)

For the documents created as vertices, the same parameters as in the insert case above (subcommand for normal) can be used.

Subcommand insertedges (for graph)

Batch inserts the edges of a graph.

Examples of usage:

graph insertedges database=xyz name=G edgeColl=E type=cyclic graphSize=2000 

Possible parameters for usage are the same as for create, plus:

  • batchSize: size of the batches for the insert (default: 1000)
  • parallelism: number of threads (go-routines) to use client-side (default: 16)

For the documents created as edges, the same parameters as in the insert case above (subcommand for normal) can be used.
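
Putting the three graph subcommands together, a complete graph setup might look like the following sketch (names and sizes are arbitrary; the steps run sequentially):

[
graph create database=xyz name=G vertexColl=V edgeColl=E type=cyclic graphSize=2000
graph insertvertices database=xyz name=G vertexColl=V type=cyclic graphSize=2000 parallelism=8
graph insertedges database=xyz name=G edgeColl=E type=cyclic graphSize=2000 parallelism=8
]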

Command replayAQL

Replays a previously recorded list of AQL queries against a new deployment. To record the queries, you have to switch the log topic requests to the TRACE level on all your coordinators. You can do this either on the command line with

--log.level=requests=trace

or at runtime by sending a body of

{"requests":"trace"}

to the API PUT /_admin/log/level on all coordinators.

Then use the awk script scripts/extractQueries.awk on your log as follows:

awk -f scripts/extractQueries.awk < coordinator.log > queries.jsonl

This will produce a file queries.jsonl with lines like this:

{"t":"2023-03-09T08:31:09Z", "db": "_system", "q":{"query":"FOR d IN c FILTER d.Hallo == @x RETURN d","count":false,"bindVars":{"x":12},"stream":false}}

You can then use this file as the input file for feed in a command like this:

replayAQL input=queries.jsonl parallelism=8 delayByTimestamp=true

Possible parameters for usage:

  • parallelism: number of threads (go-routines) to use client-side (default: 16)
  • input: name of input file in the above format
  • delayByTimestamp: a boolean parameter; for an explanation see below

This command takes the queries and runs them against the current deployment which is to be tested. It uses the same database names and query parameters as in the recording. All results are consumed until the database delivers no more.

The feed tool will use as many go routines (threads) as given in the parallelism parameter. Each go routine grabs a query, executes it against the deployment until it returns no more results, and then moves on to the next query. Note that each query in the input is executed by exactly one of the go routines.

If the argument delayByTimestamp is false, each go routine will execute all queries it can grab as quickly as possible (one by one). If it is true, each go routine takes the time stamps into account: it delays the execution of a query as long as the time that has passed since the command started is smaller than the difference between the query's time stamp and the very first time stamp in the file. For example, if the first time stamp in the file is 08:31:09Z and a query carries the time stamp 08:31:39Z, that query will not start earlier than 30 seconds after the command was started. The effect is that the queries are essentially executed at their original frequency (unless the parallelism is lower than the number of concurrently running queries in the recording).

Note that if you combine logs from multiple coordinators, you have to merge the files and sort the resulting input file by time stamp. The time stamps must be in RFC3339 format.
