OpenTelemetry Financial Transaction Simulator

Purpose

This simulator generates realistic OpenTelemetry metrics, logs, and traces for a financial transaction processing system. It is used for:

Local development: Testing Mirador Core's correlation and RCA engines with realistic telemetry data
Demo scenarios: Showcasing platform capabilities with domain-specific observability patterns
Load testing: Generating controlled telemetry volumes for performance testing

Overview

The simulator is fully configuration-driven:

Telemetry naming and outputs are configured through simulator-config.yaml.
Failure scenarios (e.g., bursty failures) are configurable in simulator-config.yaml.
Telemetry outputs can be OTLP, stdout, or both, controlled by telemetry.outputs in the YAML.

Usage

Running Locally

```bash
# Build the simulator
go build -o bin/otel-fintrans-simulator cmd/otel-fintrans-simulator/main.go

# Run with default settings (sends to localhost:4317)
./bin/otel-fintrans-simulator

# Run with custom OTLP endpoint (configured via simulator-config.yaml)
# Edit `simulator-config.yaml` and set `telemetry.endpoint: "http://my-collector:4317"` and optionally `telemetry.insecure: true`.
# Then run normally:
./bin/otel-fintrans-simulator

# Run with specific transaction rate
TRANSACTION_RATE=100 ./bin/otel-fintrans-simulator

# Use a deterministic RNG seed for reproducible runs
./bin/otel-fintrans-simulator --rand-seed 12345

# Print simulator logs to stdout instead of no-op logging
./bin/otel-fintrans-simulator --log-output stdout
```

Helper scripts
--------------
We've added small helper scripts under `scripts/` to make local runs and scenario testing easier. They are convenience wrappers that will build the simulator binary if missing and run the desired scenario(s).

Make them executable first (one-time):

```bash
chmod +x ./scripts/*.sh
```

Common helper scripts

`./scripts/build_and_run.sh [args...]`: builds the simulator binary if needed and runs it with any arguments you pass through. For example: `./scripts/build_and_run.sh --config simulator-config.yaml --log-output stdout --signal-time-interval=5s`
`./scripts/run_examples.sh list`: lists all example scenario YAML files shipped in `examples/scenarios`.
`./scripts/run_examples.sh run <name|path>`: runs a particular scenario (delegates to `examples/run_scenario.sh`). For example: `./scripts/run_examples.sh run cassandra_disk_pressure`
`./scripts/run_examples.sh run-all`: runs all example scenarios sequentially with lightweight defaults (short run lengths and reduced transaction volumes); handy for smoke-testing.
`./scripts/gen_varied_scenarios.sh`: generates short/long/ramp variants for every scenario and writes them to `examples/generated/`, so you can test variant behaviours without editing the original files.

Example: generate variants and run one

```bash
./scripts/gen_varied_scenarios.sh
./scripts/run_examples.sh run examples/generated/cassandra_disk_pressure.short.yaml
```


### Metric export interval

You can control how often the simulator collects and exports metrics to the configured exporters (OTLP/stdout) with the `--signal-time-interval` flag. The value is a Go duration string (for example `15s`, `30s`, `1m`). The default is `15s`.

Examples:

```bash
# default (15s)
./bin/otel-fintrans-simulator --signal-time-interval=15s

# set to 30 seconds
./bin/otel-fintrans-simulator --signal-time-interval=30s

# set to 1 minute
./bin/otel-fintrans-simulator --signal-time-interval=1m

# using `go run` instead of a prebuilt binary (run from the simulator's package directory)
go run . --signal-time-interval=15s
```

For testing dense, continuous time series (recommended when you want good rate() and histogram results):

```bash
# Example: 300 transactions spread over 5 minutes with 10s data and export intervals
./bin/otel-fintrans-simulator \
  --transactions=300 \
  --time-window=5m \
  --data-interval=10s \
  --signal-time-interval=10s \
  --concurrency=10 \
  --failure-mode=mixed \
  --failure-rate=0.2 \
  --config=simulator-config.yaml \
  --log-output=stdout
```

This produces frequent, evenly spaced metric points for 5 minutes so PromQL functions like rate(...[1m]) and histogram_quantile(...) have dense data to operate on.

Note: extremely short intervals may increase CPU/network load; pick an interval appropriate for your testing scenario.

Configuration

The simulator reads its settings from simulator-config.yaml and from a small set of environment variables.

Telemetry endpoint & protocol

telemetry.endpoint: OTLP collector endpoint (default: localhost:4317 when not set in config). The simulator supports both gRPC (default port 4317) and HTTP/OTLP (default port 4318).
telemetry.insecure: when true, use plaintext (no TLS) for the selected protocol (default: true).
telemetry.skip_tls_verify: when using TLS, set to true to skip certificate verification (InsecureSkipVerify). Default: false.

Validation & helpful warnings

The simulator performs lightweight validation of your telemetry settings at startup and logs warnings for inconsistent combinations. Examples:

telemetry.endpoint uses http:// but telemetry.insecure=false — HTTP is plaintext; either set telemetry.insecure: true or use https:// for TLS.
telemetry.endpoint uses https:// but telemetry.insecure=true — that's inconsistent; either set telemetry.insecure: false to use TLS or change the endpoint scheme to http://.
telemetry.skip_tls_verify is ignored when telemetry.insecure is true (plaintext).

Telemetry outputs

Telemetry outputs are configured via the telemetry.outputs field in simulator-config.yaml (no CLI override).

Supported values (single or combined):

otlp — send traces, metrics and logs to the configured OTLP endpoint (default)
stdout — export traces + metrics to stdout (pretty-printed) and print logs to stdout
both — export to both OTLP and stdout

Examples:

Use the default OTLP exporter (no change): keep telemetry.outputs empty / absent and the simulator will send telemetry to the OTLP endpoint.
Use stdout-only or both: edit simulator-config.yaml and add telemetry.outputs: ["stdout"] or telemetry.outputs: ["otlp","stdout"] for the desired behavior (then start simulator normally).
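
The startup checks and output selection described in this section can be sketched roughly as follows. This is not the simulator's actual code: the `telemetryConfig` struct, `validate`, and `resolveOutputs` are hypothetical names that mirror the documented YAML keys, and treating an unknown output value as an error is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// telemetryConfig mirrors the documented telemetry.* keys (hypothetical struct).
type telemetryConfig struct {
	Endpoint      string
	Insecure      bool
	SkipTLSVerify bool
	Outputs       []string
}

// validate returns warnings for the inconsistent combinations listed above.
func validate(c telemetryConfig) []string {
	var warns []string
	switch {
	case strings.HasPrefix(c.Endpoint, "http://") && !c.Insecure:
		warns = append(warns, "endpoint uses http:// but insecure=false; set insecure: true or use https://")
	case strings.HasPrefix(c.Endpoint, "https://") && c.Insecure:
		warns = append(warns, "endpoint uses https:// but insecure=true; set insecure: false or use http://")
	}
	if c.Insecure && c.SkipTLSVerify {
		warns = append(warns, "skip_tls_verify is ignored when insecure is true")
	}
	return warns
}

// resolveOutputs maps telemetry.outputs to the two sinks.
// An empty list falls back to OTLP only, as documented.
// Assumption: an unsupported value is reported as an error.
func resolveOutputs(outputs []string) (otlp, stdout bool, err error) {
	if len(outputs) == 0 {
		return true, false, nil
	}
	for _, o := range outputs {
		switch strings.ToLower(strings.TrimSpace(o)) {
		case "otlp":
			otlp = true
		case "stdout":
			stdout = true
		case "both":
			otlp, stdout = true, true
		default:
			return false, false, fmt.Errorf("unsupported telemetry output %q", o)
		}
	}
	return otlp, stdout, nil
}

func main() {
	cfg := telemetryConfig{
		Endpoint:      "https://my-collector:4317",
		Insecure:      true,
		SkipTLSVerify: true,
		Outputs:       []string{"otlp", "stdout"},
	}
	for _, w := range validate(cfg) {
		fmt.Println("warning:", w)
	}
	otlp, stdout, err := resolveOutputs(cfg.Outputs)
	fmt.Println("otlp:", otlp, "stdout:", stdout, "err:", err)
}
```

Returning warnings as a slice instead of failing fast matches the "lightweight validation" described above: the run continues, and the operator sees what to fix.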

Environment variables:

OTEL_SERVICE_NAME: Service name for root traces (default: api-gateway)
TRANSACTION_RATE: Transactions per second (default: 10)
ERROR_RATE: Percentage of failed transactions (default: 5)
SIMULATION_DURATION: How long to run (default: unlimited)

Configuration file (YAML)

The simulator now supports an optional YAML configuration file that controls telemetry names and failure scheduling.

By default the example config shipped with the tool is simulator-config.yaml (in this folder). Use --config to point to a custom config file:

```bash
# Use a custom YAML config
./bin/otel-fintrans-simulator --config ./cmd/otel-fintrans-simulator/simulator-config.yaml
```

The failure section supports a bursty mode and a list of bursts where the failure rate is multiplied for a time window. This enables more realistic, correlated failures.

Dynamic metric declarations

The simulator can now create extra metrics at startup driven purely by configuration using telemetry.dynamic_metrics. This enables teams to add new gauges, counters or histograms without changing code. Example:

```yaml
telemetry:
  dynamic_metrics:
    - name: cassandra_disk_pressure
      type: gauge
      dataType: float
      description: "Synthetic disk pressure metric"

    - name: api_request_latency_seconds
      type: histogram
      dataType: float
      buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
```

The simulator validates the dynamic metric schema and creates OTEL instruments at startup. Values can be recorded through scenarios; otherwise the simulator emits sample values for gauge and histogram types.
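
The kind of schema validation mentioned above might look like the following sketch. The actual rules the simulator enforces are not documented here, so every check below is an assumption, and `dynamicMetric` and `validateMetric` are hypothetical names. The sketch deliberately stops short of creating real OpenTelemetry instruments so that it stays dependency-free.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// dynamicMetric mirrors one entry under telemetry.dynamic_metrics (hypothetical struct).
type dynamicMetric struct {
	Name        string
	Type        string // gauge, counter or histogram
	DataType    string
	Description string
	Buckets     []float64
}

// validateMetric applies assumed rules: a name is required, the type must be
// one of the three documented kinds, and buckets are histogram-only and sorted.
func validateMetric(m dynamicMetric) error {
	if m.Name == "" {
		return errors.New("metric name is required")
	}
	switch m.Type {
	case "gauge", "counter":
		if len(m.Buckets) > 0 {
			return fmt.Errorf("%s: buckets only apply to histograms", m.Name)
		}
	case "histogram":
		if !sort.Float64sAreSorted(m.Buckets) {
			return fmt.Errorf("%s: buckets must be in ascending order", m.Name)
		}
	default:
		return fmt.Errorf("%s: unknown type %q", m.Name, m.Type)
	}
	return nil
}

func main() {
	metrics := []dynamicMetric{
		{Name: "cassandra_disk_pressure", Type: "gauge", DataType: "float"},
		{Name: "api_request_latency_seconds", Type: "histogram", DataType: "float",
			Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0}},
		{Name: "broken_metric", Type: "summary"},
	}
	for _, m := range metrics {
		if err := validateMetric(m); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("ok:", m.Name)
	}
}
```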

Fully dynamic (all built-in metrics)

The simulator now registers all built-in instruments via the dynamic MetricRegistry at startup. That means:

You can override any of the default metric names using telemetry.metric_names in the YAML; the registry will create instruments using the effective names at startup.
You can add entirely new metrics via telemetry.dynamic_metrics and the simulator will create and expose those instruments at startup without any code changes.
Runtime recording prefers registry-backed handles so the simulator supports a fully dynamic telemetry surface. If a metric is declared in dynamic_metrics it will be available to scenarios and background generators.

This enables teams to add or rename KPIs and instrumentation without modifying the simulator binary — edit the YAML and restart.

Configuration example

The bundled simulator-config.yaml (in this folder) contains a compact example which demonstrates:

Overriding service_names used in spans/attributes
Custom metric_names for all instrumented metrics
A failure section that sets a base rate, chooses a mode (bursty recommended) and one or more bursts with start, duration, and multiplier values

Behavior notes

If --config is not provided or the failure section is absent, the simulator falls back to the CLI flags --failure-rate and --failure-mode (original behaviour).
If the YAML failure.seed is set, the simulator seeds randomness for deterministic runs, which is useful for reproducible demos/tests.

Failure scenarios (config-driven)

The simulator supports richer, configuration-driven scenario injection. Use the failure.scenarios block in simulator-config.yaml to declare correlated, multi-metric scenarios. Each scenario contains a start, duration, optional labels (to scope the scenario to specific label values) and a list of effects.

An effect targets a named simulator dimension or metric and uses one of the following operations:

scale — multiply the target by the specified value
add — add the specified value
set — set the target to the given value
ramp — increment the target by step on each simulation tick
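
The four operations can be sketched as a single function over a target value. The `effect` struct and `apply` function are hypothetical. This README does not say whether scale and add are applied to a per-tick baseline or compound across ticks, so the sketch simply transforms whatever value it is given.

```go
package main

import "fmt"

// effect mirrors one entry in a scenario's effects list (hypothetical struct).
type effect struct {
	Metric string
	Op     string  // scale, add, set or ramp
	Value  float64 // used by scale, add and set
	Step   float64 // used by ramp
}

// apply returns the target's value after applying e once.
// For ramp, calling it on every tick accumulates Step each time.
func apply(current float64, e effect) (float64, error) {
	switch e.Op {
	case "scale":
		return current * e.Value, nil
	case "add":
		return current + e.Value, nil
	case "set":
		return e.Value, nil
	case "ramp":
		return current + e.Step, nil
	default:
		return current, fmt.Errorf("unknown op %q for metric %q", e.Op, e.Metric)
	}
}

func main() {
	// Ramp a latency target by 2.5 on each of three simulated ticks.
	latency := 10.0
	ramp := effect{Metric: "db_latency", Op: "ramp", Step: 2.5}
	for tick := 1; tick <= 3; tick++ {
		next, err := apply(latency, ramp)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		latency = next
		fmt.Printf("tick %d: db_latency=%.1f\n", tick, latency)
	}
}
```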

Example (see simulator-config.yaml in repo):

```yaml
failure:
  scenarios:
    - name: "db_slow_cascade"
      start: "5s"
      duration: "60s"
      labels:
        OrgId: ["bank_01", "bank_02"]
      effects:
        - metric: "db_latency"
          op: "scale"
          value: 5.0
        - metric: "jvm_gc"
          op: "scale"
          value: 3.0
        - metric: "transaction_failures"
          op: "scale"
          value: 4.0

    - name: "bank03_outage"
      start: "20s"
      duration: "40s"
      labels:
        OrgId: ["bank_03"]
      effects:
        - metric: "kafka_controller_UnderReplicatedPartitions"
          op: "add"
          value: 2
        - metric: "transaction_failures"
          op: "scale"
          value: 8.0
```

When a scenario is active the simulator applies its effects to the runtime state during each background tick. You can mix bursts (simple failure-rate multipliers) with scenario windows for rich, realistic fault patterns.

Hardware-fault scenarios

In addition to service- and KPI-focused scenarios, the simulator now supports hardware/infra-fault style effects. These simulate problems such as disk failures impacting Kafka or a bad memory module impacting in-memory datastores (KeyDB/valkey). Example metric names you can use in scenario effects include:

kafka_disk_failure — drives increased Kafka produce/consume errors and ISR noise
keydb_memory_fault / valkey_bad_memory — drives KeyDB/valkey operation failures and increases redis memory/error signals

Use these effects to model outages that originate in underlying infrastructure (hardware, nodes, network) rather than just service deployments.

Network-fault scenarios

We also support network-specific scenarios to simulate packet drops and network-induced latency — useful when failures originate from unreliable network interfaces, congested links, or router problems. Typical metric names for scenario effects:

network_latency / node_network_latency_ms — scales up simulated network latency (affects produce/consume and API gateway processing)
network_packet_drop / node_network_packet_drops_total — increases packet drop counts and causes higher messaging errors

When these scenarios are active the simulator increases network latency on affected nodes and emits packet drop counters. That also increases Kafka/consumer errors and may cascade into higher transaction failures.

Scenario examples — copy/paste ready

Below are practical, ready-to-use scenario YAML snippets you can copy into failure.scenarios in your simulator-config.yaml. These show how to simulate common outage classes — service deployment problems, hardware failures, memory faults, and network problems.

Database slowdown / deployment outage

```yaml
- name: "db_slow_cascade"
  start: "5s"
  duration: "60s"
  labels:
    OrgId: ["bank_01", "bank_02"]
  effects:
    - metric: "db_latency"
      op: "scale"
      value: 5.0
    - metric: "transaction_failures"
      op: "scale"
      value: 4.0
```