This simulator generates realistic OpenTelemetry metrics, logs, and traces for a financial transaction processing system. It is used for:
- Local development: Testing Mirador Core's correlation and RCA engines with realistic telemetry data
- Demo scenarios: Showcasing platform capabilities with domain-specific observability patterns
- Load testing: Generating controlled telemetry volumes for performance testing
The simulator is fully configuration-driven:
- Telemetry naming and outputs are configured through `simulator-config.yaml`.
- Failure scenarios (e.g., bursty failures) are configurable in `simulator-config.yaml`.
- Telemetry outputs can be OTLP, stdout, or both, controlled by `telemetry.outputs` in the YAML.
```bash
# Build the simulator
go build -o bin/otel-fintrans-simulator cmd/otel-fintrans-simulator/main.go

# Run with default settings (sends to localhost:4317)
./bin/otel-fintrans-simulator

# Run with a custom OTLP endpoint (configured via simulator-config.yaml):
# edit `simulator-config.yaml`, set `telemetry.endpoint: "http://my-collector:4317"`
# and optionally `telemetry.insecure: true`, then run normally:
./bin/otel-fintrans-simulator

# Run with a specific transaction rate
TRANSACTION_RATE=100 ./bin/otel-fintrans-simulator

# Use a deterministic RNG seed for reproducible runs
./bin/otel-fintrans-simulator --rand-seed 12345

# Print simulator logs to stdout instead of no-op logging
./bin/otel-fintrans-simulator --log-output stdout
```
Helper scripts
--------------
We've added small helper scripts under `scripts/` to make local runs and scenario testing easier. They are convenience wrappers that will build the simulator binary if missing and run the desired scenario(s).
Make them executable first (one-time):
```bash
chmod +x ./scripts/*.sh
```

Common helper scripts:

- `./scripts/build_and_run.sh [args...]` — build the simulator (if needed) and run it with any arguments you pass through, e.g. `./scripts/build_and_run.sh --config simulator-config.yaml --log-output stdout --signal-time-interval=5s`
- `./scripts/run_examples.sh list` — list all example scenario YAML files shipped in `examples/scenarios/`.
- `./scripts/run_examples.sh run <name|path>` — run a particular scenario (delegates to `examples/run_scenario.sh`), e.g. `./scripts/run_examples.sh run cassandra_disk_pressure`.
- `./scripts/run_examples.sh run-all` — sequentially runs all example scenarios using lightweight defaults (short run lengths and reduced transaction volumes), handy for smoke-testing.
- `./scripts/gen_varied_scenarios.sh` — generates short/long/ramp variants for every scenario and writes them to `examples/generated/` so you can quickly test variant behaviour without editing the original files.

Example: generate variants and run one:

```bash
./scripts/gen_varied_scenarios.sh
./scripts/run_examples.sh run examples/generated/cassandra_disk_pressure.short.yaml
```
### Metric export interval
You can control how often the simulator collects and exports metrics to the configured exporters (OTLP/stdout) with the `--signal-time-interval` flag. The value is a Go duration string (for example `15s`, `30s`, `1m`). The default is `15s`.
Examples:
```bash
# default (15s)
./bin/otel-fintrans-simulator --signal-time-interval=15s
# set to 30 seconds
./bin/otel-fintrans-simulator --signal-time-interval=30s
# set to 1 minute
./bin/otel-fintrans-simulator --signal-time-interval=1m
# using `go run` with a custom interval
go run . --signal-time-interval=15s
```

For testing dense, continuous time series (recommended when you want good `rate()` and histogram results):

```bash
# Example: 300 transactions spread over 5 minutes with 10s data and export intervals
./bin/otel-fintrans-simulator \
  --transactions=300 \
  --time-window=5m \
  --data-interval=10s \
  --signal-time-interval=10s \
  --concurrency=10 \
  --failure-mode=mixed \
  --failure-rate=0.2 \
  --config=simulator-config.yaml \
  --log-output=stdout
```

This produces frequent, evenly spaced metric points for 5 minutes, so PromQL functions like `rate(...[1m])` and `histogram_quantile(...)` have dense data to operate on.
Note: extremely short intervals may increase CPU/network load; pick an interval appropriate for your testing scenario.
Telemetry endpoint & protocol

- `telemetry.endpoint`: OTLP collector endpoint (default: `localhost:4317` when not set in config). The simulator supports both gRPC (default port 4317) and HTTP/OTLP (default port 4318).
- `telemetry.insecure`: when `true`, use plaintext (no TLS) for the selected protocol (default: `true`).
- `telemetry.skip_tls_verify`: when using TLS, set to `true` to skip certificate verification (`InsecureSkipVerify`). Default: `false`.

Validation & helpful warnings

The simulator performs lightweight validation of your telemetry settings at startup and logs warnings for inconsistent combinations. Examples:

- `telemetry.endpoint` uses `http://` but `telemetry.insecure=false` — HTTP is plaintext; either set `telemetry.insecure: true` or use `https://` for TLS.
- `telemetry.endpoint` uses `https://` but `telemetry.insecure=true` — that's inconsistent; either set `telemetry.insecure: false` to use TLS or change the endpoint scheme to `http://`.
- `telemetry.skip_tls_verify` is ignored when `telemetry.insecure` is `true` (plaintext).
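The checks above boil down to comparing the endpoint scheme with the TLS flags. A minimal sketch of that logic, using illustrative names rather than the simulator's actual identifiers:

```go
package main

import (
	"fmt"
	"strings"
)

// validateTelemetry returns startup warnings for inconsistent
// endpoint/TLS combinations, mirroring the rules described above.
func validateTelemetry(endpoint string, insecure, skipTLSVerify bool) []string {
	var warns []string
	if strings.HasPrefix(endpoint, "http://") && !insecure {
		warns = append(warns, "http:// endpoint with insecure=false: HTTP is plaintext; set telemetry.insecure: true or use https://")
	}
	if strings.HasPrefix(endpoint, "https://") && insecure {
		warns = append(warns, "https:// endpoint with insecure=true: set telemetry.insecure: false or change the scheme to http://")
	}
	if insecure && skipTLSVerify {
		warns = append(warns, "telemetry.skip_tls_verify is ignored when telemetry.insecure is true")
	}
	return warns
}

func main() {
	for _, w := range validateTelemetry("http://my-collector:4317", false, false) {
		fmt.Println("WARN:", w)
	}
}
```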
Telemetry outputs are configured via the `telemetry.outputs` field in `simulator-config.yaml` (no CLI override).

Supported values (single or combined):

- `otlp` — send traces, metrics, and logs to the configured OTLP endpoint (default)
- `stdout` — export traces and metrics to stdout (pretty-printed) and print logs to stdout
- `both` — export to both OTLP and stdout

Examples:

- Use the default OTLP exporter (no change): keep `telemetry.outputs` empty/absent and the simulator will send telemetry to the OTLP endpoint.
- Use stdout-only or both: edit `simulator-config.yaml` and add `telemetry.outputs: ["stdout"]` or `telemetry.outputs: ["otlp","stdout"]`, then start the simulator normally.
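Resolving the outputs list reduces to a pair of booleans. A sketch of how that selection could work, assuming the semantics described above (names here are illustrative, not the simulator's actual function):

```go
package main

import "fmt"

// resolveOutputs interprets telemetry.outputs: empty means OTLP only,
// otherwise each listed value enables the corresponding exporter.
func resolveOutputs(outputs []string) (otlp, stdout bool) {
	if len(outputs) == 0 {
		return true, false // default: OTLP only
	}
	for _, o := range outputs {
		switch o {
		case "otlp":
			otlp = true
		case "stdout":
			stdout = true
		case "both":
			otlp, stdout = true, true
		}
	}
	return otlp, stdout
}

func main() {
	o, s := resolveOutputs([]string{"otlp", "stdout"})
	fmt.Println(o, s) // true true
}
```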
Environment variables:

- `OTEL_SERVICE_NAME`: Service name for root traces (default: `api-gateway`)
- `TRANSACTION_RATE`: Transactions per second (default: `10`)
- `ERROR_RATE`: Percentage of failed transactions (default: `5`)
- `SIMULATION_DURATION`: How long to run (default: unlimited)
The simulator now supports an optional YAML configuration file that controls telemetry names and failure scheduling.
By default the example config shipped with the tool is `simulator-config.yaml` (in this folder). Use `--config` to point to a custom config file:

```bash
# Use a custom YAML config
./bin/otel-fintrans-simulator --config ./cmd/otel-fintrans-simulator/simulator-config.yaml
```

The `failure` section supports a bursty mode and a list of bursts during which the failure rate is multiplied for a time window. This enables more realistic, correlated failures.
The simulator can now create extra metrics at startup driven purely by configuration using telemetry.dynamic_metrics. This enables teams to add new gauges, counters or histograms without changing code. Example:
```yaml
telemetry:
  dynamic_metrics:
    - name: cassandra_disk_pressure
      type: gauge
      dataType: float
      description: "Synthetic disk pressure metric"
    - name: api_request_latency_seconds
      type: histogram
      dataType: float
      buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
```

The simulator will validate the dynamic metric schema and create OTel instruments at startup. Recording can be configured via scenarios, or the simulator will emit sample values for gauge/histogram types.
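The kind of schema validation involved can be sketched as follows. The struct and method names are illustrative, not the simulator's actual types; the checks (known instrument type, strictly increasing histogram buckets) follow the schema shown above:

```go
package main

import "fmt"

// DynamicMetric mirrors the YAML schema shown above (illustrative names).
type DynamicMetric struct {
	Name     string
	Type     string // "gauge", "counter", or "histogram"
	DataType string // "float" or "int"
	Buckets  []float64
}

// validate performs lightweight schema checks before instrument creation.
func (m DynamicMetric) validate() error {
	switch m.Type {
	case "gauge", "counter", "histogram":
	default:
		return fmt.Errorf("%s: unknown type %q", m.Name, m.Type)
	}
	if m.Type == "histogram" {
		for i := 1; i < len(m.Buckets); i++ {
			if m.Buckets[i] <= m.Buckets[i-1] {
				return fmt.Errorf("%s: bucket boundaries must be strictly increasing", m.Name)
			}
		}
	}
	return nil
}

func main() {
	m := DynamicMetric{Name: "api_request_latency_seconds", Type: "histogram",
		DataType: "float", Buckets: []float64{0.005, 0.01, 0.025}}
	fmt.Println(m.validate()) // <nil>
}
```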
The simulator now registers all built-in instruments via the dynamic MetricRegistry at startup. That means:
- You can override any of the default metric names using `telemetry.metric_names` in the YAML; the registry will create instruments using the effective names at startup.
- You can add entirely new metrics via `telemetry.dynamic_metrics` and the simulator will create and expose those instruments at startup without any code changes.
- Runtime recording prefers registry-backed handles, so the simulator supports a fully dynamic telemetry surface. If a metric is declared in `dynamic_metrics` it will be available to scenarios and background generators.
This enables teams to add or rename KPIs and instrumentation without modifying the simulator binary — edit the YAML and restart.
The bundled simulator-config.yaml (in this folder) contains a compact example which demonstrates:
- Overriding `service_names` used in spans/attributes
- Custom `metric_names` for all instrumented metrics
- A `failure` section that sets a base `rate`, chooses a `mode` (`bursty` recommended) and one or more `bursts` with `start`, `duration`, and `multiplier` values

Notes:

- If `--config` is not provided or the `failure` section is absent, the simulator falls back to the CLI flags `--failure-rate` and `--failure-mode` (original behaviour).
- If the YAML `failure.seed` is set, the simulator seeds randomness for deterministic runs, which is useful for reproducible demos/tests.
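Deterministic seeding works because a fixed seed makes the pseudo-random failure draws repeat exactly. A minimal sketch, assuming `failure.seed` feeds a dedicated `*rand.Rand` (the simulator's actual wiring may differ):

```go
package main

import (
	"fmt"
	"math/rand"
)

// failureStream samples n Bernoulli(rate) failure decisions from a seeded RNG.
func failureStream(seed int64, rate float64, n int) []bool {
	rng := rand.New(rand.NewSource(seed))
	out := make([]bool, n)
	for i := range out {
		out[i] = rng.Float64() < rate // true => this transaction fails
	}
	return out
}

func main() {
	a := failureStream(12345, 0.2, 5)
	b := failureStream(12345, 0.2, 5)
	// same seed => identical failure pattern across runs
	fmt.Println(fmt.Sprint(a) == fmt.Sprint(b)) // true
}
```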
The simulator supports richer, configuration-driven scenario injection. Use the failure.scenarios block in simulator-config.yaml to declare correlated, multi-metric scenarios. Each scenario contains a start, duration, optional labels (to scope the scenario to specific label values) and a list of effects.
An effect targets a named simulator dimension or metric and uses one of the following operations:
- `scale` — multiply the target by the specified value
- `add` — add the specified value
- `set` — set the target to the given value
- `ramp` — increment the target by `step` on each simulation tick
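The four operations above can be sketched as a single pure function over the runtime metric value. Types and names here are illustrative, not the simulator's actual code:

```go
package main

import "fmt"

// Effect mirrors the scenario-effect shape described above (illustrative names).
type Effect struct {
	Op    string  // "scale", "add", "set", or "ramp"
	Value float64 // used by scale/add/set
	Step  float64 // used by ramp
}

// apply returns the new value of a runtime metric after one simulation tick
// with the effect active.
func apply(current float64, e Effect) float64 {
	switch e.Op {
	case "scale":
		return current * e.Value
	case "add":
		return current + e.Value
	case "set":
		return e.Value
	case "ramp":
		return current + e.Step
	}
	return current // unknown op: leave unchanged
}

func main() {
	latency := 10.0
	latency = apply(latency, Effect{Op: "scale", Value: 5.0}) // db_slow_cascade-style spike
	fmt.Println(latency) // 50
}
```

Note that `ramp` compounds over ticks while `scale` is applied against the baseline each tick, which is what makes `ramp` suitable for gradual degradations.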
Example (see simulator-config.yaml in repo):
```yaml
failure:
  scenarios:
    - name: "db_slow_cascade"
      start: "5s"
      duration: "60s"
      labels:
        OrgId: ["bank_01", "bank_02"]
      effects:
        - metric: "db_latency"
          op: "scale"
          value: 5.0
        - metric: "jvm_gc"
          op: "scale"
          value: 3.0
        - metric: "transaction_failures"
          op: "scale"
          value: 4.0
    - name: "bank03_outage"
      start: "20s"
      duration: "40s"
      labels:
        OrgId: ["bank_03"]
      effects:
        - metric: "kafka_controller_UnderReplicatedPartitions"
          op: "add"
          value: 2
        - metric: "transaction_failures"
          op: "scale"
          value: 8.0
```

When a scenario is active the simulator applies its effects to the runtime state during each background tick. You can mix bursts (simple failure-rate multipliers) with scenario windows for rich, realistic fault patterns.
In addition to service- and KPI-focused scenarios, the simulator now supports hardware/infra-fault style effects. These simulate problems such as disk failures impacting Kafka or a bad memory module impacting in-memory datastores (KeyDB/valkey). Example metric names you can use in scenario effects include:
- `kafka_disk_failure` — drives increased Kafka produce/consume errors and ISR noise
- `keydb_memory_fault` / `valkey_bad_memory` — drives KeyDB/valkey operation failures and increases Redis memory/error signals
Use these effects to model outages that originate in underlying infrastructure (hardware, nodes, network) rather than just service deployments.
We also support network-specific scenarios to simulate packet drops and network-induced latency — useful when failures originate from unreliable network interfaces, congested links, or router problems. Typical metric names for scenario effects:
- `network_latency` / `node_network_latency_ms` — scales up simulated network latency (affects produce/consume and API gateway processing)
- `network_packet_drop` / `node_network_packet_drops_total` — increases packet drop counts and causes higher messaging errors
When these scenarios are active the simulator increases network latency on affected nodes and emits packet drop counters. That also increases Kafka/consumer errors and may cascade into higher transaction failures.
Below are practical, ready-to-use scenario YAML snippets you can copy into failure.scenarios in your simulator-config.yaml. These show how to simulate common outage classes — service deployment problems, hardware failures, memory faults, and network problems.
- Database slowdown / deployment outage
```yaml
- name: "db_slow_cascade"
  start: "5s"
  duration: "60s"
  labels:
    OrgId: ["bank_01", "bank_02"]
  effects:
    - metric: "db_latency"
      op: "scale"
      value: 5.0
    - metric: "transaction_failures"
      op: "scale"
      value: 4.0
```

Run this scenario (one-liner):
```bash
cat > /tmp/db_slow_cascade.yaml <<'YAML'
failure:
  scenarios:
    - name: "db_slow_cascade"
      start: "0s"
      duration: "60s"
      labels:
        OrgId: ["bank_01", "bank_02"]
      effects:
        - metric: "db_latency"
          op: "scale"
          value: 5.0
        - metric: "transaction_failures"
          op: "scale"
          value: 4.0
YAML

# start simulator with the scenario (stdout) -- use rand-seed for reproducibility
TRANSACTION_RATE=50 ./bin/otel-fintrans-simulator --config /tmp/db_slow_cascade.yaml --log-output stdout --rand-seed 12345
```

- Kafka disk / storage failure (hardware-originating outage)
```yaml
- name: "kafka_disk_issue"
  start: "0s"
  duration: "90s"
  labels:
    OrgId: ["bank_01"]
  effects:
    - metric: "kafka_disk_failure"  # simulator maps this to higher kafka errors + ISR noise
      op: "scale"
      value: 4.0
    - metric: "kafka_controller_UnderReplicatedPartitions"
      op: "add"
      value: 10
```

Run this scenario (one-liner):
```bash
cat > /tmp/kafka_disk_issue.yaml <<'YAML'
failure:
  scenarios:
    - name: "kafka_disk_issue"
      start: "0s"
      duration: "90s"
      labels:
        OrgId: ["bank_01"]
      effects:
        - metric: "kafka_disk_failure"
          op: "scale"
          value: 4.0
        - metric: "kafka_controller_UnderReplicatedPartitions"
          op: "add"
          value: 10
YAML

TRANSACTION_RATE=50 ./bin/otel-fintrans-simulator --config /tmp/kafka_disk_issue.yaml --log-output stdout --rand-seed 12345
```

- Cassandra gradual disk-pressure (simulates disk filling up, causing compaction backlog, IO pressure, and downstream latency/failures)
```yaml
- name: "cassandra_disk_pressure_gradual"
  start: "10s"
  duration: "3m"
  labels:
    OrgId: ["bank_02"]
  effects:
    - metric: "cassandra_disk_pressure"  # synthetic, ramps gradually
      op: "ramp"
      step: 0.2
    - metric: "node_filesystem_avail_bytes"  # available bytes reduced (simulated)
      op: "scale"
      value: 0.35
    - metric: "cassandra_compaction_pending_tasks"  # compaction backlog grows
      op: "add"
      value: 5
    - metric: "db_latency"
      op: "scale"
      value: 3.0
    - metric: "transaction_latency_seconds"
      op: "scale"
      value: 2.0
    - metric: "transactions_failed_total"
      op: "scale"
      value: 4.0
```

Run this scenario (one-liner):
```bash
cat > /tmp/cassandra_disk_pressure.yaml <<'YAML'
failure:
  scenarios:
    - name: "cassandra_disk_pressure_gradual"
      start: "10s"
      duration: "3m"
      labels:
        OrgId: ["bank_02"]
      effects:
        - metric: "cassandra_disk_pressure"
          op: "ramp"
          step: 0.2
        - metric: "node_filesystem_avail_bytes"
          op: "scale"
          value: 0.35
        - metric: "cassandra_compaction_pending_tasks"
          op: "add"
          value: 5
        - metric: "db_latency"
          op: "scale"
          value: 3.0
        - metric: "transaction_latency_seconds"
          op: "scale"
          value: 2.0
        - metric: "transactions_failed_total"
          op: "scale"
          value: 4.0
YAML

TRANSACTION_RATE=50 ./bin/otel-fintrans-simulator --config /tmp/cassandra_disk_pressure.yaml --log-output stdout --rand-seed 12345
```

- KeyDB / valkey memory fault (bad RAM causing in-memory DB failures)
```yaml
- name: "keydb_memory_corruption"
  start: "0s"
  duration: "2m"
  labels:
    OrgId: ["bank_02"]
  effects:
    - metric: "keydb_memory_fault"  # increases KeyDB failures and redis miss/eviction noise
      op: "scale"
      value: 5.0
    - metric: "redis_memory"
      op: "scale"
      value: 2.5
```

Run this scenario (one-liner):
```bash
cat > /tmp/keydb_memory_corruption.yaml <<'YAML'
failure:
  scenarios:
    - name: "keydb_memory_corruption"
      start: "0s"
      duration: "2m"
      labels:
        OrgId: ["bank_02"]
      effects:
        - metric: "keydb_memory_fault"
          op: "scale"
          value: 5.0
        - metric: "redis_memory"
          op: "scale"
          value: 2.5
YAML

TRANSACTION_RATE=50 ./bin/otel-fintrans-simulator --config /tmp/keydb_memory_corruption.yaml --log-output stdout --rand-seed 12345
```

- Network packet loss
```yaml
- name: "network_packet_loss"
  start: "30s"
  duration: "90s"
  labels:
    OrgId: ["bank_03"]
  effects:
    - metric: "network_packet_drop"
      op: "scale"
      value: 6.0
    - metric: "node_network_packet_drops_total"
      op: "add"
      value: 10
```

Run this scenario (one-liner):
```bash
cat > /tmp/network_packet_loss.yaml <<'YAML'
failure:
  scenarios:
    - name: "network_packet_loss"
      start: "0s"
      duration: "90s"
      labels:
        OrgId: ["bank_03"]
      effects:
        - metric: "network_packet_drop"
          op: "scale"
          value: 6.0
        - metric: "node_network_packet_drops_total"
          op: "add"
          value: 10
YAML

TRANSACTION_RATE=50 ./bin/otel-fintrans-simulator --config /tmp/network_packet_loss.yaml --log-output stdout --rand-seed 12345
```

- Network latency spike
```yaml
- name: "network_latency_spike"
  start: "45s"
  duration: "1m30s"
  labels:
    OrgId: ["bank_02"]
  effects:
    - metric: "network_latency"  # scales node-level network latency (ms)
      op: "scale"
      value: 5.0
```

Run this scenario (one-liner):
```bash
cat > /tmp/network_latency_spike.yaml <<'YAML'
failure:
  scenarios:
    - name: "network_latency_spike"
      start: "0s"
      duration: "1m30s"
      labels:
        OrgId: ["bank_02"]
      effects:
        - metric: "network_latency"
          op: "scale"
          value: 5.0
YAML

TRANSACTION_RATE=50 ./bin/otel-fintrans-simulator --config /tmp/network_latency_spike.yaml --log-output stdout --rand-seed 12345
```

- Redis/KeyDB memory bloat
```yaml
- name: "redis_memory_bloat"
  start: "20s"
  duration: "90s"
  labels:
    OrgId: ["bank_03"]
  effects:
    - metric: "redis_memory"
      op: "scale"
      value: 2.0
    - metric: "redis_evicted_keys_total"
      op: "add"
      value: 20
```

Run this scenario (one-liner):
```bash
cat > /tmp/redis_memory_bloat.yaml <<'YAML'
failure:
  scenarios:
    - name: "redis_memory_bloat"
      start: "0s"
      duration: "90s"
      labels:
        OrgId: ["bank_03"]
      effects:
        - metric: "redis_memory"
          op: "scale"
          value: 2.0
        - metric: "redis_evicted_keys_total"
          op: "add"
          value: 20
YAML

TRANSACTION_RATE=40 ./bin/otel-fintrans-simulator --config /tmp/redis_memory_bloat.yaml --log-output stdout --rand-seed 12345
```

- Tomcat thread ramp & queue growth (load generation)
```yaml
- name: "tomcat_thread_ramp"
  start: "15s"
  duration: "2m"
  labels:
    OrgId: ["bank_02", "bank_04"]
  effects:
    - metric: "tomcat_threads"
      op: "ramp"
      step: 2
    - metric: "tomcat_threads_queue_seconds"
      op: "add"
      value: 5
```

Run this scenario (one-liner):
```bash
cat > /tmp/tomcat_thread_ramp.yaml <<'YAML'
failure:
  scenarios:
    - name: "tomcat_thread_ramp"
      start: "0s"
      duration: "2m"
      labels:
        OrgId: ["bank_02", "bank_04"]
      effects:
        - metric: "tomcat_threads"
          op: "ramp"
          step: 2
        - metric: "tomcat_threads_queue_seconds"
          op: "add"
          value: 5
YAML

TRANSACTION_RATE=100 ./bin/otel-fintrans-simulator --config /tmp/tomcat_thread_ramp.yaml --log-output stdout --rand-seed 12345
```

- Kafka under-replicated partitions burst (controller-level instability)
```yaml
- name: "kafka_underreplicated_burst"
  start: "30s"
  duration: "1m"
  labels:
    OrgId: ["bank_01", "bank_03"]
  effects:
    - metric: "kafka_controller_UnderReplicatedPartitions"
      op: "add"
      value: 5
    - metric: "transaction_failures"
      op: "scale"
      value: 6.0
```

Run this scenario (one-liner):
```bash
cat > /tmp/kafka_underreplicated_burst.yaml <<'YAML'
failure:
  scenarios:
    - name: "kafka_underreplicated_burst"
      start: "0s"
      duration: "60s"
      labels:
        OrgId: ["bank_01", "bank_03"]
      effects:
        - metric: "kafka_controller_UnderReplicatedPartitions"
          op: "add"
          value: 5
        - metric: "transaction_failures"
          op: "scale"
          value: 6.0
YAML

TRANSACTION_RATE=50 ./bin/otel-fintrans-simulator --config /tmp/kafka_underreplicated_burst.yaml --log-output stdout --rand-seed 12345
```

- Composite / partial outage (multi-system cascade)
```yaml
- name: "partial_outage_bank03"
  start: "60s"
  duration: "2m"
  labels:
    OrgId: ["bank_03"]
  effects:
    - metric: "kafka_controller_UnderReplicatedPartitions"
      op: "add"
      value: 3
    - metric: "db_latency"
      op: "scale"
      value: 6.0
    - metric: "transaction_failures"
      op: "scale"
      value: 8.0
```

Run this scenario (one-liner):
```bash
cat > /tmp/partial_outage_bank03.yaml <<'YAML'
failure:
  scenarios:
    - name: "partial_outage_bank03"
      start: "0s"
      duration: "2m"
      labels:
        OrgId: ["bank_03"]
      effects:
        - metric: "kafka_controller_UnderReplicatedPartitions"
          op: "add"
          value: 3
        - metric: "db_latency"
          op: "scale"
          value: 6.0
        - metric: "transaction_failures"
          op: "scale"
          value: 8.0
YAML

TRANSACTION_RATE=60 ./bin/otel-fintrans-simulator --config /tmp/partial_outage_bank03.yaml --log-output stdout --rand-seed 12345
```

- Mixed signals (contradictory telemetry)
```yaml
- name: "kafka_mixed_signals"
  start: "0s"
  duration: "90s"
  labels:
    OrgId: ["bank_01"]
  effects:
    - metric: "kafka_disk_failure"
      op: "scale"
      value: 4.0
    - metric: "kafka_throughput"
      op: "scale"
      value: 0.2
# This produces more kafka errors and URP while reducing requests/throughput — a mixed-signal pattern
```

Run this scenario (one-liner):
```bash
cat > /tmp/kafka_mixed_signals.yaml <<'YAML'
failure:
  scenarios:
    - name: "kafka_mixed_signals"
      start: "0s"
      duration: "90s"
      labels:
        OrgId: ["bank_01"]
      effects:
        - metric: "kafka_disk_failure"
          op: "scale"
          value: 4.0
        - metric: "kafka_throughput"
          op: "scale"
          value: 0.2
YAML

TRANSACTION_RATE=50 ./bin/otel-fintrans-simulator --config /tmp/kafka_mixed_signals.yaml --log-output stdout --rand-seed 12345
```

Tips & mapping notes
--------------------
- Metric names are flexible — the scheduler accepts the common names listed above and applies them to the simulator runtime state. If your telemetry backend expects custom names, update `telemetry.metric_names` in `simulator-config.yaml`.
- Use `labels` to scope scenarios to specific OrgIds, OrgNames, or transaction types — this helps exercise RCA/correlation engines against targeted faults.
- `scale` and `add` are useful for magnitude changes; `ramp` is useful for gradual increases over the scenario duration.
Now that you have concrete scenarios, your RCA and correlation engines can detect and classify whether failures originate in service deployments, infra (e.g., disks/memory), or network layers. Use the deterministic seed (failure.seed) to make demo runs reproducible for tests and demos.
I've added ready-to-run scenario configuration files under examples/scenarios/. Use examples/run_scenario.sh <name> to run a scenario quickly — the script will build the binary if necessary and run the simulator with sensible defaults.
Example:
```bash
# run the kafka_disk_issue scenario (from examples/scenarios/kafka_disk_issue.yaml)
examples/run_scenario.sh kafka_disk_issue
```

Below are the example scenario files included in `examples/scenarios/`. Each file is a full, standalone simulator YAML (containing `telemetry`, `metric_names`, `labels`, and the `failure` section), ready to run as-is.
- `db_slow_cascade.yaml` — database slowdown / deployment outage (increased db latency, more transaction failures)
- `kafka_disk_issue.yaml` — hardware disk/storage failure affecting Kafka (more produce/consume errors, higher URP)
- `keydb_memory_corruption.yaml` — bad memory for KeyDB / valkey (higher KeyDB failures and Redis noise)
- `network_packet_loss.yaml` — network packet drops (increased packet drops and noisy messaging errors)
- `network_latency_spike.yaml` — network latency spike (higher network latency across nodes, impacts consumer/producer latencies)
- `redis_memory_bloat.yaml` — Redis/KeyDB memory bloat and evictions
- `tomcat_thread_ramp.yaml` — Tomcat thread busy ramp and queue increases (load stress)
- `kafka_underreplicated_burst.yaml` — Kafka under-replication burst (controller instability)
- `kafka_mixed_signals.yaml` — mixed signal: more Kafka errors with reduced throughput (contradictory telemetry)
- `partial_outage_bank03.yaml` — composite partial outage targeting bank_03 (multi-system cascade)
Run any of them with the helper script, for example:

```bash
examples/run_scenario.sh kafka_mixed_signals
```

Or run directly using the binary and config path:

```bash
TRANSACTION_RATE=50 ./bin/otel-fintrans-simulator --config examples/scenarios/kafka_mixed_signals.yaml --log-output stdout
```

```bash
# 1. Start the simulator
./bin/otel-fintrans-simulator &

# 2. Wait for telemetry to accumulate (30s)
sleep 30

# 3. Query correlation API with time window
curl -X POST http://localhost:8010/api/v1/unified/correlate \
  -H "Content-Type: application/json" \
  -d '{
    "startTime": "2025-11-25T10:00:00Z",
    "endTime": "2025-11-25T10:15:00Z"
  }'
```

The correlation engine will:
- Discover `transactions_failed_total` via the KPI registry
- Identify correlated service latency patterns
- Build a service graph from observed trace telemetry
- Return a correlation result with confidence scores
```bash
# Inject a specific failure
INJECT_DB_LATENCY=true ./bin/otel-fintrans-simulator &

# Query RCA for root cause analysis
curl -X POST http://localhost:8010/api/v1/unified/rca \
  -H "Content-Type: application/json" \
  -d '{
    "startTime": "2025-11-25T10:05:00Z",
    "endTime": "2025-11-25T10:10:00Z"
  }'
```

When modifying the simulator:
- Keep domain fidelity: Financial transaction vocabulary should remain realistic
- Don't pollute engines: Avoid copying simulator defaults into `internal/services/correlation_engine.go`; engines should discover KPIs dynamically via the registry.
- Update this README: Document any new metrics, service names, or configuration changes.
- Test with real Mirador: Ensure correlation/RCA engines still work via registry discovery.
Apache 2.0 (see top-level LICENSE file)
Below is a proposed, ordered list of follow-up improvements and small issues we plan to take up one-by-one. These are intentionally scoped so we can address them in small, reviewable PRs that improve developer experience and simulation fidelity.
- Make logging configurable and developer-friendly ✅
- Add a `--log` or `--log-output` flag that allows `stdout`/`nop` (default) and `otlp` so developers can see console logs locally by default. (Implemented: `--log-output` supports `nop` and `stdout`.)
- Add an environment variable override (e.g., `SIM_LOG_OUTPUT`).
- Graceful shutdown and context cancellation ✅
- Add a signal handler (SIGINT/SIGTERM) and a cancellable context to allow the simulator to stop quickly and shut down OTel providers cleanly. (Implemented)
- Improve time-series scheduling and backfill behavior
- Tweak scheduling so that future/edge offsets don't block for long periods (consider a bounded scheduler and non-blocking generation for unreachable timestamps).
- Make telemetry configuration pluggable ✅
- Done: `simulator-config.yaml` added and `--config` flag implemented. See `config.go` plus examples and unit tests (`config_test.go`).
- Fix metric instrument types & semantics ✅
- Replace `transaction_amount_paisa_count` (previously an UpDownCounter) with a monotonic `Counter` (done).
6a. Module / developer setup
- Add a `go.mod` and `go.sum` (module-aware layout) so `go test ./...` and other Go tooling work for developers and CI. This helps keep dependency versions stable for reproducible test runs.
For local development we've added a convenient Makefile with a couple of helpful targets:
- `make build` — builds the simulator binary into `bin/`.
- `make test` — runs `go test ./...`.
- `make localdev-sim` — builds and runs a small sample using default settings (targeting a local collector).
See Makefile at the project root.
- Add reproducibility and seeding ✅
- Add a `--rand-seed` flag so tests and demo runs can generate reproducible streams. (Implemented)
- Improve failure-mode realism ✅
- Done: burst/scheduled failure support and deterministic seeding implemented. See `config.go`, `simulator-config.yaml`, and `scheduler_test.go` for examples and tests.
- Add tests & CI automation for simulator build and basic runtime
- Small unit tests for metrics initialization and a quick smoke `go test` that validates `initOTel` can be invoked in a test shim (using a fake collector or a no-op option).
We've added a lightweight GitHub Actions workflow `.github/workflows/ci.yml` that runs `go test ./...`, `gofmt` checks, and `go vet` on pushes and PRs to `main`.
The repository includes a manual release workflow you can trigger from the GitHub Actions UI (Actions → CI → Run workflow). The workflow builds a platform-tagged binary and (optionally) builds & pushes a container image to GHCR and DockerHub.
Required inputs when running the workflow manually:
- `release_tag` (required) — the version used for the release and image tag (for example `v1.0.0`).
- `push_image` (optional, default `true`) — whether to build & push container images.
- `platform` (optional, default `amd64`) — architecture part of the image tag, used for GOARCH and the image tag.
- `dockerhub_repository` (optional) — `owner/repo` on Docker Hub if you want the workflow to push to Docker Hub in addition to GHCR.
Secrets / permissions
- GHCR: the workflow uses the repository's `GITHUB_TOKEN` to push container images to GitHub Container Registry. The workflow requests `packages: write` permission.
- DockerHub: if you want CI to push images to Docker Hub, set these repository secrets: `DOCKERHUB_USERNAME` and `DOCKERHUB_TOKEN` (or a personal access token).
Image / asset naming
- Container image tag format: `{release_tag}-{platform}` (e.g. `v1.0.0-amd64`) — pushed as `ghcr.io/<owner>/<repo>:<tag>` and, if enabled, `docker.io/<dockerhub_repository>:<tag>`.
- Release binary asset name: `otel-fintrans-simulator-{release_tag}-{platform}` (uploaded to the GitHub Release created by the workflow).
NOTE: The manual release workflow is currently restricted to the repository owner when invoked from the UI; if you want to allow additional actors or enable automatic tag-driven releases, I can update the workflow accordingly.
- Add a convenient local-dev Make target
  - `make localdev-sim` or `make run-simulator` that builds and runs the simulator against a local collector or logs to stdout for quick demos.
- Document an example `otel-collector` pipeline and how to point the simulator at it
- Add clear examples showing collector config for receivers/exporters so new developers can get telemetry into Observability tooling quickly.
There's a minimal OTel Collector pipeline example in `examples/otel-collector-pipeline.yaml` that accepts OTLP and writes logs (useful for local testing). Point the simulator at your collector using `OTEL_EXPORTER_OTLP_ENDPOINT` or `--otlp-endpoint`.
- Use clear, dedicated metrics for each subsystem's latency (e.g., `kafka_produce_latency_seconds`, `kafka_consume_latency_seconds`) instead of recording Kafka latency into `db_latency_seconds`, to avoid conflating database and messaging latencies. (Implemented in code: Kafka latency histograms are registered and used.)
- Ensure key metric instruments include useful attributes such as `service_name`, `messaging.destination` (topic), and `db_system` where appropriate, to make traces/metrics more useful for RCA and correlation.
To support PromQL queries that use `histogram_quantile()`, the simulator emits Prom-style histogram counters for key latency metrics in addition to OpenTelemetry histograms. For example, `transaction_latency_seconds` has these exported counters:

- `transaction_latency_seconds_bucket{le="..."}` (monotonic cumulative buckets)
- `transaction_latency_seconds_sum` (monotonic sum of latencies)
- `transaction_latency_seconds_count` (monotonic count)

The same applies to `db_latency_seconds` and other configured histogram metrics. The simulator emits these bucketed counters with the same attributes and labels you configure (e.g., `service_name`, `OrgId`), so PromQL queries like `histogram_quantile(0.95, sum(rate(transaction_latency_seconds_bucket[5m])) by (le))` return meaningful values.
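Prom-style buckets are cumulative: each `le` bucket counts every observation less than or equal to its boundary. A small stdlib-only sketch of that export behaviour (not the simulator's actual code):

```go
package main

import "fmt"

// cumulativeBuckets computes Prom-style cumulative bucket counts for the
// given le boundaries from raw latency samples, plus the running sum.
func cumulativeBuckets(samples, bounds []float64) (counts []uint64, sum float64) {
	counts = make([]uint64, len(bounds))
	for _, s := range samples {
		sum += s
		for i, le := range bounds {
			if s <= le {
				counts[i]++ // cumulative: a sample lands in every bucket it fits
			}
		}
	}
	return counts, sum
}

func main() {
	bounds := []float64{0.05, 0.1, 0.5} // le boundaries in seconds
	counts, sum := cumulativeBuckets([]float64{0.02, 0.07, 0.3}, bounds)
	fmt.Printf("%v %.2f\n", counts, sum) // [1 2 3] 0.39
}
```

Because the buckets are cumulative and monotonic, `rate()` over `_bucket` series behaves correctly across export cycles, which is what `histogram_quantile()` relies on.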
The simulator can emit attributes (labels) on metrics and traces, configured under `telemetry.labels` in `simulator-config.yaml`. Typical label sets include `OrgId`, `OrgName`, and `transaction_type`. Keep label cardinality small (5-25 unique values) to avoid large memory and ingestion costs in backends.
Example label config in simulator-config.yaml:
```yaml
telemetry:
  labels:
    org_ids: ["bank_01", "bank_02", "bank_03"]
    org_names: ["BankOne", "BankTwo", "BankThree"]
    transaction_types: ["merchant_payment", "p2p", "bill_payment"]
```

We'll take these up one at a time; the completed items are marked with ✅ above. Tell me which remaining item you'd like me to start next and I'll open a focused PR/branch for it.