pg-chaos

Tips and scripts to test the resiliency of a Postgres service.
Since we are intentionally trying to crash the service for testing purposes, use the lightest machine possible.

There's no need to wait for a TB of disk to fill if you can emulate the same condition with a GB.

Disk Full

Failure Scenario:
Insufficient disk space prevents PostgreSQL from writing new data or creating temporary files.

Emulation:

  • disk_full.sql (a sketch follows this list)
    • Using generate_series()
    • 30M rows generates ~1GB table
    • Tables can be TEMP or persisted
  • pgbench
    • Executing pgbench --initialize --quiet --scale 100 --unlogged -I dtg --fillfactor 10 $DB
    • 1 scaling factor = 100k rows
    • Calculate size via formula
    • Explicitly set the data generation steps
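
A minimal sketch of the disk_full.sql approach, assuming a single integer column is enough to hit the target size (the table name chaos_disk_fill is illustrative, not from the repo):

-- 30M single-integer rows come out near 1GB once per-row heap overhead is counted.
-- Keep TEMP to fill temporary space only; drop TEMP to persist the bloat.
CREATE TEMP TABLE chaos_disk_fill AS
SELECT g
FROM generate_series(1, 30000000) AS g;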

Aiven Handling:
The service will move to read-only at the disk usage high-watermark.
The service can be returned to read-write for maintenance.

[postgresql-16][29-1] pid=82,user=,db=,app=,client= LOG: parameter "default_transaction_read_only" changed to "on"
...
[postgresql-16][31-1] pid=82,user=,db=,app=,client= LOG: parameter "default_transaction_read_only" changed to "off"
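
From any client session, the flip can be confirmed by checking the same parameter referenced in the log lines above:

SHOW default_transaction_read_only;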

Out of Memory (OOM)

Failure Scenario:
Queries or background processes consume excessive memory, causing the OS to terminate PostgreSQL.

Emulation:

  • out_of_memory.sql (a sketch follows this list)
    • work_mem:
      Maximize work_mem, ensuring PostgreSQL keeps intermediate results in memory.
    • WITH RECURSIVE:
      Creates a recursive CTE that builds a large in-memory array structure. Each iteration appends to the data array, consuming more memory.
    • pg_sleep(5):
      Delays the final result, allowing you to monitor the memory usage in real-time.
    • COUNT(*): Ensures the recursive CTE is fully evaluated.
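
A rough sketch of the out_of_memory.sql pattern described above (the work_mem value, CTE name, and iteration bound are illustrative, not taken from the repo):

-- Raise work_mem so intermediate results stay in memory instead of spilling.
SET work_mem = '2GB';

WITH RECURSIVE memhog AS (
    SELECT 1 AS i, ARRAY[md5('seed')] AS data
    UNION ALL
    SELECT i + 1, data || md5(i::text)   -- each iteration appends to the array
    FROM memhog
    WHERE i < 10000000                   -- bound set far past what a small node can hold
)
-- COUNT(*) forces full evaluation; pg_sleep(5) holds the session so memory
-- usage can be watched in real time.
SELECT count(*), pg_sleep(5) FROM memhog;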

Aiven Handling:
Early OOM Detection will kill the process in the container before the OS-level OOM killer is invoked.
See: Out of Memory Conditions

earlyoom: low memory! at or below SIGTERM limits: mem 10.00%, swap  4.00%                                  
earlyoom: mem avail:    31 of  1960 MiB ( 1.63%), swap free:   94 of 2455 MiB ( 3.84%)
earlyoom: escalating to SIGKILL after 0.5 seconds

postgresql-16: [32-1] pid=82,user=,db=,app=,client= LOG:  server process (PID 64684) was terminated by signal 9: Killed

High CPU Usage

Failure Scenario:
CPU-intensive queries or other processes starve PostgreSQL of CPU resources.

Emulation:
Queries can be executed in parallel, by multiple clients, to generate high computational intensity.

  • high_put.sql (a sketch follows this list)
    • max_parallel_workers_per_gather:
      Configures the number of workers PostgreSQL can use for parallel execution.
    • generate_series(1, 1e8):
      Generates a large series of numbers to process. Adjust the upper limit (1e8) to control the workload. Can be processed in parallel, distributing the workload across CPU cores.
    • SQRT and LOG:
      Apply computationally expensive mathematical functions to each value in the series.
    • SUM:
      Aggregates the results, further increasing CPU utilization.
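
A sketch of the high_put.sql idea under the assumptions above (the worker count is illustrative; adjust the series bound to control the workload):

-- Raise the per-gather parallel worker cap, per the notes above.
SET max_parallel_workers_per_gather = 4;

-- SQRT and LOG on every value of a large series, aggregated with SUM,
-- keeps the backend CPU-bound; raise the upper bound for a longer burn.
SELECT sum(sqrt(g) + log(g))
FROM generate_series(1, 1e8) AS g;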

Aiven Handling:
The service runs in a container with a CPU quota constrained to 90%, ensuring there is always headroom for platform orchestration.

  • Warning: load15 on the service is at least 5 times the CPU count
  • Critical: load15 on the service is at least 15 times the CPU count

Network Partition (existing sabotage)

Failure Scenario:
A node loses connectivity to other nodes or clients.

Emulation:
Use iptables to block or drop network traffic:
iptables -A INPUT -p tcp --dport 5432 -j DROP

Aiven Handling:
When a server unexpectedly disconnects, there is no certain way to know whether it really disappeared or whether there is a temporary glitch in the cloud provider's network. Aiven's management platform has different procedures for primary and replica node disconnections.

See: Uncontrolled primary/replica disconnection

Postmaster Crash

Failure Scenario:
The postmaster or a child process crashes due to a bug, misconfiguration, user termination, or an out-of-memory condition.

Emulation:
Kill the Postgres service: kill -9 $(pgrep postgres).

Aiven Handling:
We will restart the container process.
Note that node monitoring for the RUNNING state and pglookout checks still apply.
A node whose service does not return to the expected state within the thresholds will be replaced.

postgresql-16: [36-1] pid=64817,user=postgres,db=_aiven,app=[unknown],client=[local] FATAL:  the database system is in recovery mode

InstanceManager       InstanceManager ERROR     Unexpected exception: Failed check, took 6.12s. Sleeping for 2.00s.

postgresql-16: [35-1] pid=64814,user=,db=,app=,client= LOG:  checkpoint starting: end-of-recovery immediate wait

postgresql-16: [35-1] pid=82,user=,db=,app=,client= LOG:  database system is ready to accept connections

Logical Replication or Streaming Replication Breakage

Failure Scenario:
Replication fails due to missing WAL files or network issues.

Emulation:

  • Replication Lag
    • Logical Replication:
      • ALTER SUBSCRIPTION <subscription_name> DISABLE;
    • Streaming Replication:
      • Using pg_hba.conf, comment out or modify the replication entry for the replica's IP.
  • WAL Corruption
    • Identify and remove a WAL file pending synchronization (see the query after this list):
    • SELECT * FROM pg_stat_replication;
    • rm -f /path/to/primary/data/pg_wal/<wal_file_name>
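
A query along these lines (illustrative, not from the repo) narrows SELECT * FROM pg_stat_replication down to each standby's replay position; WAL segments past that point are still pending on the standby and are the candidates to remove on the primary for this test:

-- Map each standby's replay position to a WAL segment name and byte lag.
SELECT application_name,
       pg_walfile_name(replay_lsn) AS replayed_wal_file,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;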

Aiven Handling:
pglookout is a PostgreSQL® replication monitoring and failover daemon. pglookout monitors PG database nodes and their replication status and acts according to that status, for example calling a predefined failover command to promote a new primary in case the previous one goes missing.

The max_failover_replication_time_lag setting defines the replication lag after which failover_command will be executed.

pglookout determines which of the standby nodes is furthest along in replication (has the least potential for data loss) and performs a controlled failover to that node.

File Corruption

Failure Scenario:
Relation data files become corrupted due to disk issues or faulty hardware.

Emulation:
Manually edit or delete files in the data directory.

Aiven Handling:
Errors will appear in the Postgres logs when the corrupted file is read.

Example Errors:

WARNING: page verification failed, calculated checksum 20919 but expected 15254

ERROR: invalid page in block 4565901 of relation base/16427/837074

FATAL: relation mapping file "base/16427/pg_filenode.map" contains invalid data

The relation files follow a path based on the oid of the database in pg_database...

defaultdb=> select oid,datname from pg_database where oid = '16427';
  oid  |  datname
-------+-----------
 16427 | defaultdb
(1 row)

And the relfilenode for the object in pg_class:

defaultdb=> select oid, relname, relfilenode
from pg_class where relfilenode = '837074';
  oid   |     relname      | relfilenode
--------+------------------+-------------
 837074 | pgbench_accounts |      837074
(1 row)
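
pg_relation_filepath() resolves the same mapping in one step; run against the example above (and assuming the default tablespace), it returns the path referenced in the error:

defaultdb=> SELECT pg_relation_filepath('pgbench_accounts');
 pg_relation_filepath
----------------------
 base/16427/837074
(1 row)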

In most instances, it will be advisable to fail over to a replica.
However, depending on the situation, it may also be possible to:

  • DROP the object
  • REINDEX a corrupted index
  • VACUUM damaged pages
  • Use PITR and export/import the corrupted data range
