Tips and scripts to test the resilliency of a Postgres service.
As we are intentially trying to crash the service, for the purpsoes of testing, use the lightest machine possible.
There's no need to wait for a TB of disk to fill, if you can emulate with a GB.
Failure Scenario:
Insufficient disk space prevents PostgreSQL from writing new data or creating temporary files.
Emulation:
- disk_full.sql
- Using
generate_series()
- 30M rows generates ~1GB table
- Tables can be
TEMP
or persisted
- Using
- pgbench
Aiven Handling:
Service will move to read-only at the high-watermark.
The serivice can be returned to read-write for maintenance.
[postgresql-16][29-1] pid=82,user=,db=,app=,client= LOG: parameter "default_transaction_read_only" changed to "on"
...
[postgresql-16][31-1] pid=82,user=,db=,app=,client= LOG: parameter "default_transaction_read_only" changed to "off"
Failure Scenario:
Queries or background processes consume excessive memory, causing the OS to terminate PostgreSQL.
Emulation:
- out_of_memory.sql
- work_mem:
Maximizework_mem
, ensuring PostgreSQL keeps intermediate results in memory. - WITH RECURSIVE:
Creates a recursive CTE that builds a large in-memory array structure. Each iteration appends to the data array, consuming more memory. - pg_sleep(5):
Delays the final result, allowing you to monitor the memory usage in real-time. - COUNT(*): Ensures the recursive CTE is fully evaluated.
- work_mem:
Aiven Handling:
Early OOM Detection will kill the process in the container before the OS hits OOM Killer.
See: Out of Memory Conditions
earlyoom: low memory! at or below SIGTERM limits: mem 10.00%, swap 4.00%
earlyoom: mem avail: 31 of 1960 MiB ( 1.63%), swap free: 94 of 2455 MiB ( 3.84%)
earlyoom: escalating to SIGKILL after 0.5 seconds
postgresql-16: [32-1] pid=82,user=,db=,app=,client= LOG: server process (PID 64684) was terminated by signal 9: Killed
Failure Scenario: CPU-intensive queries or other Emulationes starve PostgreSQL of CPU resources.
Emulation:
Queries can be executed in parallel, by multiple clients, for a computational high intesity.
- high_put.sql
- max_parallel_workers_per_gather:
Configures the number of workers PostgreSQL can use for parallel execution. - generate_series(1, 1e8):
Generates a large series of numbers to process. Adjust the upper limit (1e8) to control the workload. Can be processed in parallel, distributing the workload across CPU cores. - SQRT and LOG:
Apply computationally expensive mathematical functions to each value in the series. - SUM:
Aggregates the results, further increasing CPU utilization.
- max_parallel_workers_per_gather:
Aiven Handling:
The service exists in a container, with a constrained CPU quota of 90%
, ensuring there is always overhead remaining for platform orchestration.
- Warning:
Load15 on service is at least 5 times the cpu count
- Critical:
Load15 on service is at least 15 times the cpu count
Failure Scenario:
A node loses connectivity to other nodes or clients.
Emulation:
Use iptables to block or drop network traffic:
iptables -A INPUT -p tcp --dport 5432 -j DROP
Aiven Handling:
When a server unexpectedly disconnects, there is no certain way to know whether it really disappeared or whether there is a temporary glitch in the cloud provider's network. Aiven's management platform has different procedures in case of primary or replica nodes disconnections.
See: Uncontrolled primary/replica disconnection
Failure Scenario:
Postmaster or child crashes due to a bug, misconfiguration, user termination or out-of-memory condition.
Emulation:
Kill the Postgres service: kill -9 $(pgrep postgres)
.
Aiven Handling:
We will restart the container process.
Note that node monitor for RUNNING
state and pglookout
still apply.
A node with service that does not return to expected state within thresholds will be replaced.
postgresql-16: [36-1] pid=64817,user=postgres,db=_aiven,app=[unknown],client=[local] FATAL: the database system is in recovery mode
InstanceManager InstanceManager ERROR Unexpected exception: Failed check, took 6.12s. Sleeping for 2.00s.
postgresql-16: [35-1] pid=64814,user=,db=,app=,client= LOG: checkpoint starting: end-of-recovery immediate wait
postgresql-16: [35-1] pid=82,user=,db=,app=,client= LOG: database system is ready to accept connections
Failure Scenario:
Replication fails due to missing WAL files or network issues.
Emulation:
- Replication Lag
- Logical Replication:
ALTER SUBSCRIPTION <subscription_name> DISABLE;
- Streaming Replication:
- Using
pg_hba.conf
, comment out or modify the replication entry for the replica's IP.
- Using
- Logical Replication:
- WAL Corruption
- Identify and remove a WAL pending synchronization:
SELECT * FROM pg_stat_replication;
rm -f /path/to/primary/data/pg_wal/<wal_file_name>
Aiven Handling:
pglookout is a PostgreSQL® replication monitoring and failover daemon. pglookout monitors PG database nodes and their replication status and acts according to that status, for example calling a predefined failover command to promote a new primary in case the previous one goes missing.
The max_failover_replication_time_lag
setting monitors lag after which failover_command
will be executed.
PGLookout, determines which of the standby nodes is the furthest along in replication (has the least potential for data loss) and does a controlled failover to that node.
Failure Scenario:
Relational data files become corrupted due to disk issues or faulty hardware.
Emulation:
Manually edit or delete files in the data directory.
Aiven Handling:
Errors will appear in postgres when attempting to read the file.
Example Errors:
WARNING: page verification failed, calculated checksum 20919 but expected 15254
ERROR: invalid page in block 4565901 of relation base/16427/837074
FATAL: relation mapping file "base/16427/pg_filenode.map" contains invalid data
The relfiles follow a path based on the oid
for the database in pg_database
...
defaultdb=> select oid,datname from pg_database where oid = '16427';
oid | datname
-------+-----------
16427 | defaultdb
(1 row)
And the relfilenode
for the object in pg_class
:
defaultdb=> select oid, relname, relfilenode
from pg_class where relfilenode = '837074';
oid | relname | relfilenode
--------+------------------+-------------
837074 | pgbench_accounts | 837074
(1 row)
In most instances, it will be advisable to fail over to a replica.
However, depending on the situation, it may also be possible to:
DROP
the objectREINDEX
a corrupted indexVACUUM
damaged pages- Use PITR and export/import the corrupted data range