[CDCSDK] CDC Long Retention: Severe Throughput Drop, Latency Spike, and Memory Overhead in PostgreSQL Connector After 6-Hour Downtime with 24-Hour CDC Retention (PG Connector [new YBDB connector]) #24845

Open
shamanthchandra-yb opened this issue Nov 8, 2024 · 1 comment
Labels: area/cdcsdk (CDC SDK), kind/bug (This issue is a bug), priority/high (High Priority)

Comments


shamanthchandra-yb commented Nov 8, 2024

Jira Link: DB-13960

Description

In a controlled experiment testing the CDC long retention feature, the PostgreSQL connector was redeployed after 6 hours of downtime with a 24-hour retention setting. Observations post-deployment indicate a substantial performance degradation.
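For context, the issue does not list the exact server-side settings used for the 24-hour retention. The sketch below shows the retention-related flags that would typically be raised for such a test; the specific flags and values are my assumption based on the stated 24-hour retention, not the verified configuration of this run.

```sh
# Assumed retention configuration for this experiment (not confirmed in the issue):
# keep CDC intents and WAL long enough for the connector to resume after a long downtime.

# yb-tserver flag: how long CDC intent records are retained, in ms (24 h = 86400000 ms).
--cdc_intent_retention_ms=86400000

# yb-master flag: how long WAL segments are retained for CDC-enabled tables, in s (24 h = 86400 s).
--cdc_wal_retention_time_secs=86400
```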

The following issues were noted:

Reduced Message Emission Post-Restart:
After the connector was redeployed, the message emission rate was significantly lower than the steady-state rate observed under normal conditions. At this rate the connector will never catch up; worse, with the workload running continuously, the lag keeps growing rather than holding constant.

Throughput Drop:
Immediately after the connector was restarted, throughput dropped sharply from 3,800 ops/sec to 38 ops/sec.

Latency Spike:
Latency increased from an initial 1 ms to several hundred milliseconds, peaking at around 600 ms.

Disk I/O Analysis:
Disk throughput and IOPS were analysed and ruled out as potential bottlenecks during this period.

Comparison with gRPC Connector:
A throughput drop was also observed with the gRPC connector, but it was not as severe as with the PostgreSQL connector; that issue is tracked separately in #24534. Note that the PostgreSQL connector was run with an even lighter workload than the gRPC connector, chosen so that there is nearly no lag on the happy path.

Increased Memory Consumption:
Post-redeployment, memory consumption increased to 3.15 GB, attributed to per-tablet overhead under untracked memory.
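The lag and memory observations above were taken from the platform graphs shown below. For anyone reproducing this, an equivalent manual check can be made against the tserver HTTP endpoints; the host/port below are placeholders, and treating `cdcsdk_sent_lag_micros` and the `/mem-trackers` page as the right places to look is my assumption based on standard YugabyteDB tooling, not a step documented in this report.

```sh
# Hypothetical manual checks (host/port are placeholders):

# CDC send lag per tablet, in microseconds, from the tserver Prometheus endpoint.
curl -s http://<tserver>:9000/prometheus-metrics | grep cdcsdk_sent_lag_micros

# Per-component memory breakdown, including per-tablet and untracked/overhead memory.
curl -s http://<tserver>:9000/mem-trackers
```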

The screenshots below are attached for reference and confirm the above observations:

Msgs emitted and lag graph:

Screenshot 2024-11-08 at 5 49 26 PM

Perf degradation:

Screenshot 2024-11-08 at 5 54 07 PM

Disk is not a bottleneck:
Screenshot 2024-11-08 at 5 55 07 PM

Tserver metrics:
Screenshot 2024-11-08 at 5 56 09 PM

DocDB metrics:
Screenshot 2024-11-08 at 5 56 24 PM

Memory graphs:
Screenshot 2024-11-08 at 6 19 40 PM
Screenshot 2024-11-08 at 6 18 20 PM

Source connector version

quay.io/yugabyte/ybdb-debezium:dz.2.5.2.yb.2024.1

Connector configuration

{"name":"ybconnector_cdc_4cd5f0_test_cdc_653cca","config":{"connector.class":"io.debezium.connector.postgresql.YugabyteDBConnector","topic.creation.default.partitions":"2","slot.name":"rs_cdc_4cd5f0_6876ec_from_con_d8cd","tasks.max":"5","publication.name":"pn_ybconnector_cdc_4cd5f0_test_cdc_653cca","max.connector.retries":"10","database.masterport":"7100","topic.prefix":"ybconnector_cdc_4cd5f0_test_cdc_653cca","operation.timeout.ms":"600000","socket.read.timeout.ms":"300000","topic.creation.default.replication.factor":"1","publication.autocreate.mode":"filtered","admin.operation.timeout.ms":"600000","database.user":"yugabyte","database.dbname":"cdc_4cd5f0","topic.creation.default.compression.type":"lz4","topic.creation.default.cleanup.policy":"delete","database.port":"5433","plugin.name":"yboutput","database.master.addresses":"172.151.20.142:7100,172.151.30.116:7100,172.151.29.95:7100","database.hostname":"172.151.20.142:5433,172.151.30.116:5433,172.151.29.95:5433","database.password":"yugabyte","name":"ybconnector_cdc_4cd5f0_test_cdc_653cca","database.sslrootcert":"/kafka/ca.crt","table.include.list":"public.test_cdc_653cca","database.masterhost":"172.151.29.95","snapshot.mode":"never"},"tasks":[{"connector":"ybconnector_cdc_4cd5f0_test_cdc_653cca","task":0}],"type":"source"}

YugabyteDB version

2024.2.0.0-b116

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@shamanthchandra-yb shamanthchandra-yb added priority/high High Priority area/cdcsdk CDC SDK status/awaiting-triage Issue awaiting triage labels Nov 8, 2024
@yugabyte-ci yugabyte-ci added the kind/bug This issue is a bug label Nov 8, 2024
shamanthchandra-yb (Author) commented:

I just realised I had raised another issue, #24837, for a 22-hour connector downtime scenario. The observations were similar, but with a key difference: in that case, after the initial throughput drop, throughput recovered after a few hours. This was likely because the CDC stream expired: message emission was significantly lower after redeployment, effectively stopping CDC, which allowed workload throughput to return to normal. However, the metrics and resource graphs do not conclusively confirm that the stream expired; in effect, CDC had somehow stopped working, and that is why throughput recovered there.

In the current scenario (6-hour downtime with 24-hour retention), throughput remains close to zero, indicating a different behaviour due to the shorter downtime and active retention period.
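To help distinguish the two scenarios (expired stream vs. live-but-stalled stream), checks like the ones below could confirm whether the stream and its replication slot are still active. Using yb-admin and pg_replication_slots for this is my suggestion rather than a step taken in the report; the addresses, database name, and slot name are taken from the connector configuration above.

```sh
# Check CDCSDK stream and replication slot state (addresses copied from the connector config).

# List CDC streams known to the masters and confirm the one backing this connector still exists.
yb-admin -master_addresses 172.151.20.142:7100,172.151.30.116:7100,172.151.29.95:7100 \
    list_change_data_streams

# The PG-compatible connector also surfaces as a logical replication slot in YSQL.
ysqlsh -h 172.151.20.142 -d cdc_4cd5f0 \
    -c "SELECT slot_name, active, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = 'rs_cdc_4cd5f0_6876ec_from_con_d8cd';"
```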
