[CDCSDK] CDC Long Retention: Severe Throughput Drop, Latency Spike, and Memory Overhead in PostgreSQL Connector After 6-Hour Downtime with 24-Hour CDC Retention (PG Connector [new YBDB connector])
#24845
Note: I realised I had already raised another issue, #24837, for a 22-hour connector downtime scenario. The observations were similar, but with one key difference: in that case, after the initial throughput drop, throughput recovered after a few hours. This was likely because the CDC stream had expired: message emission was significantly lower after redeployment, so CDC had effectively stopped, which allowed throughput to return to normal. However, the metrics and resource usage do not confirm that the stream actually expired; in short, CDC had somehow stopped working in that run, and that is why throughput resumed.
In the current scenario (6-hour downtime with 24-hour retention), throughput remains close to zero, indicating different behaviour, presumably because of the shorter downtime and the still-active retention period.
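For reference, one way to sanity-check whether the stream is still usable (rather than expired) is to look at the logical replication slot the PG connector consumes from on the YSQL side. A minimal sketch, assuming the connector is backed by a replication slot and that YSQL is reachable on the default port 5433; host, credentials, and database name are illustrative, not values from this experiment:

```python
# Minimal sketch: inspect the logical replication slot used by the
# PostgreSQL-compatible connector. Host, credentials, and database name are
# illustrative assumptions, not the experiment's actual values.
import psycopg2

conn = psycopg2.connect(
    host="127.0.0.1", port=5433,                 # YSQL default port
    dbname="yugabyte", user="yugabyte", password="yugabyte",
)
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT slot_name, active, restart_lsn, confirmed_flush_lsn "
        "FROM pg_replication_slots"
    )
    for slot_name, active, restart_lsn, confirmed_flush_lsn in cur.fetchall():
        # A slot that still exists but is inactive or far behind points to the
        # connector not consuming, rather than the stream having expired.
        print(slot_name, active, restart_lsn, confirmed_flush_lsn)
conn.close()
```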
Jira Link: DB-13960
Description
In a controlled experiment testing the CDC long retention feature, the PostgreSQL connector was redeployed after 6 hours of downtime with a 24-hour retention setting. Observations post-deployment indicate a substantial performance degradation.
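For context, the redeploy step amounts to deleting and re-registering the connector against Kafka Connect. The sketch below is a hypothetical reproduction of that step via the Kafka Connect REST API; the Connect URL, connector name, class, and property values are assumptions for illustration, not the configuration actually used (the 24-hour retention itself is set on the YugabyteDB side via the CDC retention settings, not in the connector config):

```python
# Minimal sketch of the "downtime + redeploy" step via the Kafka Connect REST
# API. All hosts, names, and property values are illustrative assumptions.
import time
import requests

CONNECT_URL = "http://localhost:8083"      # Kafka Connect REST endpoint
CONNECTOR_NAME = "ybdb-pg-connector"       # hypothetical connector name

connector_config = {
    "name": CONNECTOR_NAME,
    "config": {
        # Connector class name is assumed from the ybdb-debezium image listed
        # below; treat it and the remaining properties as placeholders.
        "connector.class": "io.debezium.connector.postgresql.YugabyteDBConnector",
        "database.hostname": "127.0.0.1",
        "database.port": "5433",
        "database.user": "yugabyte",
        "database.password": "yugabyte",
        "database.dbname": "yugabyte",
        "topic.prefix": "ybdb",
        "table.include.list": "public.test_table",
        "slot.name": "yb_slot",
    },
}

# Simulate the connector downtime: delete it, wait, then re-register it.
requests.delete(f"{CONNECT_URL}/connectors/{CONNECTOR_NAME}")
time.sleep(6 * 3600)                       # 6 hours of downtime, as in the experiment
resp = requests.post(f"{CONNECT_URL}/connectors", json=connector_config)
resp.raise_for_status()
```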
The following issues were noted:
Reduced Message Emission Post-Restart:
Upon redeploying the connector, the message emission rate was significantly lower than the initial steady-state rate observed under normal conditions. At this rate the connector will never catch up; worse, under a continuously running workload the lag keeps increasing rather than holding constant (see the emission-rate sketch after this list).
Throughput Drop:
Immediately after the connector was restarted, throughput dropped sharply from 3,800 ops/sec to 38 ops/sec.
Latency Spike:
Latency increased from an initial ~1 ms to several hundred milliseconds, peaking at about 600 ms.
Disk I/O Analysis:
Disk throughput and IOPS were analysed and ruled out as potential bottlenecks during this period.
Comparison with gRPC Connector:
Although a throughput drop was also observed with the gRPC connector, it was not as severe as with the PostgreSQL connector; the gRPC connector issue is tracked separately in #24534. Note that the PostgreSQL connector was run with an even lighter workload than the gRPC connector, and it was verified that in the happy path there is nearly no lag.
Increased Memory Consumption:
Post-redeployment, memory consumption increased to 3.15 GB, attributed to per-tablet overhead under untracked memory (see the memory-tracker sketch after the screenshot list).
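To make the emission-rate observation concrete: the rates above were read from the connector and topic metrics, but the same signal can be approximated by periodically sampling the end offsets of the connector's output topic. A minimal sketch using kafka-python, with an assumed broker address and topic name:

```python
# Minimal sketch: approximate the connector's emission rate by sampling the
# end offsets of its output topic. Broker address and topic name are
# illustrative assumptions.
import time
from kafka import KafkaConsumer, TopicPartition

TOPIC = "ybdb.public.test_table"
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]

def total_end_offset():
    # Sum of end offsets across partitions == total messages produced so far.
    return sum(consumer.end_offsets(partitions).values())

prev = total_end_offset()
while True:
    time.sleep(10)
    curr = total_end_offset()
    # Messages emitted per second over the last interval; in the regression
    # described above this stays far below the workload's ops/sec, so lag grows.
    print(f"emitted ~{(curr - prev) / 10:.1f} msgs/sec")
    prev = curr
```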
The screenshots below confirm the above observations:
Messages emitted and lag graph:
Performance degradation:
Disk is not the bottleneck:
TServer metrics:
DocDB:
Memory graphs:
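On the memory observation: the per-tablet overhead shows up in the tserver's memory-tracker breakdown. A minimal sketch for pulling that breakdown, assuming the default tserver web UI port (9000) and its /mem-trackers page; adjust host and port for the actual cluster:

```python
# Minimal sketch: dump the yb-tserver memory-tracker breakdown to see how much
# of the ~3.15 GB sits under untracked / per-tablet trackers. Host and port
# (default tserver web UI) are assumptions.
import requests

TSERVER_WEB = "http://127.0.0.1:9000"

resp = requests.get(f"{TSERVER_WEB}/mem-trackers", timeout=10)
resp.raise_for_status()
# The page is HTML; for a quick look, save it and search for the trackers of
# interest (e.g. entries mentioning CDC or per-tablet memory).
with open("mem-trackers.html", "w") as f:
    f.write(resp.text)
print("saved tserver memory breakdown to mem-trackers.html")
```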
Source connector version
quay.io/yugabyte/ybdb-debezium:dz.2.5.2.yb.2024.1
Connector configuration
YugabyteDB version
2024.2.0.0-b116
Issue Type
kind/bug