[CDCSDK] CDC Long Retention: Severe Throughput Drop, Latency Spike, and Memory Overhead in PostgreSQL Connector After 6-Hour Downtime with 24-Hour CDC Retention (PG Connector [new YBDB connector]) #24845

Open
shamanthchandra-yb opened this issue Nov 8, 2024 · 1 comment
Labels: area/cdcsdk (CDC SDK), kind/bug (This issue is a bug), priority/high (High Priority)

Comments


shamanthchandra-yb commented Nov 8, 2024

Jira Link: DB-13960

Description

In a controlled experiment testing the CDC long retention feature, the PostgreSQL connector was redeployed after 6 hours of downtime with a 24-hour retention setting. Observations post-deployment indicate a substantial performance degradation.
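For context, the issue does not list the exact server-side settings used for the 24-hour retention. The sketch below shows the retention-related flags that would typically be raised for such a test; the specific flags and values are my assumption based on the stated 24-hour retention, not the verified configuration of this run.

```sh
# Assumed retention configuration for this experiment (not confirmed in the issue):
# keep CDC intents and WAL long enough for the connector to resume after a long downtime.

# yb-tserver flag: how long CDC intent records are retained, in ms (24 h = 86400000 ms).
--cdc_intent_retention_ms=86400000

# yb-master flag: how long WAL segments are retained for CDC-enabled tables, in s (24 h = 86400 s).
--cdc_wal_retention_time_secs=86400
```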

The following issues were noted:

Reduced Message Emission Post-Restart:
After the connector was redeployed, the message emission rate was significantly lower than the steady-state rate observed under normal conditions. At this rate the connector will never catch up; worse, with the workload running continuously, the lag keeps growing rather than holding constant.

Throughput Drop:
Immediately after the connector was restarted, throughput dropped sharply from 3,800 ops/sec to 38 ops/sec.

Latency Spike:
Latency increased from an initial 1 ms to several hundred milliseconds, peaking at around 600 ms.

Disk I/O Analysis:
Disk throughput and IOPS were analysed and ruled out as potential bottlenecks during this period.

Comparison with gRPC Connector:
A throughput drop was also observed with the gRPC connector, but it was not as severe as with the PostgreSQL connector; that issue is tracked separately in #24534. Note that the PostgreSQL connector was run with an even lighter workload than the gRPC connector, chosen so that there is nearly no lag on the happy path.

Increased Memory Consumption:
Post-redeployment, memory consumption increased to 3.15 GB, attributed to per-tablet overhead under untracked memory.
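The lag and memory observations above were taken from the platform graphs shown below. For anyone reproducing this, an equivalent manual check can be made against the tserver HTTP endpoints; the host/port below are placeholders, and treating `cdcsdk_sent_lag_micros` and the `/mem-trackers` page as the right places to look is my assumption based on standard YugabyteDB tooling, not a step documented in this report.

```sh
# Hypothetical manual checks (host/port are placeholders):

# CDC send lag per tablet, in microseconds, from the tserver Prometheus endpoint.
curl -s http://<tserver>:9000/prometheus-metrics | grep cdcsdk_sent_lag_micros

# Per-component memory breakdown, including per-tablet and untracked/overhead memory.
curl -s http://<tserver>:9000/mem-trackers
```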

The screenshots below are attached for reference and confirm the above observations:

Msgs emitted and lag graph:

Screenshot 2024-11-08 at 5 49 26 PM

Perf degradation:

Screenshot 2024-11-08 at 5 54 07 PM

Disk is not a bottleneck:
Screenshot 2024-11-08 at 5 55 07 PM

Tserver metrics:
Screenshot 2024-11-08 at 5 56 09 PM

DocDB metrics:
Screenshot 2024-11-08 at 5 56 24 PM

Memory graphs:
Screenshot 2024-11-08 at 6 19 40 PM
Screenshot 2024-11-08 at 6 18 20 PM

Source connector version

quay.io/yugabyte/ybdb-debezium:dz.2.5.2.yb.2024.1

Connector configuration

{"name":"ybconnector_cdc_4cd5f0_test_cdc_653cca","config":{"connector.class":"io.debezium.connector.postgresql.YugabyteDBConnector","topic.creation.default.partitions":"2","slot.name":"rs_cdc_4cd5f0_6876ec_from_con_d8cd","tasks.max":"5","publication.name":"pn_ybconnector_cdc_4cd5f0_test_cdc_653cca","max.connector.retries":"10","database.masterport":"7100","topic.prefix":"ybconnector_cdc_4cd5f0_test_cdc_653cca","operation.timeout.ms":"600000","socket.read.timeout.ms":"300000","topic.creation.default.replication.factor":"1","publication.autocreate.mode":"filtered","admin.operation.timeout.ms":"600000","database.user":"yugabyte","database.dbname":"cdc_4cd5f0","topic.creation.default.compression.type":"lz4","topic.creation.default.cleanup.policy":"delete","database.port":"5433","plugin.name":"yboutput","database.master.addresses":"172.151.20.142:7100,172.151.30.116:7100,172.151.29.95:7100","database.hostname":"172.151.20.142:5433,172.151.30.116:5433,172.151.29.95:5433","database.password":"yugabyte","name":"ybconnector_cdc_4cd5f0_test_cdc_653cca","database.sslrootcert":"/kafka/ca.crt","table.include.list":"public.test_cdc_653cca","database.masterhost":"172.151.29.95","snapshot.mode":"never"},"tasks":[{"connector":"ybconnector_cdc_4cd5f0_test_cdc_653cca","task":0}],"type":"source"}

YugabyteDB version

2024.2.0.0-b116

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@shamanthchandra-yb shamanthchandra-yb added priority/high High Priority area/cdcsdk CDC SDK status/awaiting-triage Issue awaiting triage labels Nov 8, 2024
@yugabyte-ci yugabyte-ci added the kind/bug This issue is a bug label Nov 8, 2024
shamanthchandra-yb (Author) commented:

I just realised I had raised another issue, #24837, for a 22-hour connector downtime scenario. The observations were similar, but with a key difference: in that case, after the initial throughput drop, throughput recovered after a few hours. This was likely because the CDC stream expired: message emission was significantly lower after redeployment, effectively stopping CDC, which allowed workload throughput to return to normal. However, the metrics and resource graphs do not conclusively confirm that the stream expired; in effect, CDC had somehow stopped working, and that is why throughput recovered there.

In the current scenario (6-hour downtime with 24-hour retention), throughput remains close to zero, indicating a different behaviour due to the shorter downtime and active retention period.
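To help distinguish the two scenarios (expired stream vs. live-but-stalled stream), checks like the ones below could confirm whether the stream and its replication slot are still active. Using yb-admin and pg_replication_slots for this is my suggestion rather than a step taken in the report; the addresses, database name, and slot name are taken from the connector configuration above.

```sh
# Check CDCSDK stream and replication slot state (addresses copied from the connector config).

# List CDC streams known to the masters and confirm the one backing this connector still exists.
yb-admin -master_addresses 172.151.20.142:7100,172.151.30.116:7100,172.151.29.95:7100 \
    list_change_data_streams

# The PG-compatible connector also surfaces as a logical replication slot in YSQL.
ysqlsh -h 172.151.20.142 -d cdc_4cd5f0 \
    -c "SELECT slot_name, active, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = 'rs_cdc_4cd5f0_6876ec_from_con_d8cd';"
```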
