Expected Behavior
The lock-and-fetch method can work with a large set of schedules.
Current Behavior
We see a big performance degradation when there is a task-scheduling storm, i.e. roughly 100k tasks scheduled within 5 seconds. The scheduler's lockAndFetch method then runs the SQL below:
UPDATE xxx_scheduled_tasks st1
SET picked = true, picked_by = 'abc', last_heartbeat = '20000123T012345.678+0900', version = version + 1
WHERE (st1.task_name, st1.task_instance) IN (
SELECT st2.task_name, st2.task_instance FROM xxx_scheduled_tasks st2
WHERE picked = false and execution_time <= '20000123T012345.678+0900' order by execution_time asc LIMIT 20 FOR UPDATE SKIP LOCKED)
RETURNING st1.* ;
In our case there are 5 instances running this SQL. Unfortunately, Postgres seems to always perform the LockRows step before the Limit. Below is a sample execution plan:
explain
UPDATE xxx_scheduled_tasks st1
SET picked = true, picked_by = 'abc', last_heartbeat = '20000123T012345.678+0900', version = version + 1
WHERE (st1.task_name, st1.task_instance) IN (
SELECT st2.task_name, st2.task_instance FROM xxx_scheduled_tasks st2
WHERE picked = false and execution_time <= '20000123T012345.678+0900' order by execution_time asc LIMIT 20 FOR UPDATE SKIP LOCKED)
RETURNING st1.* ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Update on xxx_scheduled_tasks st1  (cost=248.34..254.38 rows=1 width=217)
   ->  Nested Loop  (cost=248.34..254.38 rows=1 width=217)
         ->  HashAggregate  (cost=144.08..144.09 rows=1 width=94)
               Group Key: "ANY_subquery".task_name, "ANY_subquery".task_instance
               ->  Subquery Scan on "ANY_subquery"  (cost=144.05..144.07 rows=1 width=94)
                     ->  Limit  (cost=144.05..144.06 rows=1 width=49)
                           ->  LockRows  (cost=144.05..144.06 rows=1 width=49)
                                 ->  Sort  (cost=144.05..144.05 rows=1 width=49)
                                       Sort Key: st2.execution_time
                                       ->  Seq Scan on xxx_scheduled_tasks st2  (cost=0.00..144.04 rows=1 width=49)
                                             Filter: ((NOT picked) AND (execution_time <= '2000-01-22 16:23:45.678+00'::timestamp with time zone))
         ->  Bitmap Heap Scan on xxx_scheduled_tasks st1  (cost=104.27..108.28 rows=1 width=117)
               Recheck Cond: ((task_name = "ANY_subquery".task_name) AND (task_instance = "ANY_subquery".task_instance))
               ->  Bitmap Index Scan on xxx_scheduled_tasks_pkey  (cost=0.00..104.27 rows=1 width=0)
                     Index Cond: ((task_name = "ANY_subquery".task_name) AND (task_instance = "ANY_subquery".task_instance))
Looking at the plan, it appears that the LockRows step runs before the Limit operation. Does this mean that when the number of matching rows is large, Postgres locks rows before the Limit is applied?
In our case, with 130k past-due tasks in the DB, the query above took 1 hour to complete. If this is a performance limitation of the lock-and-fetch case, it may be caused not by DB Scheduler but by PostgreSQL's query planner. Shall we document this, so people can use the feature with caution?
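One mitigation we are considering (untested on our workload; the index name is hypothetical): a partial index matching the subquery's predicate and its ORDER BY, which may let the planner replace the Seq Scan + Sort with an ordered index scan, so rows are streamed in execution_time order and scanning can stop once the Limit is satisfied:

```sql
-- Hypothetical partial index (name and benefit are assumptions, not
-- verified). It covers the subquery's filter (picked = false) and its
-- sort key (execution_time), so the planner can walk the index in
-- order instead of seq-scanning and sorting all past-due rows first.
CREATE INDEX CONCURRENTLY xxx_scheduled_tasks_unpicked_idx
    ON xxx_scheduled_tasks (execution_time)
    WHERE picked = false;
```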
PS: when we switched back to the "fetch" polling strategy, this problem went away.
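For reference, this is the configuration change we made. The property names below are our reading of the db-scheduler-spring-boot-starter documentation and should be treated as assumptions; please verify them against the starter's README for your version:

```yaml
# application.yml -- sketch, assuming the spring-boot-starter's
# property names for selecting the polling strategy.
db-scheduler:
  polling-strategy: fetch   # was: lock_and_fetch
  polling-strategy-lower-limit-fraction-of-threads: 0.5
  polling-strategy-upper-limit-fraction-of-threads: 3.0
```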
For bug reports
Steps to Reproduce the bug
- Run 5 instances of DB Scheduler (with Spring Boot)
- Schedule 100k+ tasks within 2 or 3 seconds
- Observe the performance of DB Scheduler; the query above may take 1 hour to respond
Context
- DB-Scheduler Version: 11.7
- Java Version: 11
- Spring Boot (check for Yes): [x]
- Database and Version: PostgreSQL 13.10
Logs