## Description

## Prerequisites
- I am running the latest version
- I checked the documentation and found no answer
- I checked to make sure that this issue has not already been filed
## Expected Behavior

The `lock-and-fetch` polling strategy should work with a large set of due tasks.
## Current Behavior

We see a big performance degradation when there is a task-scheduling storm, i.e. 100k tasks scheduled within 5 seconds. The scheduler's `lockAndFetch` method issues the following SQL:
```sql
UPDATE xxx_scheduled_tasks st1
SET picked = true, picked_by = 'abc', last_heartbeat = '20000123T012345.678+0900', version = version + 1
WHERE (st1.task_name, st1.task_instance) IN (
  SELECT st2.task_name, st2.task_instance FROM xxx_scheduled_tasks st2
  WHERE picked = false AND execution_time <= '20000123T012345.678+0900'
  ORDER BY execution_time ASC LIMIT 20 FOR UPDATE SKIP LOCKED)
RETURNING st1.*;
```
In our case, there are 5 instances running this SQL concurrently. Unfortunately, Postgres always seems to execute the `LockRows` node before the `Limit` node. Below is a sample execution plan:
```
explain
UPDATE xxx_scheduled_tasks st1
SET picked = true, picked_by = 'abc', last_heartbeat = '20000123T012345.678+0900', version = version + 1
WHERE (st1.task_name, st1.task_instance) IN (
  SELECT st2.task_name, st2.task_instance FROM xxx_scheduled_tasks st2
  WHERE picked = false AND execution_time <= '20000123T012345.678+0900'
  ORDER BY execution_time ASC LIMIT 20 FOR UPDATE SKIP LOCKED)
RETURNING st1.*;

                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Update on xxx_scheduled_tasks st1  (cost=248.34..254.38 rows=1 width=217)
   ->  Nested Loop  (cost=248.34..254.38 rows=1 width=217)
         ->  HashAggregate  (cost=144.08..144.09 rows=1 width=94)
               Group Key: "ANY_subquery".task_name, "ANY_subquery".task_instance
               ->  Subquery Scan on "ANY_subquery"  (cost=144.05..144.07 rows=1 width=94)
                     ->  Limit  (cost=144.05..144.06 rows=1 width=49)
                           ->  LockRows  (cost=144.05..144.06 rows=1 width=49)
                                 ->  Sort  (cost=144.05..144.05 rows=1 width=49)
                                       Sort Key: st2.execution_time
                                       ->  Seq Scan on xxx_scheduled_tasks st2  (cost=0.00..144.04 rows=1 width=49)
                                             Filter: ((NOT picked) AND (execution_time <= '2000-01-22 16:23:45.678+00'::timestamp with time zone))
         ->  Bitmap Heap Scan on xxx_scheduled_tasks st1  (cost=104.27..108.28 rows=1 width=117)
               Recheck Cond: ((task_name = "ANY_subquery".task_name) AND (task_instance = "ANY_subquery".task_instance))
               ->  Bitmap Index Scan on xxx_scheduled_tasks_pkey  (cost=0.00..104.27 rows=1 width=0)
                     Index Cond: ((task_name = "ANY_subquery".task_name) AND (task_instance = "ANY_subquery".task_instance))
```
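For what it's worth, the plan above sorts the output of a `Seq Scan` over every unpicked row before anything is locked. A partial index (a hypothetical sketch below; the index name is made up and not part of db-scheduler's schema) might at least let the inner query read rows in `execution_time` order instead of sorting the whole backlog, though it does not change where `LockRows` sits in the plan:

```sql
-- Hypothetical mitigation, not verified against db-scheduler's recommended schema:
-- a partial index so the subquery can walk unpicked rows in execution_time order.
CREATE INDEX idx_xxx_scheduled_tasks_due
    ON xxx_scheduled_tasks (execution_time)
    WHERE picked = false;
```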
Looking at the plan, we suspect the problem is that `LockRows` is executed before the `Limit` operation. Does this mean that when the number of eligible rows is large, all of them are locked before the `Limit` takes effect?

In our case, there are 130k past-due tasks in the DB, and the query above took 1 hour to complete. If this is a performance limitation of the lock-and-fetch case, it may not be caused by db-scheduler itself but by PostgreSQL's query planner. Shall we document this, so people can use lock-and-fetch with caution?
PS: when we switched back to the `fetch` polling strategy, this problem went away.
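For reference, this is roughly how we toggled the strategy. The property names below are what we understand the db-scheduler Spring Boot starter to expose, so treat them as an assumption rather than verified documentation:

```properties
# application.properties (assumed starter property names)
# switch from lock-and-fetch back to the default fetch strategy
db-scheduler.polling-strategy=fetch
```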
## For bug reports
### Steps to Reproduce the bug

- Run 5 instances of db-scheduler (with Spring Boot)
- Schedule 100k+ tasks within 2 or 3 seconds
- Observe the performance of db-scheduler; we see the query may take 1 hour to respond
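To reproduce the backlog without a full multi-instance setup, something like the following can seed past-due tasks directly in the table. This is a sketch: the column list is inferred from the UPDATE statement above, and a real db-scheduler table may require additional columns (e.g. task data):

```sql
-- Hypothetical seed script: insert 130k already-due, unpicked task instances.
INSERT INTO xxx_scheduled_tasks (task_name, task_instance, execution_time, picked, version)
SELECT 'storm-task', 'instance-' || i, now() - interval '1 minute', false, 1
FROM generate_series(1, 130000) AS i;
```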
### Context

- DB-Scheduler Version: 11.7
- Java Version: Java 11
- Spring Boot (check for Yes): [x]
- Database and Version: PostgreSQL 13.10