Performance degradation with "lock-and-fetch" #441

Open
@meggarr

Prerequisites

  • I am running the latest version
  • I checked the documentation and found no answer
  • I checked to make sure that this issue has not already been filed

Expected Behavior

The lock-and-fetch method should work with a large set of scheduled tasks.

Current Behavior

We see a severe performance degradation during a task scheduling storm, that is, when about 100k tasks are scheduled within 5 seconds. In that situation the scheduler's lockAndFetch method issues the following SQL:

UPDATE xxx_scheduled_tasks st1 
  SET picked = true, picked_by = 'abc', last_heartbeat = '20000123T012345.678+0900', version = version + 1
WHERE (st1.task_name, st1.task_instance) IN (
        SELECT st2.task_name, st2.task_instance FROM xxx_scheduled_tasks st2 
        WHERE picked = false and execution_time <= '20000123T012345.678+0900' order by execution_time asc LIMIT 20 FOR UPDATE SKIP LOCKED)
RETURNING st1.* ;

In our case, 5 instances run this SQL concurrently. Unfortunately, Postgres always seems to perform "LockRows" before the "Limit". Below is a sample execution plan:

explain  
UPDATE xxx_scheduled_tasks st1 
  SET picked = true, picked_by = 'abc', last_heartbeat = '20000123T012345.678+0900', version = version + 1
WHERE (st1.task_name, st1.task_instance) IN (
        SELECT st2.task_name, st2.task_instance FROM xxx_scheduled_tasks st2 
        WHERE picked = false and execution_time <= '20000123T012345.678+0900' order by execution_time asc LIMIT 20 FOR UPDATE SKIP LOCKED)
RETURNING st1.* ;

                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Update on xxx_scheduled_tasks st1  (cost=248.34..254.38 rows=1 width=217)
   ->  Nested Loop  (cost=248.34..254.38 rows=1 width=217)
         ->  HashAggregate  (cost=144.08..144.09 rows=1 width=94)
               Group Key: "ANY_subquery".task_name, "ANY_subquery".task_instance
               ->  Subquery Scan on "ANY_subquery"  (cost=144.05..144.07 rows=1 width=94)
                     ->  Limit  (cost=144.05..144.06 rows=1 width=49)
                           ->  LockRows  (cost=144.05..144.06 rows=1 width=49)
                                 ->  Sort  (cost=144.05..144.05 rows=1 width=49)
                                       Sort Key: st2.execution_time
                                       ->  Seq Scan on xxx_scheduled_tasks st2  (cost=0.00..144.04 rows=1 width=49)
                                             Filter: ((NOT picked) AND (execution_time <= '2000-01-22 16:23:45.678+00'::timestamp with time zone))
         ->  Bitmap Heap Scan on xxx_scheduled_tasks st1  (cost=104.27..108.28 rows=1 width=117)
               Recheck Cond: ((task_name = "ANY_subquery".task_name) AND (task_instance = "ANY_subquery".task_instance))
               ->  Bitmap Index Scan on xxx_scheduled_tasks_pkey  (cost=0.00..104.27 rows=1 width=0)
                     Index Cond: ((task_name = "ANY_subquery".task_name) AND (task_instance = "ANY_subquery".task_instance))

Looking at the plan, the LockRows step appears to run before the Limit step. Does this mean that when the number of matching rows is large, all of them are locked first, before the "Limit" is applied?
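
One way to check this, offered as a sketch rather than a confirmed diagnosis: plain EXPLAIN prints only planner estimates, while EXPLAIN (ANALYZE, BUFFERS) executes the statement and reports the actual row count of each node. The actual "rows" figure on the LockRows node then shows how many rows were really locked. Because ANALYZE runs the UPDATE for real, the sketch wraps it in a transaction that is rolled back. The table name and literal values are the same placeholders used above.

```sql
-- Run inside a transaction so the UPDATE performed by ANALYZE is undone.
BEGIN;

-- ANALYZE executes the statement and adds actual times and row counts per node;
-- BUFFERS adds shared-buffer hit/read counts.
EXPLAIN (ANALYZE, BUFFERS)
UPDATE xxx_scheduled_tasks st1
  SET picked = true, picked_by = 'abc', last_heartbeat = '20000123T012345.678+0900', version = version + 1
WHERE (st1.task_name, st1.task_instance) IN (
        SELECT st2.task_name, st2.task_instance FROM xxx_scheduled_tasks st2
        WHERE picked = false AND execution_time <= '20000123T012345.678+0900'
        ORDER BY execution_time ASC LIMIT 20 FOR UPDATE SKIP LOCKED)
RETURNING st1.*;

-- Discard the changes made while measuring.
ROLLBACK;
```

Comparing the actual rows on the Seq Scan, Sort, LockRows, and Limit nodes should show where the time goes.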

In our case, we had 130k past-due tasks in the DB, and the query above took 1 hour to complete. If this is a performance limitation of lock-and-fetch, it may be caused by PostgreSQL's query planner rather than by DB Scheduler. Should we document this so that people use it with caution?
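
Since the plan above shows a Seq Scan followed by a Sort, it may also be worth checking whether an index can supply due rows already ordered by execution_time. The following is only a sketch under assumptions: the index name xxx_scheduled_tasks_due_idx is hypothetical, and it has not been verified that this resolves the slowdown in this setup.

```sql
-- List the indexes that currently exist on the table.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'xxx_scheduled_tasks';

-- Refresh planner statistics. The plan above estimates rows=1 for the scan,
-- which may indicate stale statistics after a bulk insert.
ANALYZE xxx_scheduled_tasks;

-- Hypothetical partial index covering only unpicked rows, ordered by execution_time.
-- CONCURRENTLY avoids blocking writes, but cannot be run inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS xxx_scheduled_tasks_due_idx
    ON xxx_scheduled_tasks (execution_time)
    WHERE picked = false;
```

With such an index the planner can, in principle, read rows in execution_time order and stop after the LIMIT instead of sorting every due row. Whether it actually chooses that plan depends on the statistics and the data.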

PS: When we switched back to the "fetch" polling strategy, the problem went away.
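
For reference, a minimal sketch of how the switch can be expressed with the Spring Boot starter. The property names below are taken from the db-scheduler README as far as I recall and should be checked against the version in use. The fraction values are the documented defaults, not tuned recommendations.

```properties
# Assumed property names from the db-scheduler Spring Boot starter; verify for your version.
# "fetch" selects fetch-and-lock-on-execute; "lock-and-fetch" selects the strategy discussed in this issue.
db-scheduler.polling-strategy=fetch
db-scheduler.polling-strategy-lower-limit-fraction-of-threads=0.5
db-scheduler.polling-strategy-upper-limit-fraction-of-threads=3.0
```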


For bug reports

Steps to Reproduce the bug

  1. Run 5 instances of DB Scheduler (with Spring Boot)
  2. Schedule 100k+ tasks within 2 or 3 seconds
  3. Observe the performance of DB Scheduler; the query may take 1 hour to complete.

Context

  • DB-Scheduler Version : 11.7
  • Java Version : Java 11
  • Spring Boot (check for Yes) : [x]
  • Database and Version : PG 13.10

Logs
