Lock order violation when calling enif_whereis_pid in a resource down function (ErlNifResourceDown) #8983

Open
qzhuyan opened this issue Oct 24, 2024 · 2 comments · Fixed by #9008
Labels: bug (Issue is reported as a bug), team:VM (Assigned to OTP team VM)

Comments


qzhuyan commented Oct 24, 2024

Describe the bug

A debug beam (emu type debug) aborts with a "Lock order violation" when enif_whereis_pid is called from a resource down callback (ErlNifResourceDown).

The pid we are looking up is NOT that of the process which is 'DOWN'.

Lock order violation occurred when locking 'reg_tab:[][rw_mutex]'(none)!
Currently these locks are locked by the scheduler 2 thread:
  'proc_main:<0.563.0>[proclock](beam/erl_process.c:10059)'(none)
Lock order:
  driver_lock:driver_name
  port_lock:port_id
  port_data_lock:address
  reg_tab
  proc_main:pid
  old_code:address
  nif_call_tab
  nodes_monitors
  meta_name_tab:address
  db_tab:address
  db_tab_fix:address
  db_hash_slot:address
  erl_db_catree_base_node
  erl_db_catree_route_node:index
  resource_monitors:address
  driver_list
  dist_entry:address
  proc_msgq:pid
  proc_btm:pid
  dist_entry_links:address
  nif_load
  update_persistent_term_permission
  persistent_term_delete_permission
  code_stage_permission
  code_mod_permission
  purge_state
  proc_status:pid
  proc_trace:pid
  node_table
  dist_table
  sys_tracers
  export_tab
  fun_tab
  environ
  release_literal_areas
  on_halt
  drv_ev_state_grow
  drv_ev_state:address
  safe_hash:address
  state_prealloc
  schdlr_sspnd
  migration_info_update
  run_queue:address
  dirty_run_queue_sleep_list:address
  dirty_gc_info
  dirty_break_point_index
  process_table
  cpu_info
  pollset:address
  block_poll_thread:index
  alcu_init_atoms
  mseg_init_atoms
  mmap_init_atoms
  drv_tsd
  async_enq_mtx
  msacc_list_mutex
  msacc_unmanaged_mutex
  atom_tab
  dist_entry_out_queue:address
  port_sched_lock:port_id
  sys_msg_q
  tracer_mtx
  port_table
  magic_ref_table:address
  pid_ref_table:address
  instr_x
  instr
  dyn_lock_check
  alcu_allocator:index
  mseg
  get_time
  get_corrected_time
  runtime
  pix_lock:address
  sched_stat
  async_init_mtx
  os_monotonic_time
  erts_alloc_hard_debug
  hard_dbg_mseg
  perf
  jit_debug_descriptor
  erts_mmap
  proc_sig_queue_buffer:address

gdb bt:

(gdb) bt full
#0  __pthread_kill_implementation (threadid=281472234541312, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
        tid = 1816913
        ret = 0
        pd = 0xffff5c8dd100
        old_mask = {__val = {281473344907404, 30064771073, 72067, 281470681743367, 187650895853428, 281472234518704, 187650895853480, 
            281473346356472, 5095449628191575040, 281472234519008, 187650894136988, 187650896661455, 187650896661456, 281472234518960, 
            18446743528248704992, 187650896659496}}
        ret = <optimized out>
#1  0x0000ffff9ec0f254 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
No locals.
#2  0x0000ffff9ebca67c in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#3  0x0000ffff9ebb7130 in __GI_abort () at ./stdlib/abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x696d7265705f6567, sa_sigaction = 0x696d7265705f6567}, sa_mask = {__val = {474315584371, 
              18446462598732840960, 18446462598732840960, 5981344612314841139, 9296560754637538184, 5981344612314841139, 9296560754637538184, 
              3435973824, 3435973824, 1740730986315534471, 16737090130532176212, 187650896660904, 187650896659496, 281473346356472, 
              9259542123273814016, 18374403900871475055}}, sa_flags = 2139062143, sa_restorer = 0xffff5c8d7a70}
        sigs = {__val = {32, 18446743528248704992, 187650896661448, 281473346356472, 18446462598732840960, 137438953472, 187650895851160, 
            281472234519120, 281472234519120, 281472234519072, 18446743528248704976, 5095449628191575040, 281472234519120, 281472234519120, 
            281472234519072, 18446743528248704976}}
#4  0x0000aaaae0e310bc in lc_abort () at beam/erl_lock_check.c:661
No locals.
#5  0x0000aaaae0e30ad8 in lock_order_violation (thr=0xffff3c000b70, lck=0xaaaae11bdcf0 <regtab_rwmtx+104>) at beam/erl_lock_check.c:540
No locals.
#6  0x0000aaaae0e32b84 in erts_lc_lock_flg_x (lck=0xaaaae11bdcf0 <regtab_rwmtx+104>, options=2, file=0xaaaae1097310 "beam/register.c", line=70)
    at beam/erl_lock_check.c:1225
        thr = 0xffff3c000b70
        new_ll = 0xffff3c0027c0
        order = 1
        __func__ = "erts_lc_lock_flg_x"
#7  0x0000aaaae0d14eac in erts_rwmtx_rlock_x (rwmtx=0xaaaae11bdc88 <regtab_rwmtx>, file=0xaaaae1097310 "beam/register.c", line=70)
    at beam/erl_threads.h:1991
No locals.
#8  0x0000aaaae0e2aa10 in reg_safe_read_lock (c_p=0x0, c_p_locks=0xffff5c8d7b5c) at beam/register.c:70
        __func__ = "reg_safe_read_lock"
#9  0x0000aaaae0e2b2f0 in erts_whereis_name_to_id (c_p=0x0, name=1205067) at beam/register.c:284
        res = 907
        c_p_locks = 0
        rp = 0xaaaae0e98c00 <call_whereis+60>
        tmpl = {bucket = {next = 0xffff5c8d7bc0, hvalue = 281472234519492}, p = 0xffff5c8d7bc8, pt = 0xffff5c8d7dd0, name = 281472234519472}
        __func__ = "erts_whereis_name_to_id"
#10 0x0000aaaae0e98ca0 in call_whereis (env=0xffff5c8d7dd0, name=1205067) at beam/erl_nif.c:1144
        c_p = 0x0
        res = 281472234519568
        scheduler = 1
        __func__ = "call_whereis"
#11 0x0000aaaae0e98d24 in enif_whereis_pid (env=0xffff5c8d7dd0, name=1205067, pid=0xffff5c8d7c70) at beam/erl_nif.c:1158
        res = 18406424600
#12 0x0000fffee970d89c in tp_snk (env=env@entry=0xffff5c8d7dd0, ctx=ctx@entry=0xfffee98e36d8 "callback", 
    fun=fun@entry=0xfffee98a8178 <__func__.8> "resource_stream_down_callback", tag=tag@entry=0xfffee98a6470 "start", rid=281471688413152, 
    mark=mark@entry=0) at /home/ubuntu/repo/quic/c_src/quicer_tp.c:19
        pid = {pid = 187650896470960}
#13 0x0000fffee96fa7fc in resource_stream_down_callback (env=0xffff5c8d7dd0, ctx=0xaaaaf57c8710, DeadPid=<optimized out>, mon=<optimized out>)
    at /home/ubuntu/repo/quic/c_src/quicer_nif.c:1018
        status = 0
        s_ctx = 0xaaaaf57c8710
        __func__ = "resource_stream_down_callback"
#14 0x0000aaaae0e9e314 in erts_fire_nif_monitor (tmon=0xaaaaf5761610) at beam/erl_nif.c:2984
        resource = 0xaaaaf57c86f8
        mdp = 0xaaaaf57615e8
        omon = 0xaaaaf57615e8
        bin = 0xaaaaf57c86c8
        msg_env = {env = {mod_nif = 0xaaaaf5869ec8, proc = 0xffff5c8d7e18, hp = 0xaaaae11c1f08 <phony_heap>, hp_end = 0xaaaae11c1f08 <phony_heap>, 
            heap_frag = 0x0, tmp_obj_list = 0x0, exception_thrown = 0, tracee = 0x0, exiting = 1552797952, dbg_disable_assert_in_env = 0}, 
          phony_proc = {common = {id = 18446744073709551603, refc = {atmc = {value = 0}, sint = 0}, tracer = 0, trace_flags = 0, timer = {
                value = 0}, u = {alive = {started_interval = 0, reg = 0x0, links = 0x0, lt_monitors = 0x0, monitors = 0x0}, release = {later = 0, 
                  func = 0x0, data = 0x0, next = 0x0}}}, htop = 0xaaaae11c1f08 <phony_heap>, stop = 0xaaaae11c1f20 <phony_heap+24>, fcalls = 0, 
            freason = 0, fvalue = 0, heap = 0xaaaae11c1f08 <phony_heap>, hend = 0xaaaae11c1f20 <phony_heap+24>, abandoned_heap = 0x0, heap_sz = 0, 
            min_heap_size = 0, min_vheap_size = 0, max_heap_size = 0, arity = 0, arg_reg = 0x0, max_arg_reg = 0, def_arg_reg = {0, 0, 0, 0, 0, 0}, 
            i = 0x0, catches = 0, rcount = 0, schedule_count = 0, reds = 0, flags = 0, group_leader = 0, ftrace = 0, next = 0x0, uniq = 0, sig_qs = {
              first = 0x0, last = 0x0, save = 0x0, cont = 0x0, cont_last = 0x0, nmsigs = {next = 0x0, last = 0x0}, recv_mrk_blk = 0x0, len = 0, 
              flags = 0}, bif_timers = 0x0, dictionary = 0x0, seq_trace_clock = 0, seq_trace_lastcnt = 0, seq_trace_token = 0, u = {real_proc = 0x0, 
              terminate = 0x0, initial = {module = 0, function = 0, arity = 0}}, current = 0x0, parent = 0, static_flags = 0, high_water = 0x0, 
            old_hend = 0x0, old_htop = 0x0, old_heap = 0x0, gen_gcs = 0, max_gen_gcs = 0, off_heap = {first = 0x0, overhead = 0}, wrt_bins = 0x0, 
            mbuf = 0x0, live_hf_end = 0x0, msg_frag = 0x0, mbuf_sz = 0, psd = {value = 0}, bin_vheap_sz = 0, bin_old_vheap_sz = 0, 
--Type <RET> for more, q to quit, c to continue without paging--
            bin_old_vheap = 0, sys_task_qs = 0x0, dirty_sys_tasks = 0x0, state = {value = 0}, xstate = {value = 0}, sig_inq_contention_counter = 0, 
            sig_inq = {first = 0x0, last = 0x0, len = 0, nmsigs = {next = 0x0, last = 0x0}}, sig_inq_buffers = {value = 0}, trace_msg_q = 0x0, 
            lock = {flags = {value = 0}, queue = {0x0, 0x0, 0x0, 0x0, 0x0}, locked = {{value = 0}, {value = 0}, {value = 0}, {value = 0}, {
                  value = 0}}}, scheduler_data = 0x0, run_queue = {value = 0}, last_htop = 0x0, last_mbuf = 0x0, heap_hfrag = 0x0, 
            last_old_htop = 0x0, debug_reds_in = 0}}
        nif_pid = {pid = 72067}
        nif_monitor = {data = "\220\000\000\000\000\000\000\000\377\377\377\377'h\001\000\002\000\240G\310Q\347B0\177\215\\\377\377\000"}
        rmp = 0xaaaaf57c87e0
        mrefc = 1
        brefc = 5
        active = 1
        is_dying = 0
        __func__ = "erts_fire_nif_monitor"
#15 0x0000aaaae0bd2d88 in erts_proc_exit_handle_monitor (mon=0xaaaaf5761610, vctxt=0xffff5c8d8580, reds=3846) at beam/erl_process.c:13601
        ctxt = 0xffff5c8d8580
        c_p = 0xffff49f49920
        reason = 523
        mdp = 0x0
        res = 1
        __func__ = "erts_proc_exit_handle_monitor"
#16 0x0000aaaae0e5fd0c in ml_dl_list_foreach_delete_yielding (list=0xffff5c8d8570, func=0xaaaae0bd2c04 <erts_proc_exit_handle_monitor>, 
    arg=0xffff5c8d8580, vyspp=0xffff5c8d85c8, reds=3846) at beam/erl_monitor_link.c:517
        next = 0xaaaaf5761610
        first = 0xaaaaf5761610
        ml = 0xaaaaf5761610
        __func__ = "ml_dl_list_foreach_delete_yielding"
#17 0x0000aaaae0e60938 in erts_monitor_list_foreach_delete_yielding (list=0xffff5c8d8570, func=0xaaaae0bd2c04 <erts_proc_exit_handle_monitor>, 
    arg=0xffff5c8d8580, vyspp=0xffff5c8d85c8, limit=3846) at beam/erl_monitor_link.c:811
No locals.
#18 0x0000aaaae0bd4bec in erts_continue_exit_process (p=0xffff49f49920) at beam/erl_process.c:14318
        static_state = {phase = ERTS_CONTINUE_EXIT_LT_MONITORS, links = 0x0, monitors = 0x0, lt_monitors = 0xaaaaf5761610, reason = 523, pectxt = {
            c_p = 0xffff49f49920, reason = 523, dist_links = 0x0, dist_monitors = 0x0, pend_spawn_monitors = 0x0, wait_pend_spawn_monitor = 0x0, 
            dist_state = 59, yield = 0}, dep = 0x0, yield_state = 0x0, block_rla_ref = 0}
        trap_state = 0xffff5c8d8558
        curr_locks = 1
        state = 1092650
        delay_del_proc = 0
        reds = 3846
        yield_allowed = 1
        __func__ = "erts_continue_exit_process"
#19 0x0000aaaae0bd404c in erts_do_exit_process (p=0xffff49f49920, reason=523) at beam/erl_process.c:13987
        __func__ = "erts_do_exit_process"
#20 0x0000ffff560031e0 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

To Reproduce

  1. Define a NIF resource type with a resource down callback function (ErlNifResourceDown).
  2. Start a process and create a resource monitor on that process.
  3. Kill the process.
  4. In the resource down callback, call enif_whereis_pid,
    int enif_whereis_pid(ErlNifEnv *caller_env,
           ERL_NIF_TERM name, ErlNifPid *pid);

    where caller_env is the env argument passed to the down callback, or NULL.

If this is unexpected behavior, I could write a minimal NIF code to reproduce it.
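A rough sketch of such a minimal NIF could look like the following (the module name repro_nif, the function monitor_pid/1, and the registered name trace_collector are all hypothetical; only documented erl_nif calls are used, so treat this as an illustration rather than a tested reproducer):

/* Hypothetical minimal repro: a resource monitors a process, and the
 * resource down callback calls enif_whereis_pid for an unrelated
 * registered process. On a debug emulator this should hit the lock
 * order violation shown above. */
#include <erl_nif.h>

static ErlNifResourceType *res_type;

/* ErlNifResourceDown callback: runs while the VM is tearing down the
 * monitored process (proc_main lock held) and then tries to take reg_tab. */
static void res_down(ErlNifEnv *env, void *obj, ErlNifPid *dead_pid, ErlNifMonitor *mon)
{
    ErlNifPid collector;
    if (enif_whereis_pid(env, enif_make_atom(env, "trace_collector"), &collector))
        enif_send(env, &collector, NULL, enif_make_atom(env, "resource_down"));
}

/* monitor_pid/1: allocate a resource and let it monitor the given pid. */
static ERL_NIF_TERM monitor_pid(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[])
{
    ErlNifPid target;
    void *res;
    ERL_NIF_TERM term;

    if (!enif_get_local_pid(env, argv[0], &target))
        return enif_make_badarg(env);

    res = enif_alloc_resource(res_type, 1);
    if (enif_monitor_process(env, res, &target, NULL) != 0) {
        enif_release_resource(res);
        return enif_make_badarg(env);
    }
    term = enif_make_resource(env, res);
    enif_release_resource(res);
    return term;
}

static int load(ErlNifEnv *env, void **priv, ERL_NIF_TERM info)
{
    ErlNifResourceTypeInit init = {.dtor = NULL, .down = res_down};
    res_type = enif_open_resource_type_x(env, "repro_res", &init,
                                         ERL_NIF_RT_CREATE, NULL);
    return res_type == NULL ? -1 : 0;
}

static ErlNifFunc nif_funcs[] = {
    {"monitor_pid", 1, monitor_pid, 0}
};

ERL_NIF_INIT(repro_nif, nif_funcs, load, NULL, NULL, NULL)

On the Erlang side the idea would be: register some process as trace_collector, call repro_nif:monitor_pid(Pid) for another process, then kill that process so the down callback fires.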

Expected behavior
The emulator should not abort execution.

Affected versions
Found in OTP 26.2.5.2

Additional context

  1. In the resource down callback, we look up the pid of a tracing event collection process and then send it a message, for testing/tracing/debugging purposes. As a workaround we now write that ErlNifPid to a global variable instead, as in emqx/quic#315 (see the sketch after this list).
  2. We don't get any error with other emu types such as opt and asan.
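A rough sketch of that workaround (the names cached_collector_pid and cache_collector_pid/1 are hypothetical, and no synchronization around the global is shown):

#include <erl_nif.h>

/* Workaround sketch: resolve the collector pid once from an ordinary NIF
 * call, where enif_whereis_pid is safe, and cache it in a global variable.
 * The resource down callback then only uses the cached pid and never
 * touches the registered-name table. */
static ErlNifPid cached_collector_pid;
static int have_collector_pid = 0;

/* cache_collector_pid/1: called from normal NIF context with the registered
 * name (an atom) of the collector process. */
static ERL_NIF_TERM cache_collector_pid(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[])
{
    if (!enif_whereis_pid(env, argv[0], &cached_collector_pid))
        return enif_make_badarg(env);
    have_collector_pid = 1;
    return enif_make_atom(env, "ok");
}

/* Resource down callback: send to the cached pid instead of looking it up. */
static void res_down(ErlNifEnv *env, void *obj, ErlNifPid *dead_pid, ErlNifMonitor *mon)
{
    if (have_collector_pid)
        enif_send(env, &cached_collector_pid, NULL,
                  enif_make_atom(env, "resource_down"));
}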
qzhuyan added the bug label Oct 24, 2024
garazdawi added the team:VM label Oct 25, 2024

sverker commented Oct 30, 2024

Thanks for the good report. Proposed fix in #9008.


sverker commented Oct 30, 2024

We don't get any error with other emu type such as opt and asan

The lock checker is only enabled in a debug build or if OTP is configured with --enable-lock-checking.
The bug is still there in other builds though, which could (maybe?) lead to a deadlock if unlucky.

garazdawi pushed a commit to garazdawi/otp that referenced this issue Nov 13, 2024
…/OTP-19330' into maint-25

* sverker/erts/nif-monitor-down-lock-conflict/erlangGH-8983/OTP-19330:
  erts: Fix potential lock order conflict in NIF monitor down callbacks
IngelaAndin pushed a commit that referenced this issue Dec 5, 2024
…-19330' into maint-26

* sverker/erts/nif-monitor-down-lock-conflict/GH-8983/OTP-19330:
  erts: Fix potential lock order conflict in NIF monitor down callbacks
IngelaAndin pushed a commit that referenced this issue Dec 5, 2024
…-19330' into maint-25

* sverker/erts/nif-monitor-down-lock-conflict/GH-8983/OTP-19330:
  erts: Fix potential lock order conflict in NIF monitor down callbacks