Raw file pread from multiple processes #9239

nickva · 2024-12-23T22:23:50Z

Is your feature request related to a problem? Please describe.

I would like to make plain pread calls using raw file handles from non-owner processes.

In CouchDB we use a single raw file handle per database file. We open files in append-only + read mode and don't use any extra layers ('read-ahead', 'compression', etc.), just plain raw fds. Then we issue either pread or write API calls. Write calls append to the end of the file (appending headers, copy-on-write btree nodes, etc.). Reads are just pread calls at essentially random file offsets. Since all the operations are serialized via a single owner process, a high rate of writes could block reads. It would be beneficial if we could issue pread calls from non-owner processes, so pread calls wouldn't wait on potentially long-running write calls to complete.
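A minimal sketch of the access pattern described above, written in Python as a stand-in for the Erlang file API. The file name, the `append` helper and the payloads are hypothetical, and the offset bookkeeping assumes a single writer, as in the owner-process model.

```python
import os
import tempfile

def append(fd, data):
    """Append data to the file and return the offset where it starts.

    Assumes a single writer, so the size read via fstat is the append offset.
    """
    offset = os.fstat(fd).st_size
    view = memoryview(data)
    while view:
        # os.write may write fewer bytes than requested, so loop until done.
        written = os.write(fd, view)
        view = view[written:]
    return offset

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.db")  # hypothetical file name
    # Append-only + read mode, no buffering layers: a plain raw fd.
    fd = os.open(path, os.O_RDWR | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        header_off = append(fd, b"HEADER-1")
        node_off = append(fd, b"btree-node-A")
        # Reads are positional and never depend on the fd's current offset.
        node = os.pread(fd, len(b"btree-node-A"), node_off)
        header = os.pread(fd, len(b"HEADER-1"), header_off)
    finally:
        os.close(fd)
```

Since every read names its own offset, nothing in the read path depends on state the writer changes, other than the file getting longer.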

It seems, at least at the POSIX file descriptor level, pread calls are thread-safe, and can be used from concurrent threads.

From the pread(2) man page:

The pread() and pwrite() system calls are especially useful in multithreaded applications. They allow multiple threads to perform I/O on the same file descriptor without being affected by changes to the file offset by other threads.
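To illustrate the man page's point, here is a small sketch in Python, where `os.pread` is a thin wrapper over the POSIX call. Several threads read from one shared file descriptor at different offsets. The record size and file layout are made up for the example.

```python
import os
import tempfile
import threading

RECORD = 16  # hypothetical fixed record size, for illustration only

def read_record(fd, index, out):
    # pread takes an absolute offset and does not move the fd's file offset,
    # so threads sharing the fd do not interfere with each other.
    out[index] = os.pread(fd, RECORD, index * RECORD)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")
    with open(path, "wb") as f:
        for i in range(8):
            f.write(bytes([i]) * RECORD)

    fd = os.open(path, os.O_RDONLY)
    try:
        results = {}
        threads = [
            threading.Thread(target=read_record, args=(fd, i, results))
            for i in range(8)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # The shared file offset is still where it was when the fd was opened.
        offset_after = os.lseek(fd, 0, os.SEEK_CUR)
    finally:
        os.close(fd)
```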

In Erlang it's currently not possible to do that, and pread operations fail when called from a non-owner process.

Looking through various forums and discussions online, it seemed at first like it would be doable in Erlang as well: https://erlangforums.com/t/share-the-same-file-handle-between-multiple-erlang-processes/3039 but after trying it out on the latest OTP version, it's still not possible.

Describe the solution you'd like

Have the ability to issue pread calls on raw, plain file handles from non-owner processes. This seems to be possible on Linux, and on Windows as well via the "overlapped" option https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-readfile?redirectedfrom=MSDN#syntax, though I'm not entirely sure here, as I am not very familiar with the Windows API.

Describe alternatives you've considered

Memory mapping using the emmap library could be an option. That means we'd have to map the whole file to memory and constantly unmap / remap it as it's being appended to. But, since the OS file API allows preads from different threads, perhaps it would be possible to bring that feature to Erlang as well.
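A rough sketch of why the mapping has to be redone after each append, using Python's `mmap` module as a stand-in for emmap (whose API is not shown here). The payloads and file name are hypothetical.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.db")  # hypothetical file name
    fd = os.open(path, os.O_RDWR | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        os.write(fd, b"first-node")  # 10 bytes
        # Length 0 maps the whole file as it exists right now.
        mapping = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        size_before = len(mapping)

        os.write(fd, b"second-node")  # 11 more bytes appended
        # The existing mapping does not grow with the file.
        size_stale = len(mapping)

        # To see the appended data, unmap and map again.
        mapping.close()
        mapping = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        tail_via_mmap = mapping[10:21]
        mapping.close()

        # pread needs no remapping: it just reads at the new offset.
        tail_via_pread = os.pread(fd, 11, 10)
    finally:
        os.close(fd)
```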

Also, at first sight, it may seem mmap-ing should be "faster", but that's not necessarily true, at least based on a benchmark someone ran for Rust a few years ago: https://internals.rust-lang.org/t/introduce-write-at-write-all-at-read-at-read-exact-at-on-windows/19649/23

On Linux, the mmap+copy+unmap approach is significantly costlier (~3x) than the pread approach, because of additional hardware-level overhead: Creating and destroying file mappings invalidates the TLB and makes you eat cache misses.


At the C API level pread [1] can be called from multiple threads. In Erlang, however, all the calls must be made from a single owner process. In a busy concurrent environment that means all operations for a file descriptor must be serialized, and they can pile up in the controller's mailbox, waiting for potentially slower operations to complete.

To fix it, add the ability to make raw concurrent pread calls from Erlang as well. File descriptor lifetime is still tied to the owner process, so it gets closed when the owner dies. Other processes are only allowed to make pread calls; all other operations still require going through the owner.

Other file layers, like compression and delayed IO, keep the previous behavior. They have their own `get_fd_data/1` functions per layer which check controller ownership. Concurrent preads are not allowed in those layers.

In unix_prim_file.c the seek+read fallback would have required exposing a flag in Erlang in order to keep the old behavior, since another process could see the temporarily changed position. However, before adding a new flag I looked at where pread might not be supported, and it seems most Unix-like OSes since SunOS 5 (Solaris 2.5) and AT&T Sys V Rel4 (so all modern BSDs) have it. So perhaps it's safe to remove the fallback altogether and simplify the code? As a precaution I kept a configure check with an early failure and a clear message about it.

This necessitates updating preloaded beam files, so an OTP team member would have to take over the PR, if the idea is acceptable to start with. I didn't commit the beam files, so any CI tests might not run either?

Issue: erlang#9239

[1] https://www.man7.org/linux/man-pages/man2/pread.2.html
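The concern about the seek+read fallback can be sketched as follows. This is a generic emulation in Python, not the actual code in unix_prim_file.c. The `seek_read` helper and its `observe` hook are hypothetical, and the hook stands in for another process looking at the descriptor in the middle of the emulated read.

```python
import os
import tempfile

def seek_read(fd, size, offset, observe=None):
    """Emulate pread with lseek + read, restoring the position afterwards."""
    saved = os.lseek(fd, 0, os.SEEK_CUR)
    os.lseek(fd, offset, os.SEEK_SET)
    if observe is not None:
        # Stand-in for "another user of the fd checks the position right now".
        observe(os.lseek(fd, 0, os.SEEK_CUR))
    data = os.read(fd, size)
    os.lseek(fd, saved, os.SEEK_SET)
    return data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(b"0123456789")

    fd = os.open(path, os.O_RDONLY)
    try:
        seen = []
        via_seek = seek_read(fd, 3, 5, observe=seen.append)
        # A real pread returns the same bytes without ever moving the offset.
        via_pread = os.pread(fd, 3, 5)
        position_after_pread = os.lseek(fd, 0, os.SEEK_CUR)
    finally:
        os.close(fd)
```

Both calls return the same bytes, but only the emulation exposes a window in which the shared position is changed. That is harmless with a single owner and unsafe once several callers share the descriptor.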


nickva added the enhancement label Dec 23, 2024

nickva mentioned this issue Dec 26, 2024

Allow preads from non-owner processes #9250

Closed

jhogberg added the team:VM Assigned to OTP team VM label Dec 27, 2024

jhogberg self-assigned this Jan 8, 2025

jhogberg added the stalled waiting for input by the Erlang/OTP team label Jan 8, 2025
