Raw file pread from multiple processes #9239

Open
nickva opened this issue Dec 23, 2024 · 0 comments
Labels: enhancement, stalled (waiting for input by the Erlang/OTP team), team:VM (Assigned to OTP team VM)
nickva commented Dec 23, 2024

Is your feature request related to a problem? Please describe.

I would like to make plain pread calls using raw file handles from non-owner processes.

In CouchDB we use a single raw file handle per database file. We open files in append-only + read mode and don't use any extra layers ('read-ahead', 'compression', etc.), just plain raw fds. We then issue either pread or write API calls. Write calls append to the end of the file (appending headers, copy-on-write btree nodes, etc.). Reads are just pread calls at essentially random file offsets. Since all the operations are serialized via a single owner process, a high rate of writes could block reads. It would be beneficial if we could issue pread calls from non-owner processes, so that pread calls wouldn't have to wait for potentially long-running write calls to complete.
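To make the access pattern concrete, here is a minimal sketch in Python on a POSIX system. It is not CouchDB's code: the helper names `append_record` and `read_record` and the payloads are hypothetical, and the offset bookkeeping assumes a single writer.

```python
import os
import tempfile


def append_record(fd: int, payload: bytes) -> int:
    """Append payload to the file and return the offset it was written at.

    Assumes a single writer, so the size observed before the write is the
    offset at which the O_APPEND write lands.
    """
    offset = os.fstat(fd).st_size
    written = os.write(fd, payload)
    if written != len(payload):
        raise OSError("short write")
    return offset


def read_record(fd: int, offset: int, size: int) -> bytes:
    """Positioned read; does not depend on or change the fd's file offset."""
    return os.pread(fd, size, offset)


def demo():
    tmp_fd, path = tempfile.mkstemp()
    os.close(tmp_fd)
    # Reopen in append + read mode, mirroring the mode described above.
    fd = os.open(path, os.O_RDWR | os.O_APPEND)
    try:
        first = append_record(fd, b"header-1")
        second = append_record(fd, b"btree-node")
        return (first, second,
                read_record(fd, first, 8),
                read_record(fd, second, 10))
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

In this pattern only the appends need ordering; the positioned reads carry their own offset, which is why serializing them behind writes is not inherently required.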

It seems, at least at the POSIX file descriptor level, pread calls are thread-safe, and can be used from concurrent threads.

From the pread(2) man page:

The pread() and pwrite() system calls are especially useful in multithreaded applications. They allow multiple threads to perform I/O on the same file descriptor without being affected by changes to the file offset by other threads.
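As an illustration of the behavior the man page describes, the following sketch issues positioned reads on one shared file descriptor from several threads. It uses Python's `os.pread`, which wraps the POSIX call; the record layout and the function name `concurrent_preads` are made up for the example.

```python
import os
import tempfile
import threading

RECORD_SIZE = 8  # hypothetical fixed record size, for illustration only


def concurrent_preads(fd, offsets, size):
    """Read `size` bytes at each offset, one thread per offset, sharing fd."""
    results = {}
    lock = threading.Lock()

    def worker(offset):
        # Each call carries its own offset, so threads do not interfere
        # with each other through the descriptor's shared file position.
        data = os.pread(fd, size, offset)
        with lock:
            results[offset] = data

    threads = [threading.Thread(target=worker, args=(off,)) for off in offsets]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def demo():
    fd, path = tempfile.mkstemp()
    try:
        for i in range(16):
            os.write(fd, b"rec%05d" % i)  # 8 bytes per record
        offsets = [i * RECORD_SIZE for i in range(16)]
        return concurrent_preads(fd, offsets, RECORD_SIZE)
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    for offset, data in sorted(demo().items()):
        print(offset, data)
```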

In Erlang it's currently not possible to do that, and pread operations fail when called from a non-owner process.

Looking through various forums and discussions online, it seemed at first like it would be doable in Erlang as well: https://erlangforums.com/t/share-the-same-file-handle-between-multiple-erlang-processes/3039 but after trying it out on the latest OTP version, it's still not possible.

Describe the solution you'd like

Have the ability to issue pread calls on raw, plain file handles from non-owner processes. This seems to be possible on Linux, and on Windows as well via the "overlapped" option https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-readfile?redirectedfrom=MSDN#syntax, though I'm not entirely sure here as I am not very familiar with the Windows API.

Describe alternatives you've considered

Memory mapping using the emmap library could be an option. That means we'd have to map the whole file to memory and constantly unmap / remap it as it's being appended to. But since the OS file API allows preads from different threads, perhaps it would be possible to bring that feature to Erlang as well.

Also, at first sight, it may seem mmap-ing should be "faster", but that's not necessarily true, at least based on a benchmark someone ran for Rust a few years ago: https://internals.rust-lang.org/t/introduce-write-at-write-all-at-read-at-read-exact-at-on-windows/19649/23

On Linux, the mmap+copy+unmap approach is significantly costlier (~3x) than the pread approach, because of additional hardware-level overhead: Creating and destroying file mappings invalidates the TLB and makes you eat cache misses.
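For reference, the two approaches being compared look roughly like this. The sketch uses Python's `mmap` and `os.pread` on a POSIX system and only shows that both return the same bytes; it makes no performance claim, and the helper names are hypothetical.

```python
import mmap
import os
import tempfile


def read_via_mmap(fd: int, size: int, offset: int) -> bytes:
    """mmap + copy + unmap: create a fresh mapping for every read."""
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = offset - (offset % gran)  # mapping offsets must be aligned
    delta = offset - aligned
    m = mmap.mmap(fd, delta + size, access=mmap.ACCESS_READ, offset=aligned)
    try:
        return m[delta:delta + size]  # slicing copies the bytes out
    finally:
        m.close()  # unmap


def read_via_pread(fd: int, size: int, offset: int) -> bytes:
    """Single positioned read, no mapping to create or tear down."""
    return os.pread(fd, size, offset)


def demo():
    fd, path = tempfile.mkstemp()
    try:
        body = bytes(range(256)) * 64  # 16384 bytes of known content
        os.write(fd, body)
        os.write(fd, b"tail-record")   # simulate a later append
        tail_offset = len(body)
        return (body[5000:5010],
                read_via_mmap(fd, 10, 5000),
                read_via_pread(fd, 10, 5000),
                read_via_mmap(fd, 11, tail_offset),
                read_via_pread(fd, 11, tail_offset))
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

A long-lived mapping would avoid the per-read map and unmap, but it would then have to be remapped whenever the file grows, which is the constant unmap / remap cost mentioned above.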

nickva added a commit to nickva/otp that referenced this issue Dec 26, 2024
At the C API level pread [1] can be called from multiple threads. In Erlang,
however, all the calls must be made from a single owner process.

In a busy concurrent environment that means all operations for a file
descriptor must be serialized, and could pile up in the controller's mailbox
having to wait for potentially slower operations to complete.

To fix it, add the ability to make raw concurrent pread calls from Erlang as
well. File descriptor lifetime is still tied to the owner process, so it gets
closed when owner dies. Other processes are only allowed to make pread calls,
all others still require going through the owner.

Other file layers, like compression and delayed IO, still keep the previous
behavior. They have their own `get_fd_data/1` functions per layer which check
controller ownership. Concurrent preads are not allowed in those layers.

In unix_prim_file.c the seek+read fallback would have required exposing a flag
in Erlang in order to keep the old behavior, since another process could see the
temporarily changed position. However, before adding a new flag I looked at
where pread might not be supported, and it seems most Unix-like OSes since SunOS
5 (Solaris 2.5) and AT&T Sys V Rel4 (so all modern BSDs) have it. So perhaps
it's safe to remove the fallback altogether and simplify the code? As a
precaution I kept a configure check with an early failure and a clear message
about it.

This necessitates updating preloaded beam files so an OTP team member would
have to take over the PR, if the idea is acceptable to start with. I didn't
commit the beam files so any CI tests might not run either?

Issue: erlang#9239

[1] https://www.man7.org/linux/man-pages/man2/pread.2.html
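
The concern about the seek+read fallback in the commit message above can be sketched as follows. This is a generic illustration of the pattern in Python on a POSIX system, not the code in unix_prim_file.c; `seek_read_fallback` and its `observer` hook are hypothetical, with the hook standing in for another process looking at the file position mid-operation.

```python
import os
import tempfile


def seek_read_fallback(fd, size, offset, observer=None):
    """Emulate pread with lseek + read, restoring the position afterwards."""
    saved = os.lseek(fd, 0, os.SEEK_CUR)
    os.lseek(fd, offset, os.SEEK_SET)
    if observer is not None:
        # Anything sharing the descriptor at this point sees the moved offset.
        observer(os.lseek(fd, 0, os.SEEK_CUR))
    try:
        return os.read(fd, size)
    finally:
        os.lseek(fd, saved, os.SEEK_SET)


def demo():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789abcdef")
        start = os.lseek(fd, 0, os.SEEK_CUR)

        seen = []
        via_fallback = seek_read_fallback(fd, 4, 2, seen.append)
        after_fallback = os.lseek(fd, 0, os.SEEK_CUR)

        via_pread = os.pread(fd, 4, 2)  # never touches the file offset
        after_pread = os.lseek(fd, 0, os.SEEK_CUR)

        return start, seen, via_fallback, after_fallback, via_pread, after_pread
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

Both paths return the same data and leave the position where it started, but only the fallback exposes an intermediate position, which is what makes it unsafe once more than one process may read concurrently.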
jhogberg added the team:VM (Assigned to OTP team VM) label Dec 27, 2024
jhogberg self-assigned this Jan 8, 2025
jhogberg added the stalled (waiting for input by the Erlang/OTP team) label Jan 8, 2025