Introduce a temporary directory for deleting layer files outside the lock #2325
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Honny1. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
(force-pushed from ca5a283 to cb73550)
I think we should avoid using private/company-specific Google docs to design public FOSS in general, especially as part of the movement of projects in this space to the CNCF.

@cgwalters ok
a1810a4
to
d4352cd
Compare
A very preliminary first look — feel free to ignore if feedback is not yet wanted. At this point I’m mostly focused on the `TempDir` mechanism’s design/API; callers are interesting to the extent that they represent use cases that need to be handled, but
pkg/tempdir/tempdir.go (Outdated)

    // Required global storage lock must be held before calling this function.
    func recover(rootDir string) error {
        tempDirPattern := filepath.Join(rootDir, "temp-dir-*")
        tempDirs, err := filepath.Glob(tempDirPattern)
As is, this would add a `.Glob` to every RW lock acquisition. It would be interesting to try to optimize that.
Let’s just use a parent directory for the temp directories, so we don’t have to look at their names. Instead of /foo/run/vfs-layers/temp-dir-28400, use something like /foo/run/vfs-layers/temp-dirs/28400, so we avoid the glob.
(I’m wondering about even more optimization — we could have a “might have a staging directory” bool in `layers.json`, and use the `layerStore.lockfile.RecordWrite` notification mechanism to make all of this ~zero-cost to users of `layerStore.startReading`. But, one step at a time.)
drivers/vfs/driver.go (Outdated)

    @@ -241,6 +245,9 @@ func (d *Driver) dir(id string) string {

    // Remove deletes the content from the directory for a given id.
    func (d *Driver) Remove(id string) error {
        if d.tempDirectory != nil {
            return d.tempDirectory.Add(d.dir(id), time.Now().Format("20060102-150405"))
`d.dir` might be in `d.imageStore` or `d.additionalHomes`, potentially on different filesystems; a `Rename` is not guaranteed to work.

(Applies also in other places, and it’s almost certainly out of scope of the prototype.)
(force-pushed from cc142b0 to ce89bc4)
Thanks! Good progress, `TempDir` now looks safe, and the documentation is great.

For Podman service, and CRI-O, we do need layer deletion to happen very soon, not just at store shutdown. And, for those, we might also want the recovery to happen much more frequently (e.g. if someone runs `podman rmi` as a one-time operation on an OpenShift node, that crashes, and the only other c/storage user is a long-running CRI-O process).

That might, OTOH, make the serialization implied by `tempDirGlobalLock` more frequent = more costly and less desirable, again. I don’t think we absolutely need that lock?
- Split instance state and `TempDir` state. Then there will be no need to do any locking in `TempDir` itself; for in-process purposes, it would just contain read-only fields. And we can make the instance single-goroutine-owned = again, no need for in-process locks.
- Move `instanceLock` out of the instance directory, and change lifetimes so that the lock is created before, and deleted after, the instance directory. Then I think we don’t need `tempDirGlobalLock` for cross-process purposes, either (? apart from coordinating recovery).
store.go (Outdated)

    @@ -3710,6 +3721,9 @@ func (s *store) Shutdown(force bool) ([]string, error) {
        if err == nil {
            err = s.graphDriver.Cleanup()
        }
        if err == nil {
            err = s.tempDirectory.Cleanup()
Not essential short-term, when looking only at the locking / implementation design:
In all instances: Cleanup would now happen only at process shutdown?
I don’t think that works for the long-running Podman service; we could accumulate a potentially unbounded amount of pending deletes, and it would be impossible to delete unused images on a running server (like CRI-O is doing regularly).
Anyway, I think ideally the deletion of a layer should happen before successful deletion is reported to the caller (unless the caller specifically asks for a background one?), so that the layer is truly deleted when we report success.
drivers/overlay/overlay.go (Outdated)

    @@ -2027,7 +2042,7 @@ func (d *Driver) ListLayers() ([]string, error) {
        for _, entry := range entries {
            id := entry.Name()
            switch id {
    -       case linkDir, stagingDir, quota.BackingFsBlockDeviceLink, mountProgramFlagFile:
    +       case linkDir, stagingDir, tempDir, quota.BackingFsBlockDeviceLink, mountProgramFlagFile:
Uh… do we need to worry about forward compatibility, i.e. an older version of this code running and unaware of `tempDir`?
I'm not sure, but I moved it to `stagingDir`. But I'm not sure how it is with `recreateSymlinks()`. I think it should ignore `stagingDir`.
> I'm not sure, but I moved it to `stagingDir`.

That might make things worse; it would interfere with `pruneStagingDirectories`.

> But I'm not sure how it is with `recreateSymlinks()`. I think it should ignore `stagingDir`.

Yes. It probably works well enough now, because `$stagingDir/link` doesn’t exist, so the directory is ignored - but it seems attractive to clean this up, going forward.
The very last resort which always works would be to leave `d.home` for layer directories, and have the caller of the driver provide a separate `d.metadataHome` where we have no prior commitments.

But, maybe, the answer is that we can leave `tempDir` in `d.home`, as long as `$tempDir/link` is never created?? And should we now at least reserve a new path prefix, e.g. defining that `${d.home}/meta*` is never a layer directory?

Or, alternatively, we could define a storage version number, and refuse to work with future versions. That would be drastic but clearly correct; OTOH we would be signing up for very precisely managing this to keep CRI-O and Podman in sync (and for not forgetting to update the version number…)

All of this forward-compatibility design seems valuable, but also somewhat out of scope of this “unlocked deletes” effort — or at least clearly out of scope of this prototype.

Cc: @giuseppe
pkg/tempdir/tempdir.go (Outdated)

        tempDirGlobalLock: tempDirGlobalLock,
    }

    if err := td.recoverStaleDirs(rootDir); err != nil {
As is, this would happen only once per process start. That might be too little?! That’s fairly easy to fix.
In what cases do you think it would be better to check it multiple times? Another process crashing while the current one runs? Would it matter except for long-running servers (e.g. `podman system service`)?
Yes, long-running servers. Consider CRI-O, if an admin manually runs `podman rmi` on the node, and the `rmi` is interrupted.

I was defaulting to thinking that doing this on every store lock/unlock, just like we currently do for layers’ `incompleteFlag`, makes sense. On second thought, it probably wouldn’t be too bad to only do that (within a process) every 10 seconds or so, and on shutdown; then efficiency would not matter that much.

Also… at least Skopeo never calls `store.Shutdown`; the store’s documentation doesn’t say it’s required (and the operation documentation reads scary enough that I now spent some time checking that it actually happens, outside of a `podman system reset`).
I only tangentially skimmed this PR but, just in case of interest, from the very start ostree has implemented a vaguely similar scheme in https://github.com/ostreedev/ostree/blob/7a32b86ee4c0232bfde97a3cd87c4a82b6e73218/src/libostree/ostree-repo.c#L6292
pkg/tempdir/tempdir.go (Outdated)

    Example structure:

    rootDir/
        .global-tmp-lock (tempDirGlobalLock)
        temp-dir-ABC/
            .lock (instanceLock for temp-dir-ABC)
            file1
            file3
        temp-dir-XYZ/
            .lock (instanceLock for temp-dir-XYZ)
            file2
    */
#2325 (comment) still holds.

This structure is subject to a race condition when the `.lock` file is deleted before the other files in the directory. So you have process A that holds the lock while it runs `rm -rf temp-dir-XYZ`; what should a new process B do when it attempts to lock the directory but doesn't find the lock file? 1: Assume the directory is being deleted (process A could have crashed), or 2: delete it as well?

I think it is safer to have:

    temp-dir-XZY/data/
    temp-dir-XZY/.lock

remove first `temp-dir-XZY/data/`, and if that succeeded, proceed to remove `temp-dir-XZY`.

When process B tries to lock the directory, it will find the lock file as long as `data/` exists, and if it doesn't find the lock then it can do `rmdir("temp-dir-XZY")`, because that could only happen if process A crashed after removing the lock file but before removing the directory (or after creating the directory but before the lock file was created).
I don’t think we need that directory at all — `$root/${instanceID}.lock` could mark ownership of `$root/${instanceID}.data`.

On creation, allocate the lock first. On deletion, unlock and delete the lock last.

(This could even, hypothetically, allow `….data` to be a regular file, without the overhead of any extra directories.)
What if the content we are trying to move to the temp-dir has a file/directory named `.lock`?
> This structure is subject to a race condition when the .lock file is deleted before the other files in the directory.

I think that part is good enough — things like `os.RemoveAll` really shouldn’t crash if a parent is removed in the meantime.

The problem with moving the lock inside the instance directory is at creation time, where we have the instance directory without any lock, so we need to protect against cleanup/recovery immediately deleting it. The current implementation does solve that, by holding the global lock — but that also means that the cleanup/recovery needs to obtain the global lock every time just to check whether anything needs to be cleaned up.

Together with #2325 (review) arguing that we should do the cleanups frequently, not just at process shutdown, that implies obtaining an extra lock file on both `layerStore.startReading` and `layerStore.startWriting`. I’d rather prefer to avoid that cost. (I’d even like to avoid the `Readdir` on every `start…ing`, but that will require more cooperation with `layerStore`.)
> what if the content we are trying to move to the temp-dir has a file/directory named `.lock`?

I think that’s fine?

- File names under `$root` are owned by the `tempdir` package. Callers have no claim to decide names; we would simply not give them the API (and the current code in `.Add` that bases the file name on the base name of the input can be changed).
- So, a regular file would be named `$root/${instanceID}.data`. No ambiguity.
- If the instance were a larger directory structure (our primary use case now: a layer deletion in progress), `$root/${instanceID}.data` would be a directory containing arbitrary contents. We would not be looking for locks inside the `….data` subdirectory at all.
pkg/tempdir/tempdir.go (Outdated)

    td.tempDirGlobalLock.Lock()
    defer td.tempDirGlobalLock.Unlock()
We don't need the lock here, since it is released when the function exits, so for the caller it doesn't matter whether the code below runs with the lock or not.

As @mtrmac pointed out, we need to use `os.ReadDir`, not walk it recursively.
(force-pushed from dbd4cad to 76529ec)
Ratholes inside ratholes …

If I’m reading things right, using `pkg/lockfile.LockFile` for staging directories would leak memory? Cc: @giuseppe to keep me honest.

If so, we need a new idea, or a new primitive. I was thinking a new `internal/staginglock` that supports

- `CreateAndTryLock`
- `TryLockExistingPath`
- `UnlockAndDelete`

without an ever-growing lock map. If so, can we do that in yet another separate PR, fix overlay’s `stagingDirBase`, and then return to this one?

But maybe there’s some much simpler approach I’m now missing.
pkg/tempdir/tempdir.go (Outdated)

    if err != nil {
        return fmt.Errorf("getting instance lock file %s for new temp dir %s: %w", tempDirLockPath, actualTempDirPath, err)
    }
At this point, `Recover` can find the lock, lock it, and trigger cleanup. In that case, currently `TryLock` fails (… but that’s guaranteed only when `Recover` never unlocks the lock, which we need to fix, see elsewhere), so we can detect that and we know we have to retry.
I've edited the
(force-pushed from 1b4ca9e to 66e1ef8)
drivers/overlay/overlay.go (Outdated)

    @@ -1332,6 +1334,35 @@ func (d *Driver) Remove(id string) error {
        return nil
    }

    func (d *Driver) DeferredRemove(id string) (tempdir.CleanupTempDirFunc, error) {
        t, err := tempdir.NewTempDir(filepath.Join(d.homeDirForImageStore(), stagingDir))
Nice, `CleanupTempDirFunc` capturing the directory, and thus allowing the driver to choose a location without coordination with `layerStore`, is elegant.

Out of scope for now: Something somewhere in the driver still needs to ensure all such directories will run a cleanup eventually / from time to time.

(Note to self: I suspect `homeDirForImageStore` might not match `d.dir`, but that doesn’t matter for now.)
layers.go (Outdated)

    -   if err := r.deleteInternal(id); err != nil {
    -       return err
    +   cleanFunc, err := r.deleteInternal(id)
    +   rmErr := cleanFunc()
`cleanFunc` can be `nil`, both on errors and some successes.
pkg/tempdir/tempdir.go (Outdated)

    if td.tempDirLock == nil {
        if err := td.initInstanceTempDir(); err != nil {
            return fmt.Errorf("initializing instance temp dir failed: %w", err)
        }
    }
(If `NewTempDir` is called only by code that intends to delete something, this initialization can happen in there, as an ~ordinary constructor. But let’s settle the API shape first.)
Overall, the “successful deletion” path now looks pretty good.
I’m not sure about the cleanup locations / timing / locking yet, but, anyway, we need a reliable staging lock as a first step.
layers.go (Outdated)

    @@ -1158,7 +1174,8 @@ func (s *store) newLayerStore(rundir, layerdir, imagedir string, driver drivers.
            layersImageDir,
            filepath.Join(volatileDir, "volatile-layers.json"),
        },
    -   layerdir: layerdir,
    +   layerdir:        layerdir,
    +   tempDirRootPath: s.graphRoot,
Is this correct? Shouldn’t that be a dedicated directory?

Also, it needs to precisely match the directories of data which we are deleting (like `tspath`/`datadir`). `graphRoot` seems to amount to the correct filesystem now, with the current only caller, but tying it more precisely to the `layerStore` logic would be easier to maintain.
(force-pushed from 7804c52 to 19b54e0)
I have integrated the staging lock into the temporary directory, added tests, and addressed feedback.
A very quick first skim: I think the feature set is fine, and the safety is almost certainly fine as well (relying on the previous PR).

I think the recovery timing is not ideal / consistent: after we recover in `layerStore.load` (assuming, FIXME: double-check, that this happens after ~every time we obtain the RW lock), `layerStore.delete[DeferringTheCleanup]` should not need to trigger recovery again — we are holding the lock, so there should be nothing new that could possibly be recovered.

So, I’m thinking `NewTempDir` does not need to trigger a recovery. [And then `NewTempDir` + `initInstanceTempDir` could be ~merged, because the object is created only immediately before adding items. This is not essential.]

OTOH we don’t currently have a recovery trigger when obtaining a lock in the graph drivers. So that’s inconsistent.

I’m leaning towards thinking that the `layerStore` model, where recovery is an explicit call, and `NewTempDir` (imitating a deletion) does not do that, is cleaner — OTOH we would need to trigger the recovery frequently enough (on every RW lock?) and I’m not absolutely sure we can stomach the cost of doing that so often.
drivers/driver.go (Outdated)

    @@ -124,6 +125,10 @@ type ProtoDriver interface {
        CreateFromTemplate(id, template string, templateIDMappings *idtools.IDMappings, parent string, parentIDMappings *idtools.IDMappings, opts *CreateOpts, readWrite bool) error
        // Remove attempts to remove the filesystem layer with this id.
        Remove(id string) error
        // DeferredRemove is used to remove the filesystem layer with this id
        // at a later time, e.g., when the last reference to the layer is removed.
I don’t think “the last reference” is quite true here? (Or maybe, with bind mounts, it is … but we shouldn’t permit the removal of a layer in use anyway. E.g. `layerStore.Delete` tries to unmount it first.)

The observable effect of this removal should happen immediately (the layer is no longer usable), but physically deleting the files may be deferred.
layers.go (Outdated)

    -   if err := r.deleteInternal(id); err != nil {
    -       return err
    +   cleanFunc, err := r.deleteInternal(id)
    +   rmErr := cleanFunc()
> For now, maybe just renaming it to an explicit `deleteWhileHoldingLock` could be enough.

Here I meant that the current `layerStore.Delete` (the one that does deferred + immediately calls the cleanups = the slow one) would be renamed to `deleteWhileHoldingLock`, to emphasize that it’s not ideal and we should migrate away from it.
(force-pushed from cca21d1 to d18bb68)
I changed
drivers/overlay/overlay.go (Outdated)

    d.releaseAdditionalLayerByID(id)

    if err := t.Add(dir); err != nil {
Since there is a lot of code in common, could we refactor the Remove/DeferredRemove code into a new function `doRemove(deferred bool)`, and use the deferred bool to decide whether to use `t.Add` or `system.EnsureRemoveAll`?
I have refactored the Remove and DeferredRemove functions.

PTAL @mtrmac
Fixes: https://issues.redhat.com/browse/RUN-3104 Signed-off-by: Jan Rodák <[email protected]>
This enables deferred deletion of physical files outside of global locks, improving performance and reducing lock contention. Signed-off-by: Jan Rodák <[email protected]>
do you have a PR for Podman to test the new functionality and see how it would work?
In particular, I'd like to see how the deferred removal would work in a cleanup process
This PR contains the implementation of the temporary directory and its integration. The temporary directory is used to delete files outside of the global storage lock, to reduce the time the global lock is held. Instead of deleting files directly, files are moved to the temporary directory by renaming them; renaming a file is much faster than deleting it from disk. After all files have been moved to the temporary directory, the file references have been removed, and the global storage lock has been unlocked, the temporary directory itself is removed.
Fixes: https://issues.redhat.com/browse/RHEL-36087
Fixes: https://issues.redhat.com/browse/RUN-3104