
Introduce a temporary directory for deleting layer files outside the lock #2325


Open · wants to merge 2 commits into base: main

Conversation

@Honny1 (Member) commented May 5, 2025

This PR implements a temporary directory and integrates it into the layer-deletion path. The temporary directory is used to delete files outside the global storage lock, reducing the time the lock is held. Instead of deleting files in place, files are moved into the temporary directory by renaming them; renaming a file is much faster than deleting it from disk. After all files have been moved into the temporary directory and their references have been removed, the global storage lock is released and the temporary directory is deleted.

Fixes: https://issues.redhat.com/browse/RHEL-36087
Fixes: https://issues.redhat.com/browse/RUN-3104
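For orientation, a minimal sketch of the mechanism described above; the function name, path layout, and lock handling here are illustrative assumptions, not the PR's actual API:

// Illustrative sketch only; identifiers and layout are assumptions, not the PR's API.
// Assumes imports: "os", "path/filepath".
func deferredDelete(unlockStore func(), layerDir, tempRoot string) error {
	tmp, err := os.MkdirTemp(tempRoot, "temp-dir-")
	if err != nil {
		return err
	}
	// Renaming is a cheap metadata operation, so the global lock is held only briefly.
	if err := os.Rename(layerDir, filepath.Join(tmp, filepath.Base(layerDir))); err != nil {
		return err
	}
	unlockStore() // release the global storage lock before the slow recursive delete
	return os.RemoveAll(tmp)
}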

openshift-ci bot (Contributor) commented May 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Honny1
Once this PR has been reviewed and has the lgtm label, please assign rhatdan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Honny1 Honny1 force-pushed the prototype-temp branch 5 times, most recently from ca5a283 to cb73550 on May 5, 2025 14:39
@cgwalters (Contributor)

Design doc: https://docs.google.com/document/d/196S3MWJkuqoKDhkvtnbT6z9LFZzwibIFmcEM9EyJu8Q/edit?usp=sharing

I think we should avoid using private/company-specific Google docs to design public FOSS in general especially as part of the movement of projects in this space to the CNCF.

@Honny1 (Member, Author) commented May 5, 2025

@cgwalters ok

@Honny1 Honny1 force-pushed the prototype-temp branch 2 times, most recently from a1810a4 to d4352cd on May 5, 2025 15:55
@mtrmac mtrmac (Collaborator) left a comment


A very preliminary first look — feel free to ignore if feedback is not yet wanted.

At this point I’m mostly focused on the TempDir mechanism’s design/API; callers are interesting to the extent that they represent use cases that need to be handled, but

// Required global storage lock must be held before calling this function.
func recover(rootDir string) error {
	tempDirPattern := filepath.Join(rootDir, "temp-dir-*")
	tempDirs, err := filepath.Glob(tempDirPattern)
Collaborator:

As is, this would add a .Glob to every RW lock acquisition. It would be interesting to try to optimize that.

Member:

let's just use a parent directory for the temp directories so we don't have to look at their name.

instead of: /foo/run/vfs-layers/temp-dir-28400 use something like /foo/run/vfs-layers/temp-dirs/28400 so we avoid the glob
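A hedged sketch of that suggestion (directory and function names are illustrative); it shows only how a dedicated parent directory replaces the glob, using os.ReadDir as suggested later in this thread, and ignores the per-instance locking discussed below:

// Sketch, assuming imports "os" and "path/filepath"; real recovery must also
// honor the per-instance locks before removing anything.
func recoverStale(rootDir string) error {
	parent := filepath.Join(rootDir, "temp-dirs")
	entries, err := os.ReadDir(parent)
	if err != nil {
		if os.IsNotExist(err) {
			return nil // nothing was ever deferred, nothing to recover
		}
		return err
	}
	for _, e := range entries {
		if err := os.RemoveAll(filepath.Join(parent, e.Name())); err != nil {
			return err
		}
	}
	return nil
}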

Collaborator:

(I’m wondering about even more optimization — we could have a “might have a staging directory” bool in layers.json, and use the layerStore.lockfile.RecordWrite notification mechanism to make all of this ~zero-cost to users of layerStore.startReading. But, one step at a time.)

@@ -241,6 +245,9 @@ func (d *Driver) dir(id string) string {

// Remove deletes the content from the directory for a given id.
func (d *Driver) Remove(id string) error {
	if d.tempDirectory != nil {
		return d.tempDirectory.Add(d.dir(id), time.Now().Format("20060102-150405"))
Collaborator:

d.dir might be in d.imageStore or d.additionalHomes, potentially on different filesystems; a Rename is not guaranteed to work.

(Applies also in other places, and it’s almost certainly out of scope of the prototype.)
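One possible mitigation, sketched here as an editorial assumption rather than something proposed in the PR: detect the cross-filesystem case (EXDEV) and fall back to deleting in place under the lock.

// Sketch; assumes imports "errors", "os", "syscall".
func moveOrRemove(src, dst string) error {
	err := os.Rename(src, dst)
	if err == nil {
		return nil
	}
	if errors.Is(err, syscall.EXDEV) {
		// src and dst are on different filesystems, so renaming cannot work;
		// fall back to removing the tree directly (i.e., under the lock).
		return os.RemoveAll(src)
	}
	return err
}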

@Honny1 Honny1 force-pushed the prototype-temp branch 7 times, most recently from cc142b0 to ce89bc4 on May 13, 2025 15:02
@Honny1 (Member, Author) commented May 13, 2025

@mtrmac, @giuseppe PTAL, I've updated the prototype.

@Honny1 Honny1 requested review from giuseppe and mtrmac May 13, 2025 15:05
@mtrmac mtrmac (Collaborator) left a comment

Thanks! Good progress, TempDir now looks safe, and the documentation is great.

For the Podman service and CRI-O, we do need layer deletion to happen very soon, not just at store shutdown. And, for those, we might also want the recovery to happen much more frequently (e.g. if someone runs podman rmi as a one-time operation on an OpenShift node, that crashes, and the only other c/storage user is a long-running CRI-O process).

That might, OTOH, make the serialization implied by tempDirGlobalLock more frequent = more costly and less desirable, again. I don’t think we absolutely need that lock?

  • Split instance state and TempDir state. Then there will be no need to do any locking in TempDir itself, for in-process purposes, it would just contain read-only fields. And we can make the instance single-goroutine-owned = again, no need for in-process locks.
  • Move instanceLock out of the instance directory, and change lifetimes so that the lock is created before, and deleted after, the instance directory. Then I think we don’t need tempDirGlobalLock for cross-process purposes, either (? apart from coordinating recovery).

store.go Outdated
@@ -3710,6 +3721,9 @@ func (s *store) Shutdown(force bool) ([]string, error) {
if err == nil {
err = s.graphDriver.Cleanup()
}
if err == nil {
err = s.tempDirectory.Cleanup()
Collaborator:

Not essential short-term, when looking only at the locking / implementation design:

In all instances: Cleanup would now happen only at process shutdown?

I don’t think that works for the long-running Podman service; we could accumulate a potentially unbounded amount of pending deletes, and it would be impossible to delete unused images on a running server (like CRI-O is doing regularly).

Anyway, I think ideally the deletion of a layer should happen before successful deletion is reported to the caller (unless the caller specifically asks for a background one?), so that the layer is truly deleted when we report success.

@@ -2027,7 +2042,7 @@ func (d *Driver) ListLayers() ([]string, error) {
for _, entry := range entries {
id := entry.Name()
switch id {
case linkDir, stagingDir, quota.BackingFsBlockDeviceLink, mountProgramFlagFile:
case linkDir, stagingDir, tempDir, quota.BackingFsBlockDeviceLink, mountProgramFlagFile:
Collaborator:

Uh… do we need to worry about forward compatibility, i.e. an older version of this code running and unaware of tempDir?

Member (Author):

I'm not sure, but I moved it to stagingDir. But I'm not sure how it is with recreateSymlinks(). I think it should ignore stagingDir.

Collaborator:

I'm not sure, but I moved it to stagingDir.

That might make things worse: it would interfere with pruneStagingDirectories.


But I'm not sure how it is with recreateSymlinks(). I think it should ignore stagingDir.

Yes. It probably works well enough now, because $stagingDir/link doesn’t exist, so the directory is ignored - but it seems attractive to clean this up, going forward.

The very last resort which always works would be to leave d.home for layer directories, and have the caller of the driver provide a separate d.metadataHome where we have no prior commitments.

But, maybe, the answer is that we can leave tempDir in d.home, as long as $tempDir/link is never created?? And should we now at least reserve a new path prefix, e.g. defining that ${d.home}/meta* is never a layer directory?

Or, alternatively, we could define a storage version number, and refuse to work with future versions. That would be drastic but clearly correct, OTOH we would be signing up for very precisely managing this to keep CRI-O and Podman in sync (and for not forgetting to update the version number…)

All of this forward-compatibility design seems valuable, but also somewhat out of scope of this “unlocked deletes” effort — or at least clearly out of scope of this prototype.

Cc: @giuseppe

tempDirGlobalLock: tempDirGlobalLock,
}

if err := td.recoverStaleDirs(rootDir); err != nil {
Collaborator:

As is, this would happen only once per process start. That might be too little?! That’s fairly easy to fix.

Member:

In what cases do you think it would be better to check it multiple times? Another process crashing while the current one runs? Would it matter except for long-running servers (e.g. podman system service)?

Collaborator:

Yes, long-running servers. Consider CRI-O, if an admin manually runs podman rmi on the node, and the rmi is interrupted.

I was defaulting to thinking that doing this on every store lock/unlock, just like we currently do for layers’ incompleteFlag, makes sense. On second thought, it probably wouldn’t be too bad to only do that (within a process) every 10 seconds or so, and on shutdown; then efficiency would not matter that much.


Also… at least Skopeo never calls store.Shutdown; the store’s documentation doesn’t say it’s required (and the operation documentation reads scary enough that I now spent some time checking that it actually happens, outside of a podman system reset).

@cgwalters (Contributor)

I only tangentially skimmed this PR but just in case of interest, from the very start ostree has implemented a vaguely similar scheme in https://github.com/ostreedev/ostree/blob/7a32b86ee4c0232bfde97a3cd87c4a82b6e73218/src/libostree/ostree-repo.c#L6292
An important concern for ostree is keeping track of which files have been fsync()'d and to allow resuming.

Comment on lines 62 to 76
Example structure:

rootDir/
  .global-tmp-lock          (tempDirGlobalLock)
  temp-dir-ABC/
    .lock                   (instanceLock for temp-dir-ABC)
    file1
    file3
  temp-dir-XYZ/
    .lock                   (instanceLock for temp-dir-XYZ)
    file2
*/
Member:

#2325 (comment) still holds.

This structure is subject to a race condition when the .lock file is deleted before the other files in the directory.

Suppose process A holds the lock while it runs rm -rf temp-dir-XYZ. What should a new process B do when it attempts to lock the directory but doesn't find the lock file? (1) Assume the directory is being deleted (process A could have crashed), or (2) delete it as well?

I think it is safer to have:

temp-dir-XYZ/data/
temp-dir-XYZ/.lock

First remove temp-dir-XYZ/data/, and if that succeeded, proceed to remove temp-dir-XYZ.

When process B tries to lock the directory, it will find the lock file as long as data/ exists; if it doesn't find the lock, it can do rmdir("temp-dir-XYZ"), because that can only happen if process A crashed after removing the lock file but before removing the directory (or after creating the directory but before the lock file was created).
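A sketch of that recovery rule with illustrative names; lock acquisition is elided, so this is a shape illustration rather than a complete implementation:

// Sketch; assumes imports "os" and "path/filepath".
// Invariant assumed from the layout above: data/ only exists while .lock exists.
func recoverOne(dir string) error {
	lockPath := filepath.Join(dir, ".lock")
	if _, err := os.Stat(lockPath); os.IsNotExist(err) {
		// The owner crashed either before creating .lock or after removing it,
		// so the directory should already be empty; rmdir finishes the job.
		return os.Remove(dir)
	}
	// .lock exists: try to lock it (elided); if another process still holds it, skip.
	// On success: remove data/, then .lock, then the directory itself.
	if err := os.RemoveAll(filepath.Join(dir, "data")); err != nil {
		return err
	}
	if err := os.Remove(lockPath); err != nil {
		return err
	}
	return os.Remove(dir)
}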

Collaborator:

I don’t think we need that directory at all — $root/${instanceID}.lock could mark ownership of $root/${instanceID}.data.

On creation, allocate the lock first. On deletion, unlock and delete the lock last.

(This could even, hypothetically, allow ….data to be a regular file, without the overhead of any extra directories.)
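A sketch of that ordering under the flat layout; names and permissions are illustrative, and real code would keep an flock() on the lock file between creation and deletion:

// Sketch; assumes imports "os" and "path/filepath".
func createInstance(root, id string) (lockPath, dataPath string, err error) {
	lockPath = filepath.Join(root, id+".lock")
	dataPath = filepath.Join(root, id+".data")
	// Create the lock first, so recovery never sees unowned data.
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if err != nil {
		return "", "", err
	}
	f.Close()
	return lockPath, dataPath, nil
}

func deleteInstance(lockPath, dataPath string) error {
	// Delete the data first…
	if err := os.RemoveAll(dataPath); err != nil {
		return err
	}
	// …then unlock and delete the lock last.
	return os.Remove(lockPath)
}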

Member:

what if the content we are trying to move to the temp-dir has a file/directory named .lock?

@mtrmac mtrmac (Collaborator) May 15, 2025

This structure is subject to a race condition when the .lock file is deleted before the other files in the directory.

I think that part is good enough — things like os.RemoveAll really shouldn’t crash if a parent is removed in the meantime.

The problem with moving the lock inside the instance directory is at creation time, where we have the instance directory without any lock, so we need to protect against cleanup/recovery immediately deleting it. The current implementation does solve that, by holding the global lock — but that also means that the cleanup/recovery needs to obtain the global lock every time just to check whether anything needs to be cleaned up.

Together with #2325 (review) arguing that we should do the cleanups frequently, not just at process shutdown, that implies obtaining an extra lock file on both layerStore.startReading and layerStore.startWriting. I’d rather prefer to avoid that cost. (I’d even like to avoid the Readdir on every start…ing, but that will require more cooperation with layerStore.)

@mtrmac mtrmac (Collaborator) May 15, 2025

what if the content we are trying to move to the temp-dir has a file/directory named .lock?

I think that’s fine?

  • File names under $root are owned by the tempdir package. Callers have no claim to decide names, we would simply not give them the API (and the current code in .Add that bases the file name on the base name of the input can be changed).
  • So, a regular file would be named $root/${instanceID}.data. No ambiguity.
  • If the instance were a larger directory structure (our primary use case now: a layer deletion in progress), $root/${instanceID}.data would be a directory containing arbitrary contents. We would not be looking for locks inside the ….data subdirectory at all.

Comment on lines 100 to 101
td.tempDirGlobalLock.Lock()
defer td.tempDirGlobalLock.Unlock()
Member:

we don't need the lock here since it is released when the function exits, so for the caller it doesn't matter whether the code below runs with the lock or not.

As @mtrmac pointed out, we need to use os.ReadDir, not walk it recursively.


@Honny1 Honny1 force-pushed the prototype-temp branch 3 times, most recently from dbd4cad to 76529ec on May 15, 2025 18:34
@mtrmac mtrmac (Collaborator) left a comment

Ratholes inside ratholes …

If I’m reading things right, using pkg/lockfile.LockFile for staging directories would leak memory? Cc: @giuseppe to keep me honest.

If so, we need a new idea, or a new primitive. I was thinking a new internal/staginglock that supports

  • CreateAndTryLock
  • TryLockExistingPath
  • UnlockAndDelete

without an ever-growing lock map. If so, can we do that in yet another separate PR, fix overlay’s stagingDirBase, and then return to this one?

But maybe there’s some much simpler approach I’m now missing.
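For illustration, a rough sketch of how those three primitives could look, built on flock(2) so there is no growing in-process lock map; the package and every name in it are hypothetical, not an existing API:

// Hypothetical sketch of the internal/staginglock idea; assumes Linux flock(2) semantics.
package staginglock

import (
	"os"
	"syscall"
)

type StagingLock struct {
	file *os.File
}

// CreateAndTryLock creates a new lock file (failing if it already exists)
// and takes a non-blocking exclusive flock on it.
func CreateAndTryLock(path string) (*StagingLock, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, err
	}
	return &StagingLock{file: f}, nil
}

// TryLockExistingPath opens an existing lock file and tries to flock it;
// failure means another live process still owns the staging directory.
func TryLockExistingPath(path string) (*StagingLock, error) {
	f, err := os.OpenFile(path, os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, err
	}
	return &StagingLock{file: f}, nil
}

// UnlockAndDelete removes the lock file, then releases the lock by closing it.
func (l *StagingLock) UnlockAndDelete() error {
	err := os.Remove(l.file.Name())
	l.file.Close()
	return err
}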

if err != nil {
return fmt.Errorf("getting instance lock file %s for new temp dir %s: %w", tempDirLockPath, actualTempDirPath, err)
}

Collaborator:

At this point, Recover can find the lock, lock it, and trigger cleanup.

In that case, currently TryLock fails (… but that’s guaranteed only when Recover never unlocks the lock, which we need to fix, see elsewhere), so we can detect that and we know we have to retry.

@Honny1 (Member, Author) commented May 16, 2025

I've edited RecoverStaleDirs, initInstanceTempDir, and Cleanup so they're less racy, and marked the remaining racy parts. I think those would be solved by internal/staginglock.

@Honny1 Honny1 force-pushed the prototype-temp branch 4 times, most recently from 1b4ca9e to 66e1ef8 on May 20, 2025 14:18
@@ -1332,6 +1334,35 @@ func (d *Driver) Remove(id string) error {
return nil
}

func (d *Driver) DeferredRemove(id string) (tempdir.CleanupTempDirFunc, error) {
t, err := tempdir.NewTempDir(filepath.Join(d.homeDirForImageStore(), stagingDir))
Collaborator:

Nice, CleanupTempDirFunc capturing the directory, and thus allowing the driver to choose a location without coordination with layerStore, is elegant.

Out of scope for now: Something somewhere in the driver still needs to ensure all such directories will run a cleanup eventually / from time to time.

(Note to self: I suspect homeDirForImageStore might not match d.dir, but that doesn’t matter for now.)
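For readers following along, a hedged sketch of that closure pattern; CleanupTempDirFunc is taken loosely from the diff, while the helper and staging path below are illustrative and may not match the final code:

// Sketch; the real driver code differs. Assumes imports "os", "path/filepath".
type CleanupTempDirFunc func() error

func deferredRemoveSketch(layerDir, stagingRoot string) (CleanupTempDirFunc, error) {
	staging, err := os.MkdirTemp(stagingRoot, "deferred-")
	if err != nil {
		return nil, err
	}
	if err := os.Rename(layerDir, filepath.Join(staging, "data")); err != nil {
		return nil, err
	}
	// The closure captures the staging location, so the caller can run the
	// slow RemoveAll after the store lock has been released.
	return func() error { return os.RemoveAll(staging) }, nil
}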

layers.go Outdated
if err := r.deleteInternal(id); err != nil {
return err
cleanFunc, err := r.deleteInternal(id)
rmErr := cleanFunc()
Collaborator:

cleanFunc can be nil, both on errors and some successes.
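For example, a guarded call (illustrative only, not the PR's final code) would look like:

cleanFunc, err := r.deleteInternal(id)
if cleanFunc != nil {
	if rmErr := cleanFunc(); rmErr != nil && err == nil {
		err = rmErr
	}
}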

Comment on lines 257 to 261
if td.tempDirLock == nil {
	if err := td.initInstanceTempDir(); err != nil {
		return fmt.Errorf("initializing instance temp dir failed: %w", err)
	}
}
Collaborator:

(If NewTempDir is called only by code that intends to delete something, this initialization can happen in there, as an ~ordinary constructor. But let’s settle the API shape first.)

@mtrmac mtrmac (Collaborator) left a comment

Overall, the “successful deletion” path now looks pretty good.

I’m not sure about the cleanup locations / timing / locking yet, but, anyway, we need a reliable staging lock as a first step.

layers.go Outdated
@@ -1158,7 +1174,8 @@ func (s *store) newLayerStore(rundir, layerdir, imagedir string, driver drivers.
layersImageDir,
filepath.Join(volatileDir, "volatile-layers.json"),
},
layerdir: layerdir,
layerdir: layerdir,
tempDirRootPath: s.graphRoot,
Collaborator:

Is this correct? Shouldn’t that be a dedicated directory?
Also, it needs to precisely match the directories of data which we are deleting (like tspath/datadir). graphRoot seems to amount to the correct filesystem now, with the current only caller, but tying it more precisely to the layerStore logic would be easier to maintain.

@Honny1 (Member, Author) commented Jun 11, 2025

I have integrated the staging lock into the temporary directory, added tests, and addressed feedback.

@mtrmac mtrmac (Collaborator) left a comment

A very quick first skim: I think the feature set is fine, and the safety is almost certainly fine as well (relying on the previous PR).

I think the recovery timing is not ideal / consistent: after we recover in layerStore.load (assuming, FIXME: double-check, that this happens after ~every time we obtain the RW lock), layerStore.delete[DeferringTheCleanup] should not need to trigger recovery again — we are holding the lock, so there should be nothing new that could possibly be recovered.

So, I’m thinking NewTempDir does not need to trigger a recovery. [And then NewTempDir+initInstanceTempDir could be ~merged because the object is created only immediately before adding items. This is not essential.]

OTOH we don’t currently have a recovery trigger when obtaining a lock in the graph drivers. So that’s inconsistent.

I’m leaning towards thinking that the layerStore model, where recovery is an explicit call, and NewTempDir (imitating a deletion) does not do that, is cleaner — OTOH we would need to trigger the recovery frequently enough (on every RW lock?) and I’m not absolutely sure we can stomach the cost of doing that so often.

@@ -124,6 +125,10 @@ type ProtoDriver interface {
CreateFromTemplate(id, template string, templateIDMappings *idtools.IDMappings, parent string, parentIDMappings *idtools.IDMappings, opts *CreateOpts, readWrite bool) error
// Remove attempts to remove the filesystem layer with this id.
Remove(id string) error
// DeferredRemove is used to remove the filesystem layer with this id
// at a later time, e.g., when the last reference to the layer is removed.
Collaborator:

I don’t think “the last reference” is quite true here? (Or maybe, with bind mounts, it is … but we shouldn’t permit the removal of a layer in use anyway. E.g. layerStore.Delete tries to unmount it first.)

The observable effect of this removal should happen immediately (the layer is no longer usable), but physically deleting the files may be deferred.

layers.go Outdated
if err := r.deleteInternal(id); err != nil {
return err
cleanFunc, err := r.deleteInternal(id)
rmErr := cleanFunc()
Collaborator:

For now, maybe just renaming it to an explicit deleteWhileHoldingLock could be enough.

Here I meant that the current layerStore.Delete (the one that does deferred + immediately calls the cleanups = the slow one) would be renamed to deleteWhileHoldingLock, to emphasize that it’s not ideal and we should migrate away from it.

@Honny1 Honny1 force-pushed the prototype-temp branch 2 times, most recently from cca21d1 to d18bb68 on June 17, 2025 09:17
@Honny1 (Member, Author) commented Jun 17, 2025

I changed NewTempDir so that it doesn't call recovery, and merged it with initInstanceTempDir. I also implemented triggering recovery on every RW lock.

@Honny1 Honny1 requested review from mtrmac and giuseppe June 18, 2025 11:25

d.releaseAdditionalLayerByID(id)

if err := t.Add(dir); err != nil {
Member:

Since there is a lot of code in common, could we refactor the Remove/DeferredRemove code into a new function doRemove(deferred bool) and use the deferred bool to decide between t.Add and system.EnsureRemoveAll?
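A rough shape of that refactor, as a sketch only: the shared bookkeeping is elided and the exact signature is an assumption, not the PR's code.

// Sketch only; not the PR's code.
func (d *Driver) doRemove(id string, t *tempdir.TempDir, deferred bool) error {
	dir := d.dir(id)
	// ... shared bookkeeping: unmounting, releasing additional layers, etc. ...
	if deferred {
		// Move the layer into the temp dir; actual deletion happens later.
		return t.Add(dir)
	}
	// Immediate removal, as Remove does today.
	return system.EnsureRemoveAll(dir)
}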

Member (Author):

I have refactored the Remove and DeferredRemove functions.

@Honny1 Honny1 requested a review from giuseppe June 24, 2025 15:33
@Honny1 (Member, Author) commented Jun 25, 2025

PTAL @mtrmac

@Honny1 Honny1 changed the title [PROTOTYPE] Temporary directory Introduce a temporary directory for deleting layer files outside the lock Jun 26, 2025
@Honny1 Honny1 marked this pull request as ready for review June 26, 2025 16:19
Honny1 added 2 commits June 26, 2025 18:20
This enables deferred deletion of physical files outside of global locks, improving performance and reducing lock contention.

Signed-off-by: Jan Rodák <[email protected]>
@giuseppe giuseppe (Member) left a comment

do you have a PR for Podman to test the new functionality and see how it would work?

In particular, I'd like to see how the deferred removal would work in a cleanup process
