DAOS-18690 vos: handle DTX commit under space pressure#18039
Conversation
Ticket title is 'Aurora daos_user: SCM single target ran out of space (min:0 B) and not able to finish GC.'
7f7c502 to 5c04177
|
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18039/3/testReport/
test_dfuse_daos_build_wb failed for DAOS-18813, not related to the patch.
Yes, I discussed this with Liang. The consideration is that PR#17850 is a draft intended for feedback on the idea, and it is not clear when it can be completed. On the other hand, DTX is the key point in the space-pressure cycle. We often hit cases where VOS aggregation is blocked by some non-committed DTX entries, so it cannot merge records to release space, and GC is then further affected. DTX commit logic is in turn blocked, since no space can be released via VOS aggregation and GC. This patch tries to break that bad cycle via DTX internal logic plus the pre-allocation idea from PR#17850. After this patch is done, we will rework PR#17850 to handle the GC-related parts, which will be relatively easy then.
If we cannot normally allocate space to hold the committed DTX table, then release some old DTX entries from the current container to make room for new committed ones. The patch also preallocates some space for TX snapshots. Related logic, such as DTX commit and possibly GC, will switch to emergency mode and use the preallocated buffer under space pressure. Signed-off-by: Fan Yong <fan.yong@hpe.com>
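The emergency-mode fallback described here can be sketched as a tiny allocator model. Everything below is hypothetical and invented for illustration (the arena layout, sizes, and the DER_NOSPACE value are not the DAOS implementation): a normal allocation is tried first, and the preallocated emergency buffer is handed out only when the normal path has no space left.

```c
#include <assert.h>
#include <stddef.h>

#define DER_NOSPACE 1007	/* illustrative error code, not the real value */

/* Toy arena: a "normal" region plus a preallocated emergency buffer. */
struct arena {
	char	normal[64];
	size_t	used;
	char	emerg[32];
	int	emerg_busy;
};

/* Try the normal path first; fall back to the emergency buffer under
 * space pressure, mirroring the commit-time behavior described above. */
static void *
alloc_with_fallback(struct arena *a, size_t sz, int *rc)
{
	*rc = 0;
	if (a->used + sz <= sizeof(a->normal)) {
		void *p = a->normal + a->used;

		a->used += sz;
		return p;
	}
	if (!a->emerg_busy && sz <= sizeof(a->emerg)) {
		a->emerg_busy = 1;	/* emergency mode engaged */
		return a->emerg;
	}
	*rc = -DER_NOSPACE;
	return NULL;
}
```

Once the emergency buffer is released (for example when the enclosing transaction finishes), `emerg_busy` would be cleared and the fallback becomes available again.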
5c04177 to e84a9af
| if (ext_df != NULL && !UMOFF_IS_NULL(ext_df->ped_emerg_buf) && | ||
| behavior == TX_FAILURE_RETURN) { | ||
| rc = umem_tx_set_snapbuf(umm, ext_df->ped_emerg_buf, VOS_SNAPBUF_EMERG); |
Are you sure this buffer won’t be used by more than one ULT thread at the same time?
There will be no CPU yield during a PMEM-based VOS transaction, so nobody else can use ext_df->ped_emerg_buf until the current PMEM TX is committed or aborted. It is harmless if the buffer is used by others after the current PMEM TX is committed, right?
It is harmless if the buffer is used by others after current PMEM TX committed, right?
Yes.
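The no-yield argument above can be illustrated with a minimal ownership model. This sketch is hypothetical (the flag and function names are invented): on a cooperatively scheduled xstream, code that does not yield between acquiring and releasing the buffer can never observe it busy, so a second acquirer would only appear if the no-yield invariant were broken.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative single-owner flag; the real code relies on the absence
 * of CPU yields inside a PMEM transaction rather than on a flag. */
static bool emerg_buf_busy;

static int
emerg_buf_acquire(void)
{
	if (emerg_buf_busy)
		return -1;	/* would mean the no-yield invariant broke */
	emerg_buf_busy = true;
	return 0;
}

static void
emerg_buf_release(void)
{
	emerg_buf_busy = false;
}
```

After release (i.e., after the transaction commits or aborts), the buffer can safely be acquired by another user, which matches the "harmless afterwards" point above.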
Signed-off-by: Fan Yong <fan.yong@hpe.com>
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18039/7/testReport/
test_dfuse_daos_build_wb failed for DAOS-18813, not related to the patch.
| if (rc != 0) | ||
| goto out; | ||
| umem_tx_set_failure_behavior(umm, TX_FAILURE_RETURN); |
From my understanding, the umem_tx_begin() + umem_tx_set_failure_behavior(TX_FAILURE_RETURN) pair appears in all 4 DTX entry points that use vos_dtx_add_ptr(). Combining them into a helper should eliminate the risk of forgetting the second call (which would silently disable the emergency buffer fallback under space pressure).
If I am correct, introducing this new helper could make some sense:
/* Begin a DTX PMEM transaction with TX_FAILURE_RETURN mode, required
 * for the emergency undo-log buffer fallback in vos_dtx_add_ptr(). */
static inline int
vos_dtx_tx_begin(struct umem_instance *umm)
{
	int	rc;

	rc = umem_tx_begin(umm, NULL);
	if (rc == 0)
		umem_tx_set_failure_behavior(umm, TX_FAILURE_RETURN);
	return rc;
}

Then, the 4 following calls would become:
/* vos_dtx_commit() */
rc = vos_dtx_tx_begin(vos_cont2umm(cont));
/* vos_dtx_abort_internal() */
rc = vos_dtx_tx_begin(umm);
/* vos_dtx_set_flags() */
rc = vos_dtx_tx_begin(umm);
/* dtx_blob_aggregate() */
rc = vos_dtx_tx_begin(umm);

Moreover, this also makes the D_ASSERT in vos_dtx_add_ptr() a stronger guarantee: any caller not using vos_dtx_tx_begin() would be immediately caught.
I will refresh the patch for that.
| int rc; | ||
| int i; | ||
| /* |
Fully agree that a full scan of all containers to find the globally oldest blob would be too costly. Do you think a bounded scan with an early exit could be acceptable? Something like the following — capped at MAX_CONTAINERS_TO_SCAN — would keep the cost effectively O(1) while still covering the common case where a viable victim exists nearby:
#define MAX_CONTAINERS_TO_SCAN 4
/* Fallback for vos_dtx_reuse_cmt_blob() when the current container
* has at most one committed blob. Scans at most MAX_CONTAINERS_TO_SCAN
* peer containers on the same pool (all xstream-local, no locking needed).
* Best-effort: may miss victims beyond the scan limit.
*/
int
vos_dtx_steal_cmt_blob(struct vos_container *cont)
{
	struct vos_pool		*pool = cont->vc_pool;
	struct vos_container	*victim;
	int			 scanned = 0;

	pool_for_each_container(pool, victim) {
		if (victim == cont)
			continue;
		if (scanned++ >= MAX_CONTAINERS_TO_SCAN)
			break;
		if (victim->cd_dtx_committed_head != victim->cd_dtx_committed_tail)
			return vos_dtx_reuse_cmt_blob(victim);
	}
	return -DER_NOSPACE;
}

One drawback is that iteration always starts from the list head, so the same containers are checked first on every call (potential fairness issue). A round-robin starting point would improve this, though I am not sure it is worth the added complexity.
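The round-robin idea could look like the following. This is a hypothetical sketch (the cursor and function are invented; the real code would track a position in the pool's container list): a per-xstream cursor rotates the first container examined so the same containers are not always scanned first.

```c
#include <assert.h>

/* Per-xstream cursor; no locking needed since ULTs on one xstream
 * are cooperatively scheduled. */
static unsigned int scan_cursor;

/* Return the index of the container to examine first, rotating by one
 * on every call so scan pressure is spread across containers. */
static unsigned int
next_scan_start(unsigned int ncont)
{
	unsigned int start = scan_cursor % ncont;

	scan_cursor = (scan_cursor + 1) % ncont;
	return start;
}
```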
OK, I will optimize the victim-selection algorithm a bit, though maybe not in the same way as your method.
Nothing mandatory; I just wanted your opinion on this to be sure I have understood properly.
| } | ||
| } | ||
| /* |
From my side, it took me some time to understand why we use PARTIAL_COMMITTED + retry instead of just reverting the partial commit.
If I understand correctly, the remote participants commit before the leader, so by the time the leader's local commit partially fails, the data is already visible on remote nodes. Thus, aborting an already-committed DTX there would corrupt it?
For contributors less familiar with the DTX commit protocol, like me, an expanded comment could be helpful:
/*
* Remote participants committed before the leader (see ordering comment above).
* If the leader's local commit partially fails (e.g., -DER_NOSPACE), reverting
* remote participants is not possible: aborting an already-committed DTX would
* corrupt data visible on those nodes. Instead, mark all entries in this batch
* as PARTIAL_COMMITTED so the next batched commit retries them. Re-committing
* an already-committed DTX is always safe.
*/

This is just a suggestion to improve readability: no issue with keeping it as-is if you think it is not needed.
Currently, there is no way to revert a partial commit, since we do not know whether someone has already read the related partially committed data on the related targets. I will add some comments to make things clearer.
Thanks, I really think it will help newcomers such as me.
| return TX_FAILURE_RETURN; | ||
| default: | ||
| D_ASSERTF(0, "Unknown TX failure behavior %d\n", behavior); | ||
| return -DER_INVAL; |
No return is needed after an unconditional assert.
| return -DER_INVAL; |
Some static analysis tools may warn about a "missing return" or similar in such a case.
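The point about analyzers can be reproduced in a standalone sketch. The constants below are invented stand-ins, not the DAOS definitions: when assert() compiles away under NDEBUG, the default branch would fall off the end of a non-void function without the trailing return, which is exactly what such tools (and -Wreturn-type) warn about.

```c
#include <assert.h>

#define TX_FAILURE_ABORT  0	/* illustrative stand-ins */
#define TX_FAILURE_RETURN 1
#define DER_INVAL	  1003

static int
behavior_to_flag(int behavior)
{
	switch (behavior) {
	case TX_FAILURE_ABORT:
		return 0;
	case TX_FAILURE_RETURN:
		return 1;
	default:
		assert(0 && "Unknown TX failure behavior");
		return -DER_INVAL;	/* unreachable in debug builds, but
					 * required once assert() is compiled
					 * out under NDEBUG */
	}
}
```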
| /* Memory file size for md-on-ssd phase2 pool */ | ||
| uint64_t ped_mem_sz; | ||
| /* emergency buffer for GC */ | ||
| umem_off_t ped_emerg_buf; |
Have you considered adding an ability to upgrade already existing pools or is it unlikely a feature like this would be actually helpful?
| rc1 = dtx_commit_large(cont->sc_hdl, (struct dtx_id *)(din->di_dtx_array.ca_arrays), | ||
| din->di_dtx_array.ca_count, false, NULL); | ||
| /* The count of DTX entries will not exceed DTX_THRESHOLD_COUNT. */ |
| /* The count of DTX entries will not exceed DTX_THRESHOLD_COUNT. */ | |
| D_ASSERT(din->di_dtx_array.ca_count <= DTX_THRESHOLD_COUNT); |
We cannot directly assert on a value that comes from the network. If the sender supplied an invalid value, whether by mistake or intentionally, dtx_commit_large() will handle it.
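The rule above, range-check network input and return an error rather than assert, can be sketched as follows. The DTX_THRESHOLD_COUNT and DER_INVAL values here are illustrative stand-ins, not the real DAOS constants.

```c
#include <assert.h>
#include <stddef.h>

#define DTX_THRESHOLD_COUNT 256		/* illustrative cap */
#define DER_INVAL	    1003	/* illustrative error code */

/* A sender-controlled count must be validated, never asserted: a buggy
 * or malicious peer can put any value on the wire. */
static int
validate_dtx_count(size_t count)
{
	if (count == 0 || count > DTX_THRESHOLD_COUNT)
		return -DER_INVAL;	/* reject gracefully */
	return 0;
}
```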
| /* | ||
| * Space is almost exhausted. In such a case, we must reclaim space so that the current | ||
| * DTX commit can proceed; otherwise, uncommitted DTX entries may block VOS aggregation | ||
| * and prevent further space release. The most direct approach is to reclaim some old | ||
| * blob from the current container's committed DTX table. It may be unfair because other | ||
| * containers could hold older committed DTX entries. However, it may not be worth | ||
| * scanning all pools/containers on the target to find the globally oldest committed DTX | ||
| * blob under space pressure. For now, we select the current container as the victim. | ||
| * | ||
| * This can be optimized later. DAOS-18690. | ||
| */ |
Nitpick. This comment seems to apply to the whole vos_dtx_reuse_cmt_blob() function now. Considering its length, it would be good to move it there.
I will re-implement vos_dtx_extend_cmt_table() and fix all the related issues with comments, code style, and log messages.
| tail = umem_off2ptr(umm, cont_df->cd_dtx_committed_tail); | ||
| D_ASSERT(tail != NULL); | ||
| dbd = umem_off2ptr(umm, head->dbd_next); |
dbd and dbd_off are not related, despite sharing the dbd part of their names, which makes following this function quite confusing to me. Can you please straighten this out?
| rc = vos_dtx_add_ptr(cont->vc_pool, head, DTX_CMT_BLOB_SIZE); | ||
| if (rc != 0) | ||
| goto out; | ||
| /* dbd_next is next to dbd_prev */ | ||
| rc = vos_dtx_add_ptr(cont->vc_pool, &head->dbd_prev, | ||
| sizeof(head->dbd_prev) + sizeof(head->dbd_next)); | ||
| if (rc != 0) | ||
| goto out; |
It looks like these two overlap, and the first one seems a bit excessive. Or am I missing something?
Right, I will fix it.
| if (count > 0) { | ||
| D_ASSERTF(cont->vc_dtx_committed_count >= count, | ||
| "Unexpected committed DTX entries count for " DF_UUID ": %u/%u\n", |
| "Unexpected committed DTX entries count for " DF_UUID ": %u/%u\n", | |
| "Unexpected committed DTX entries count for " DF_UUID ": %"PRIu32"/%"PRIu32"\n", |
| DP_UUID(cont->vc_id), cont->vc_dtx_committed_count, count); | ||
| cont->vc_dtx_committed_count -= count; | ||
| cont->vc_pool->vp_dtx_committed_count -= count; |
I think the counter on the pool level could use the same assert as you wrote for the container level counter.
pool->vp_dtx_committed_count will always be at least cont->vc_dtx_committed_count, so such a check is redundant.
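The invariant behind this reply, that the pool counter is the sum of the per-container counters and so can never be smaller than any one of them, can be modeled like this (hypothetical types and names, not the DAOS structures):

```c
#include <assert.h>

/* Toy model: one pool-level total and two per-container counts. */
struct counters {
	unsigned int pool;
	unsigned int cont_a;
	unsigned int cont_b;
};

/* Commit/release always update a container counter and the pool counter
 * in lockstep, so pool >= cont_a and pool >= cont_b hold at all times. */
static void
commit(struct counters *c, unsigned int *cont, unsigned int n)
{
	*cont += n;
	c->pool += n;
}

static void
release(struct counters *c, unsigned int *cont, unsigned int n)
{
	assert(*cont >= n);	/* container-level check implies pool-level */
	*cont -= n;
	c->pool -= n;
}
```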
| /* Current @head will be reused, move it after @tail, @dbd will be the new head. */ | ||
| dbd->dbd_prev = head->dbd_prev; | ||
| head->dbd_prev = cont_df->cd_dtx_committed_tail; | ||
| tail->dbd_next = cont_df->cd_dtx_committed_head; | ||
| cont_df->cd_dtx_committed_head = head->dbd_next; | ||
| head->dbd_next = UMOFF_NULL; | ||
| cont_df->cd_dtx_committed_tail = dbd_off; |
IMHO, how it stands right now is rather hard to follow. Can we break it into two steps like this:
| /* Current @head will be reused, move it after @tail, @dbd will be the new head. */ | |
| dbd->dbd_prev = head->dbd_prev; | |
| head->dbd_prev = cont_df->cd_dtx_committed_tail; | |
| tail->dbd_next = cont_df->cd_dtx_committed_head; | |
| cont_df->cd_dtx_committed_head = head->dbd_next; | |
| head->dbd_next = UMOFF_NULL; | |
| cont_df->cd_dtx_committed_tail = dbd_off; | |
| /* Move current @head after @tail. */ | |
| head->dbd_next = UMOFF_NULL; | |
| head->dbd_prev = cont_df->cd_dtx_committed_tail; | |
| tail->dbd_next = cont_df->cd_dtx_committed_head; | |
| cont_df->cd_dtx_committed_tail = cont_df->cd_dtx_committed_head; | |
| /* Make @dbd the new head. */ | |
| dbd->dbd_prev = UMOFF_NULL; | |
| cont_df->cd_dtx_committed_head = dbd_off; /* XXX assuming dbd and dbd_off are related */ |
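The two-step rewrite suggested above can be checked with a toy doubly linked list. Plain pointers stand in for the umem_off_t offsets and the names are invented; the point is only that "move head after tail, then promote the second blob to head" yields the intended order.

```c
#include <assert.h>
#include <stddef.h>

struct blob {
	struct blob *prev;
	struct blob *next;
};

struct list {
	struct blob *head;
	struct blob *tail;
};

/* Recycle the oldest blob: move @head behind @tail, then make the
 * second blob the new head.  Requires at least two blobs, as in the
 * code path under discussion where the head is being reused. */
static void
recycle_head(struct list *l)
{
	struct blob *head = l->head;
	struct blob *new_head = head->next;

	assert(new_head != NULL);

	/* Step 1: move current head after tail. */
	head->next = NULL;
	head->prev = l->tail;
	l->tail->next = head;
	l->tail = head;

	/* Step 2: make the second blob the new head. */
	new_head->prev = NULL;
	l->head = new_head;
}
```

For a list a-b-c, the result is b-c-a, with b as the new head and the recycled blob a appended behind the old tail.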
That is a good point. We plan to land a small patch that reserves a buffer for space emergencies in 2.8; then we do not need to consider handling existing pools without such an extension when upgrading from 2.8 to 2.8.x/master. As for pools from 2.6, there are too many pool layout differences; for example, a 2.6 pool does not have the whole
So this patch will be split into two parts; the small part is #18137.
The small part has already landed on master. The other part is #18141. This one will be closed.