
DAOS-18690 vos: handle DTX commit under space pressure#18039

Closed
Nasf-Fan wants to merge 5 commits into master from Nasf-Fan/DAOS-18690_2

Conversation

@Nasf-Fan
Contributor

If we cannot normally allocate space to hold the committed DTX table, then release some old DTX entries from the current container to make room for newly committed ones.

The patch also preallocates some space for TX snapshots. Related logic, such as DTX commit and maybe GC, will switch to emergency mode and use the preallocated buffer in case of space pressure.
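The fallback idea in the description can be sketched in plain C. All names here (`emerg_pool`, `emerg_alloc`) are illustrative stand-ins, not the actual DAOS APIs: try the normal allocator first, and dip into a buffer preallocated at pool-open time only when regular allocation fails under space pressure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch: a small buffer reserved up front so critical
 * operations (e.g. DTX commit) can still make progress at ENOSPC. */
struct emerg_pool {
	void  *ep_buf;    /* preallocated at pool open time */
	size_t ep_size;
	int    ep_in_use; /* no CPU yield inside a TX, so a flag suffices */
};

static int emerg_pool_init(struct emerg_pool *ep, size_t size)
{
	ep->ep_buf = malloc(size);
	if (ep->ep_buf == NULL)
		return -1;
	ep->ep_size   = size;
	ep->ep_in_use = 0;
	return 0;
}

/* Try the normal allocator first; fall back to the reserved buffer
 * only when regular allocation fails (simulated here with a flag). */
static void *emerg_alloc(struct emerg_pool *ep, size_t size, int simulate_enospc)
{
	void *ptr = simulate_enospc ? NULL : malloc(size);

	if (ptr != NULL)
		return ptr;
	if (size > ep->ep_size || ep->ep_in_use)
		return NULL; /* emergency buffer too small or busy */
	ep->ep_in_use = 1;
	return ep->ep_buf;
}
```

The point of the `ep_in_use` flag is that the emergency buffer is a singleton: whoever grabs it must finish (commit or abort) before anyone else may use it.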

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Ticket title is 'Aurora daos_user: SCM single target ran out of space (min:0 B) and not able to finish GC. '
Status is 'In Progress'
https://daosio.atlassian.net/browse/DAOS-18690

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18690_2 branch 2 times, most recently from 7f7c502 to 5c04177 Compare April 18, 2026 13:45
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18039/3/testReport/

@Nasf-Fan
Contributor Author

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18039/3/testReport/

test_dfuse_daos_build_wb failed for DAOS-18813, not related to the patch.

@Nasf-Fan Nasf-Fan marked this pull request as ready for review April 21, 2026 00:53
@Nasf-Fan Nasf-Fan requested review from a team as code owners April 21, 2026 00:53
@Nasf-Fan Nasf-Fan requested review from gnailzenh and janekmi April 21, 2026 00:54
@janekmi
Contributor

janekmi commented Apr 21, 2026

I understand this PR incorporates the solution created here: #17850
But also gives a chance the system to recover in case we run out of space already. Is this the primary reason why we decided not to stop at the solution proposed by #17850 or is there some additional reasoning?

@Nasf-Fan
Contributor Author

Nasf-Fan commented Apr 22, 2026

I understand this PR incorporates the solution created here: #17850 But also gives a chance the system to recover in case we run out of space already. Is this the primary reason why we decided not to stop at the solution proposed by #17850 or is there some additional reasoning?

Yes, I discussed this with Liang. The consideration is that PR #17850 is a draft for idea feedback, and it is not clear when it can be completed. On the other hand, DTX is the key point in the space-pressure cycle: we often hit cases where VOS aggregation is blocked by non-committed DTX entries, so it cannot merge records to release space, which in turn affects GC. DTX commit logic is then also blocked, since no space can be released via VOS aggregation and GC. This patch tries to break that bad cycle via DTX internal logic plus the pre-allocation idea from PR #17850.

Once this patch is done, we will rework PR #17850 to further handle the GC-related parts; that will be relatively easy then.

If we cannot normally allocate space to hold committed DTX table,
then release some old DTX entries from current container to hold
new committed ones.

The patch also preallocates some space for TX snapshots. Related
logic, such as DTX commit and maybe GC, will switch to emergency
mode and use the preallocated buffer in case of space pressure.

Signed-off-by: Fan Yong <fan.yong@hpe.com>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18690_2 branch from 5c04177 to e84a9af Compare April 22, 2026 06:00
Comment thread src/include/daos/mem.h Outdated
Comment thread src/vos/vos_dtx.c Outdated
Comment thread src/vos/vos_dtx.c Outdated
Comment thread src/vos/vos_dtx.c Outdated
Comment thread src/vos/vos_layout.h Outdated
Comment thread src/dtx/dtx_srv.c Outdated
Comment thread src/vos/vos_dtx.c
Comment thread src/vos/vos_dtx.c Outdated
Comment thread src/vos/vos_dtx.c Outdated
Comment thread src/vos/vos_dtx.c Outdated

if (ext_df != NULL && !UMOFF_IS_NULL(ext_df->ped_emerg_buf) &&
behavior == TX_FAILURE_RETURN) {
rc = umem_tx_set_snapbuf(umm, ext_df->ped_emerg_buf, VOS_SNAPBUF_EMERG);
Contributor


Are you sure this buffer won’t be used by more than one ULT thread at the same time?

Contributor Author


There is no CPU yield during a PMEM-based VOS transaction, so nobody else can use ext_df->ped_emerg_buf until the current PMEM TX is committed or aborted. It is harmless if the buffer is used by others after the current PMEM TX is committed, right?

Contributor


It is harmless if the buffer is used by others after current PMEM TX committed, right?

Yes.

@Nasf-Fan Nasf-Fan requested a review from janekmi April 23, 2026 08:29
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18039/7/testReport/

@Nasf-Fan
Contributor Author


test_dfuse_daos_build_wb failed for DAOS-18813, not related to the patch.

@Nasf-Fan Nasf-Fan requested a review from knard38 April 26, 2026 08:07
Comment thread src/vos/vos_dtx.c
Comment thread src/vos/vos_dtx.c
if (rc != 0)
goto out;

umem_tx_set_failure_behavior(umm, TX_FAILURE_RETURN);
Contributor

@knard38 knard38 Apr 27, 2026


From my understanding, the umem_tx_begin() + umem_tx_set_failure_behavior(TX_FAILURE_RETURN) pair appears in all 4 DTX entry points that use vos_dtx_add_ptr(). Combining them into a helper should eliminate the risk of forgetting the second call (which would silently disable the emergency buffer fallback under space pressure).

If I am correct, introducing this new helper could make some sense:

   /* Begin a DTX PMEM transaction with TX_FAILURE_RETURN mode, required
    * for the emergency undo-log buffer fallback in vos_dtx_add_ptr(). */
   static inline int
   vos_dtx_tx_begin(struct umem_instance *umm)
   {
          int rc;

          rc = umem_tx_begin(umm, NULL);
          if (rc == 0)
                  umem_tx_set_failure_behavior(umm, TX_FAILURE_RETURN);
          return rc;
   }

Then, the 4 following calls would become:

   /* vos_dtx_commit() */
   rc = vos_dtx_tx_begin(vos_cont2umm(cont));

   /* vos_dtx_abort_internal() */
   rc = vos_dtx_tx_begin(umm);

   /* vos_dtx_set_flags() */
   rc = vos_dtx_tx_begin(umm);

   /* dtx_blob_aggregate() */
   rc = vos_dtx_tx_begin(umm);

Moreover, this also makes the D_ASSERT in vos_dtx_add_ptr() a stronger guarantee: any caller not using vos_dtx_tx_begin() would be immediately caught.

Contributor Author


I will refresh the patch for that.

Comment thread src/vos/vos_dtx.c
int rc;
int i;

/*
Contributor


Fully agree that a full scan of all containers to find the globally oldest blob would be too costly. Do you think a bounded scan with an early exit could be acceptable? Something like the following — capped at MAX_CONTAINERS_TO_SCAN — would keep the cost effectively O(1) while still covering the common case where a viable victim exists nearby:

   #define MAX_CONTAINERS_TO_SCAN 4

   /* Fallback for vos_dtx_reuse_cmt_blob() when the current container
    * has at most one committed blob. Scans at most MAX_CONTAINERS_TO_SCAN
    * peer containers on the same pool (all xstream-local, no locking needed).
    * Best-effort: may miss victims beyond the scan limit.
    */
   int vos_dtx_steal_cmt_blob(struct vos_container *cont) {
       struct vos_pool      *pool = cont->vc_pool;
       struct vos_container *victim;
       int                   scanned = 0;

       pool_for_each_container(pool, victim) {
           if (victim == cont)
               continue;
           if (scanned++ >= MAX_CONTAINERS_TO_SCAN)
               break;
           if (victim->cd_dtx_committed_head != victim->cd_dtx_committed_tail)
               return vos_dtx_reuse_cmt_blob(victim);
       }
       return -DER_NOSPACE;
   }

One drawback is that iteration always starts from the list head, so the same containers are checked first on every call (a potential fairness issue). A round-robin starting point would improve this, though I am not sure it is worth the added complexity.
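For illustration, the round-robin starting point mentioned above could look roughly like this. All names are hypothetical: `has_victim[]` and `next_start` stand in for the pool's container list and a per-xstream cursor (no locking needed, as noted, since everything is xstream-local).

```c
#include <assert.h>

#define MAX_SCAN 4 /* illustrative bound, mirrors MAX_CONTAINERS_TO_SCAN */

/* Cursor persisting across calls, so each scan resumes where the
 * previous one stopped instead of always hammering the list head. */
static int next_start;

/* Scan at most MAX_SCAN containers starting at next_start; return the
 * index of the first viable victim, or -1 (-DER_NOSPACE in real code). */
static int pick_victim(const int *has_victim, int cont_count)
{
	int i;

	for (i = 0; i < MAX_SCAN && i < cont_count; i++) {
		int idx = (next_start + i) % cont_count;

		if (has_victim[idx]) {
			/* Resume after the chosen victim next time. */
			next_start = (idx + 1) % cont_count;
			return idx;
		}
	}
	next_start = (next_start + MAX_SCAN) % cont_count;
	return -1;
}
```

The cursor is the only added state; the scan stays bounded and effectively O(1), but consecutive calls spread the victim selection across containers instead of repeatedly penalizing the ones at the front of the list.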

Contributor Author


OK, I will optimize the algorithm for the victim a bit, maybe not the same as your method.

Contributor


Nothing mandatory, I just wanted your opinion on this to be sure that I have understood properly.

Comment thread src/dtx/dtx_srv.c
Comment thread src/dtx/dtx_rpc.c
}
}

/*
Contributor


From my side, it took me some time to understand why we use PARTIAL_COMMITTED + retry instead of just reverting the partial commit.
If I understand correctly, the remote participants commit before the leader, so by the time the leader's local commit partially fails, the data is already visible on remote nodes. Thus, aborting an already-committed DTX there would corrupt it?

For contributors less familiar with the DTX commit protocol, like me, an expanded comment could be helpful:

    /*
     * Remote participants committed before the leader (see ordering comment above).
     * If the leader's local commit partially fails (e.g., -DER_NOSPACE), reverting
     * remote participants is not possible: aborting an already-committed DTX would
     * corrupt data visible on those nodes. Instead, mark all entries in this batch
     * as PARTIAL_COMMITTED so the next batched commit retries them. Re-committing
     * an already-committed DTX is always safe.
     */

This is just a suggestion to improve readability: no issue to keep as-is if you think it is not needed.

Contributor Author


Currently, there is no way to revert a partial commit, since we do not know whether someone has already read the related partially committed data on the affected targets. I will add some comments to make things clearer.
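For readers new to the protocol, the retry idea discussed in this thread can be sketched as follows. The `dtx_state` values and `commit_batch` are hypothetical stand-ins for illustration, not the actual DAOS types: on a mid-batch failure, every entry is marked for retry rather than reverted, relying on the fact that re-committing an already-committed DTX is safe.

```c
#include <assert.h>

/* Illustrative states: PARTIAL_COMMITTED marks a batch whose local
 * commit failed mid-way and must be retried by a later batched commit. */
enum dtx_state { DTX_PREPARED, DTX_PARTIAL_COMMITTED, DTX_COMMITTED };

/* Commit a batch; fail_at < 0 means no failure. On failure (e.g. a
 * simulated -DER_NOSPACE) mark the whole batch for retry and bail out:
 * remote participants already committed, so reverting is not an option. */
static int commit_batch(enum dtx_state *batch, int count, int fail_at)
{
	int i;

	for (i = 0; i < count; i++) {
		if (i == fail_at) {
			int j;

			for (j = 0; j < count; j++)
				batch[j] = DTX_PARTIAL_COMMITTED;
			return -1; /* next batched commit retries these */
		}
		batch[i] = DTX_COMMITTED;
	}
	return 0;
}
```

Marking even the already-committed entries is deliberate: it keeps the failure path trivial, and the retry is idempotent.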

Contributor


Thanks, I really think it will help newcomers such as me.

Comment thread src/common/mem.c
return TX_FAILURE_RETURN;
default:
D_ASSERTF(0, "Unknown TX failure behavior %d\n", behavior);
return -DER_INVAL;
Contributor


No return after unconditional assert.

Suggested change
return -DER_INVAL;

Contributor Author


Some static analysis tools may warn with "missing return" or similar for such a case.
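The pattern under discussion can be shown with a tiny standalone example (names here are illustrative, not the DAOS code): with assertions compiled out under NDEBUG, the default branch would otherwise fall off the end of a value-returning function, which is exactly what analyzers flag.

```c
#include <assert.h>

enum tx_behavior { TX_RETURN = 1 };

/* Map a behavior value to a return code. The return after the
 * unconditional assert is unreachable in debug builds, but keeps the
 * function well-defined when assert() is compiled out (NDEBUG) and
 * silences "control reaches end of non-void function" diagnostics. */
static int behavior_to_rc(int behavior)
{
	switch (behavior) {
	case TX_RETURN:
		return TX_RETURN;
	default:
		assert(0 && "Unknown TX failure behavior");
		return -22; /* illustrative error code, e.g. -DER_INVAL */
	}
}
```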

Comment thread src/vos/vos_layout.h
/* Memory file size for md-on-ssd phase2 pool */
uint64_t ped_mem_sz;
/* emergency buffer for GC */
umem_off_t ped_emerg_buf;
Contributor


Have you considered adding an ability to upgrade already existing pools or is it unlikely a feature like this would be actually helpful?


Comment thread src/dtx/dtx_srv.c

rc1 = dtx_commit_large(cont->sc_hdl, (struct dtx_id *)(din->di_dtx_array.ca_arrays),
din->di_dtx_array.ca_count, false, NULL);
/* The count of DTX entries will not exceed DTX_THRESHOLD_COUNT. */
Contributor


Suggested change
/* The count of DTX entries will not exceed DTX_THRESHOLD_COUNT. */
D_ASSERT(din->di_dtx_array.ca_count <= DTX_THRESHOLD_COUNT);

Contributor Author


We cannot directly assert on a value that comes from the network. If the sender supplied an invalid value, whether by mistake or intentionally, then dtx_commit_large() will handle that.
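A minimal sketch of the validate-don't-assert point above (the `DTX_THRESHOLD_COUNT` value and `validate_dtx_count` helper are illustrative, not the real DAOS definitions): network-supplied values are range-checked and rejected with an error code, since a buggy or malicious sender controls them and an assert would turn bad input into a server crash.

```c
#include <assert.h>
#include <stdint.h>

#define DTX_THRESHOLD_COUNT 32 /* illustrative value, not the real constant */

/* Reject out-of-range counts received over the wire instead of
 * asserting on them; returns 0 on success, -1 (think -DER_INVAL)
 * on invalid input. */
static int validate_dtx_count(uint32_t count)
{
	if (count == 0 || count > DTX_THRESHOLD_COUNT)
		return -1;
	return 0;
}
```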

Comment thread src/vos/vos_dtx.c
Comment on lines +1059 to +1069
/*
* Space is almost exhausted. Under such case, we must reclaim space to make current
* DTX commit to be proceed; otherwise, uncommitted DTX may block VOS aggregation as
* to prevent further space release. The most direct approach is to reclaim some old
* blob from current container's committed DTX table. It may be unfair because other
* containers could hold older committed DTX entries. However, it maybe not worth to
* scan all pools/containers on the target to find the globally oldest committed DTX
* blob under space pressure. For now, we select current container as the victim.
*
* This can be optimized later. DAOS-18690.
*/
Contributor


Nitpick. This comment seems to apply to the whole vos_dtx_reuse_cmt_blob() function now. Considering its length, it would be good to move it there.

Contributor Author


I will re-implement vos_dtx_extend_cmt_table() and fix all the related issues with comments, code style, and log messages.

Comment thread src/vos/vos_dtx.c
tail = umem_off2ptr(umm, cont_df->cd_dtx_committed_tail);
D_ASSERT(tail != NULL);

dbd = umem_off2ptr(umm, head->dbd_next);
Contributor


dbd and dbd_off are not related despite sharing the dbd part of their name which makes following this function quite confusing to me. Can you please make it straight?

Comment thread src/vos/vos_dtx.c
Comment on lines +1105 to +1113
rc = vos_dtx_add_ptr(cont->vc_pool, head, DTX_CMT_BLOB_SIZE);
if (rc != 0)
goto out;

/* dbd_next is next to dbd_prev */
rc = vos_dtx_add_ptr(cont->vc_pool, &head->dbd_prev,
sizeof(head->dbd_prev) + sizeof(head->dbd_next));
if (rc != 0)
goto out;
Contributor


It looks like these two overlap, and the first one seems a little bit excessive. Or am I missing something?

Contributor Author


Right, I will fix it.

Comment thread src/vos/vos_dtx.c

if (count > 0) {
D_ASSERTF(cont->vc_dtx_committed_count >= count,
"Unexpected committed DTX entries count for " DF_UUID ": %u/%u\n",
Contributor


Suggested change
"Unexpected committed DTX entries count for " DF_UUID ": %u/%u\n",
"Unexpected committed DTX entries count for " DF_UUID ": %"PRIu32"/%"PRIu32"\n",

Comment thread src/vos/vos_dtx.c
DP_UUID(cont->vc_id), cont->vc_dtx_committed_count, count);

cont->vc_dtx_committed_count -= count;
cont->vc_pool->vp_dtx_committed_count -= count;
Contributor


I think the counter on the pool level could use the same assert as you wrote for the container level counter.

Contributor Author


pool->vp_dtx_committed_count will always be no less than cont->vc_dtx_committed_count, so such a check is redundant.

Comment thread src/vos/vos_dtx.c
Comment on lines +1130 to +1137
/* Current @head will be reused, move it after @tail, @dbd will be the new head. */

dbd->dbd_prev = head->dbd_prev;
head->dbd_prev = cont_df->cd_dtx_committed_tail;
tail->dbd_next = cont_df->cd_dtx_committed_head;
cont_df->cd_dtx_committed_head = head->dbd_next;
head->dbd_next = UMOFF_NULL;
cont_df->cd_dtx_committed_tail = dbd_off;
Contributor


IMHO, how it stands right now is rather hard to follow. Can we break it into two steps like this:

Suggested change
/* Current @head will be reused, move it after @tail, @dbd will be the new head. */
dbd->dbd_prev = head->dbd_prev;
head->dbd_prev = cont_df->cd_dtx_committed_tail;
tail->dbd_next = cont_df->cd_dtx_committed_head;
cont_df->cd_dtx_committed_head = head->dbd_next;
head->dbd_next = UMOFF_NULL;
cont_df->cd_dtx_committed_tail = dbd_off;
/* Move current @head after @tail. */
head->dbd_next = UMOFF_NULL;
head->dbd_prev = cont_df->cd_dtx_committed_tail;
tail->dbd_next = cont_df->cd_dtx_committed_head;
cont_df->cd_dtx_committed_tail = cont_df->cd_dtx_committed_head;
/* Make @dbd the new head. */
dbd->dbd_prev = UMOFF_NULL;
cont_df->cd_dtx_committed_head = dbd_off; /* XXX assuming dbd and dbd_off are related */

@Nasf-Fan
Contributor Author

Have you considered adding an ability to upgrade already existing pools or is it unlikely a feature like this would be actually helpful?

That is a good point. We plan to land a small patch that reserves the buffer for space emergencies in 2.8, so we will not need to handle existing pools without such an extension when upgrading from 2.8 to 2.8.x/master. As for pools from 2.6, there are too many pool layout differences; for example, a 2.6 pool does not have the whole vos_pool_ext_df, so we have no conclusion yet about how to upgrade them.

@Nasf-Fan
Contributor Author


So this patch will be split into two parts; the small part is #18137.

@Nasf-Fan
Contributor Author

Nasf-Fan commented May 5, 2026


The small part has already been landed to master. Another part is #18141. This one will be closed.

@Nasf-Fan Nasf-Fan closed this May 5, 2026