DAOS-18690 vos: handle DTX commit under space pressure#18039
Conversation
Ticket title is 'Aurora daos_user: SCM single target ran out of space (min:0 B) and not able to finish GC.'
7f7c502 to 5c04177
|
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18039/3/testReport/
test_dfuse_daos_build_wb failed for DAOS-18813, not related to the patch.
Yes, I discussed this with Liang. The consideration is that PR#17850 is a draft intended for feedback on the idea, and it is not clear when it can be completed. On the other hand, DTX is the key point in the space-pressure cycle. We often hit cases where VOS aggregation is blocked by some non-committed DTX entries, so it cannot merge records to release space, and GC is then further affected. DTX commit logic is in turn blocked, since no space can be released via VOS aggregation and GC. This patch tries to break that bad cycle via DTX internal logic plus the pre-allocation idea from PR#17850. After this patch is done, we will rework PR#17850 to handle the GC-related parts, which will be relatively easy then.
If we cannot normally allocate space to hold the committed DTX table, then release some old DTX entries from the current container to make room for new committed ones. The patch also preallocates some space for TX snapshots. Related logic, such as DTX commit and possibly GC, will switch to emergency mode and use the preallocated buffer under space pressure. Signed-off-by: Fan Yong <fan.yong@hpe.com>
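The emergency-mode fallback described here can be sketched as a tiny allocator model. Everything below is hypothetical and invented for illustration (the arena layout, sizes, and the DER_NOSPACE value are not the DAOS implementation): a normal allocation is tried first, and the preallocated emergency buffer is handed out only when the normal path has no space left.

```c
#include <assert.h>
#include <stddef.h>

#define DER_NOSPACE 1007	/* illustrative error code, not the real value */

/* Toy arena: a "normal" region plus a preallocated emergency buffer. */
struct arena {
	char	normal[64];
	size_t	used;
	char	emerg[32];
	int	emerg_busy;
};

/* Try the normal path first; fall back to the emergency buffer under
 * space pressure, mirroring the commit-time behavior described above. */
static void *
alloc_with_fallback(struct arena *a, size_t sz, int *rc)
{
	*rc = 0;
	if (a->used + sz <= sizeof(a->normal)) {
		void *p = a->normal + a->used;

		a->used += sz;
		return p;
	}
	if (!a->emerg_busy && sz <= sizeof(a->emerg)) {
		a->emerg_busy = 1;	/* emergency mode engaged */
		return a->emerg;
	}
	*rc = -DER_NOSPACE;
	return NULL;
}
```

Once the emergency buffer is released (for example when the enclosing transaction finishes), `emerg_busy` would be cleared and the fallback becomes available again.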
5c04177 to e84a9af
| if (ext_df != NULL && !UMOFF_IS_NULL(ext_df->ped_emerg_buf) && | ||
| behavior == TX_FAILURE_RETURN) { | ||
| rc = umem_tx_set_snapbuf(umm, ext_df->ped_emerg_buf, VOS_SNAPBUF_EMERG); |
Are you sure this buffer won’t be used by more than one ULT thread at the same time?
There will be no CPU yield during a PMEM-based VOS transaction, so nobody else can use ext_df->ped_emerg_buf until the current PMEM TX is committed or aborted. It is harmless if the buffer is used by others after the current PMEM TX is committed, right?
It is harmless if the buffer is used by others after current PMEM TX committed, right?
Yes.
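The no-yield argument above can be illustrated with a minimal ownership model. This sketch is hypothetical (the flag and function names are invented): on a cooperatively scheduled xstream, code that does not yield between acquiring and releasing the buffer can never observe it busy, so a second acquirer would only appear if the no-yield invariant were broken.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative single-owner flag; the real code relies on the absence
 * of CPU yields inside a PMEM transaction rather than on a flag. */
static bool emerg_buf_busy;

static int
emerg_buf_acquire(void)
{
	if (emerg_buf_busy)
		return -1;	/* would mean the no-yield invariant broke */
	emerg_buf_busy = true;
	return 0;
}

static void
emerg_buf_release(void)
{
	emerg_buf_busy = false;
}
```

After release (i.e., after the transaction commits or aborts), the buffer can safely be acquired by another user, which matches the "harmless afterwards" point above.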
Signed-off-by: Fan Yong <fan.yong@hpe.com>
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18039/7/testReport/
test_dfuse_daos_build_wb failed for DAOS-18813, not related to the patch.
| if (rc != 0) | ||
| goto out; | ||
| umem_tx_set_failure_behavior(umm, TX_FAILURE_RETURN); |
From my understanding, the umem_tx_begin() + umem_tx_set_failure_behavior(TX_FAILURE_RETURN) pair appears in all 4 DTX entry points that use vos_dtx_add_ptr(). Combining them into a helper should eliminate the risk of forgetting the second call (which would silently disable the emergency buffer fallback under space pressure).
If I am correct, introducing this new helper could make some sense:
/* Begin a DTX PMEM transaction with TX_FAILURE_RETURN mode, required
 * for the emergency undo-log buffer fallback in vos_dtx_add_ptr(). */
static inline int
vos_dtx_tx_begin(struct umem_instance *umm)
{
	int	rc;

	rc = umem_tx_begin(umm, NULL);
	if (rc == 0)
		umem_tx_set_failure_behavior(umm, TX_FAILURE_RETURN);
	return rc;
}

Then, the 4 following calls would become:
/* vos_dtx_commit() */
rc = vos_dtx_tx_begin(vos_cont2umm(cont));
/* vos_dtx_abort_internal() */
rc = vos_dtx_tx_begin(umm);
/* vos_dtx_set_flags() */
rc = vos_dtx_tx_begin(umm);
/* dtx_blob_aggregate() */
rc = vos_dtx_tx_begin(umm);

Moreover, this also makes the D_ASSERT in vos_dtx_add_ptr() a stronger guarantee: any caller not using vos_dtx_tx_begin() would be immediately caught.
I will refresh the patch for that.
| int rc; | ||
| int i; | ||
| /* |
Fully agree that a full scan of all containers to find the globally oldest blob would be too costly. Do you think a bounded scan with an early exit could be acceptable? Something like the following — capped at MAX_CONTAINERS_TO_SCAN — would keep the cost effectively O(1) while still covering the common case where a viable victim exists nearby:
#define MAX_CONTAINERS_TO_SCAN 4
/* Fallback for vos_dtx_reuse_cmt_blob() when the current container
* has at most one committed blob. Scans at most MAX_CONTAINERS_TO_SCAN
* peer containers on the same pool (all xstream-local, no locking needed).
* Best-effort: may miss victims beyond the scan limit.
*/
int
vos_dtx_steal_cmt_blob(struct vos_container *cont)
{
	struct vos_pool		*pool = cont->vc_pool;
	struct vos_container	*victim;
	int			 scanned = 0;

	pool_for_each_container(pool, victim) {
		if (victim == cont)
			continue;
		if (scanned++ >= MAX_CONTAINERS_TO_SCAN)
			break;
		if (victim->cd_dtx_committed_head != victim->cd_dtx_committed_tail)
			return vos_dtx_reuse_cmt_blob(victim);
	}
	return -DER_NOSPACE;
}

One drawback is that iteration always starts from the list head, so the same containers are checked first on every call (potential fairness issue). A round-robin starting point would improve this, though I am not sure it is worth the added complexity.
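The round-robin idea could look like the following. This is a hypothetical sketch (the cursor and function are invented; the real code would track a position in the pool's container list): a per-xstream cursor rotates the first container examined so the same containers are not always scanned first.

```c
#include <assert.h>

/* Per-xstream cursor; no locking needed since ULTs on one xstream
 * are cooperatively scheduled. */
static unsigned int scan_cursor;

/* Return the index of the container to examine first, rotating by one
 * on every call so scan pressure is spread across containers. */
static unsigned int
next_scan_start(unsigned int ncont)
{
	unsigned int start = scan_cursor % ncont;

	scan_cursor = (scan_cursor + 1) % ncont;
	return start;
}
```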
OK, I will optimize the victim-selection algorithm a bit, though maybe not in the same way as your method.
Nothing mandatory; I just wanted your opinion on this to be sure I have understood properly.
| } | ||
| } | ||
| /* |
From my side, it took me some time to understand why we use PARTIAL_COMMITTED + retry instead of just reverting the partial commit.
If I understand correctly, the remote participants commit before the leader, so by the time the leader's local commit partially fails, the data is already visible on remote nodes. Thus, aborting an already-committed DTX there would corrupt it?
For contributors less familiar with the DTX commit protocol, like me, an expanded comment could be helpful:
/*
* Remote participants committed before the leader (see ordering comment above).
* If the leader's local commit partially fails (e.g., -DER_NOSPACE), reverting
* remote participants is not possible: aborting an already-committed DTX would
* corrupt data visible on those nodes. Instead, mark all entries in this batch
* as PARTIAL_COMMITTED so the next batched commit retries them. Re-committing
* an already-committed DTX is always safe.
*/

This is just a suggestion to improve readability: no issue with keeping it as-is if you think it is not needed.
Currently, there is no way to revert a partial commit, since we do not know whether someone has already read the related partially committed data on the related targets. I will add some comments to make things clearer.
Thanks, I really think it will help newcomers such as me.
| return TX_FAILURE_RETURN; | ||
| default: | ||
| D_ASSERTF(0, "Unknown TX failure behavior %d\n", behavior); | ||
| return -DER_INVAL; |
No return is needed after an unconditional assert.
| return -DER_INVAL; |
Some static analysis tools may warn about a "missing return" or similar in such a case.
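The point about analyzers can be reproduced in a standalone sketch. The constants below are invented stand-ins, not the DAOS definitions: when assert() compiles away under NDEBUG, the default branch would fall off the end of a non-void function without the trailing return, which is exactly what such tools (and -Wreturn-type) warn about.

```c
#include <assert.h>

#define TX_FAILURE_ABORT  0	/* illustrative stand-ins */
#define TX_FAILURE_RETURN 1
#define DER_INVAL	  1003

static int
behavior_to_flag(int behavior)
{
	switch (behavior) {
	case TX_FAILURE_ABORT:
		return 0;
	case TX_FAILURE_RETURN:
		return 1;
	default:
		assert(0 && "Unknown TX failure behavior");
		return -DER_INVAL;	/* unreachable in debug builds, but
					 * required once assert() is compiled
					 * out under NDEBUG */
	}
}
```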
| /* Memory file size for md-on-ssd phase2 pool */ | ||
| uint64_t ped_mem_sz; | ||
| /* emergency buffer for GC */ | ||
| umem_off_t ped_emerg_buf; |
Have you considered adding an ability to upgrade already existing pools or is it unlikely a feature like this would be actually helpful?
| rc1 = dtx_commit_large(cont->sc_hdl, (struct dtx_id *)(din->di_dtx_array.ca_arrays), | ||
| din->di_dtx_array.ca_count, false, NULL); | ||
| /* The count of DTX entries will not exceed DTX_THRESHOLD_COUNT. */ |
| /* The count of DTX entries will not exceed DTX_THRESHOLD_COUNT. */ | |
| D_ASSERT(din->di_dtx_array.ca_count <= DTX_THRESHOLD_COUNT); |
We cannot directly assert on a value that comes from the network. If the sender supplied an invalid value, whether by mistake or intentionally, dtx_commit_large() will handle it.
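The rule above, range-check network input and return an error rather than assert, can be sketched as follows. The DTX_THRESHOLD_COUNT and DER_INVAL values here are illustrative stand-ins, not the real DAOS constants.

```c
#include <assert.h>
#include <stddef.h>

#define DTX_THRESHOLD_COUNT 256		/* illustrative cap */
#define DER_INVAL	    1003	/* illustrative error code */

/* A sender-controlled count must be validated, never asserted: a buggy
 * or malicious peer can put any value on the wire. */
static int
validate_dtx_count(size_t count)
{
	if (count == 0 || count > DTX_THRESHOLD_COUNT)
		return -DER_INVAL;	/* reject gracefully */
	return 0;
}
```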
| /* | ||
| * Space is almost exhausted. In such a case, we must reclaim space so that the current | ||
| * DTX commit can proceed; otherwise, uncommitted DTX entries may block VOS aggregation | ||
| * and prevent further space release. The most direct approach is to reclaim some old | ||
| * blob from the current container's committed DTX table. It may be unfair because other | ||
| * containers could hold older committed DTX entries. However, it may not be worth | ||
| * scanning all pools/containers on the target to find the globally oldest committed DTX | ||
| * blob under space pressure. For now, we select the current container as the victim. | ||
| * | ||
| * This can be optimized later. DAOS-18690. | ||
| */ |
Nitpick. This comment seems to apply to the whole vos_dtx_reuse_cmt_blob() function now. Considering its length, it would be good to move it there.
I will re-implement vos_dtx_extend_cmt_table() and fix all the related issues with comments, code style, and log messages.
| tail = umem_off2ptr(umm, cont_df->cd_dtx_committed_tail); | ||
| D_ASSERT(tail != NULL); | ||
| dbd = umem_off2ptr(umm, head->dbd_next); |
dbd and dbd_off are not related, despite sharing the dbd part of their names, which makes following this function quite confusing to me. Can you please straighten this out?
| rc = vos_dtx_add_ptr(cont->vc_pool, head, DTX_CMT_BLOB_SIZE); | ||
| if (rc != 0) | ||
| goto out; | ||
| /* dbd_next is next to dbd_prev */ | ||
| rc = vos_dtx_add_ptr(cont->vc_pool, &head->dbd_prev, | ||
| sizeof(head->dbd_prev) + sizeof(head->dbd_next)); | ||
| if (rc != 0) | ||
| goto out; |
It looks like these two overlap, and the first one seems a bit excessive. Or am I missing something?
Right, I will fix it.
| if (count > 0) { | ||
| D_ASSERTF(cont->vc_dtx_committed_count >= count, | ||
| "Unexpected committed DTX entries count for " DF_UUID ": %u/%u\n", |
| "Unexpected committed DTX entries count for " DF_UUID ": %u/%u\n", | |
| "Unexpected committed DTX entries count for " DF_UUID ": %"PRIu32"/%"PRIu32"\n", |
| DP_UUID(cont->vc_id), cont->vc_dtx_committed_count, count); | ||
| cont->vc_dtx_committed_count -= count; | ||
| cont->vc_pool->vp_dtx_committed_count -= count; |
I think the counter on the pool level could use the same assert as you wrote for the container level counter.
pool->vp_dtx_committed_count will always be at least cont->vc_dtx_committed_count, so such a check is redundant.
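The invariant behind this reply, that the pool counter is the sum of the per-container counters and so can never be smaller than any one of them, can be modeled like this (hypothetical types and names, not the DAOS structures):

```c
#include <assert.h>

/* Toy model: one pool-level total and two per-container counts. */
struct counters {
	unsigned int pool;
	unsigned int cont_a;
	unsigned int cont_b;
};

/* Commit/release always update a container counter and the pool counter
 * in lockstep, so pool >= cont_a and pool >= cont_b hold at all times. */
static void
commit(struct counters *c, unsigned int *cont, unsigned int n)
{
	*cont += n;
	c->pool += n;
}

static void
release(struct counters *c, unsigned int *cont, unsigned int n)
{
	assert(*cont >= n);	/* container-level check implies pool-level */
	*cont -= n;
	c->pool -= n;
}
```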
| /* Current @head will be reused, move it after @tail, @dbd will be the new head. */ | ||
| dbd->dbd_prev = head->dbd_prev; | ||
| head->dbd_prev = cont_df->cd_dtx_committed_tail; | ||
| tail->dbd_next = cont_df->cd_dtx_committed_head; | ||
| cont_df->cd_dtx_committed_head = head->dbd_next; | ||
| head->dbd_next = UMOFF_NULL; | ||
| cont_df->cd_dtx_committed_tail = dbd_off; |
IMHO, how it stands right now is rather hard to follow. Can we break it into two steps like this:
| /* Current @head will be reused, move it after @tail, @dbd will be the new head. */ | |
| dbd->dbd_prev = head->dbd_prev; | |
| head->dbd_prev = cont_df->cd_dtx_committed_tail; | |
| tail->dbd_next = cont_df->cd_dtx_committed_head; | |
| cont_df->cd_dtx_committed_head = head->dbd_next; | |
| head->dbd_next = UMOFF_NULL; | |
| cont_df->cd_dtx_committed_tail = dbd_off; | |
| /* Move current @head after @tail. */ | |
| head->dbd_next = UMOFF_NULL; | |
| head->dbd_prev = cont_df->cd_dtx_committed_tail; | |
| tail->dbd_next = cont_df->cd_dtx_committed_head; | |
| cont_df->cd_dtx_committed_tail = cont_df->cd_dtx_committed_head; | |
| /* Make @dbd the new head. */ | |
| dbd->dbd_prev = UMOFF_NULL; | |
| cont_df->cd_dtx_committed_head = dbd_off; /* XXX assuming dbd and dbd_off are related */ |
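The two-step rewrite suggested above can be checked with a toy doubly linked list. Plain pointers stand in for the umem_off_t offsets and the names are invented; the point is only that "move head after tail, then promote the second blob to head" yields the intended order.

```c
#include <assert.h>
#include <stddef.h>

struct blob {
	struct blob *prev;
	struct blob *next;
};

struct list {
	struct blob *head;
	struct blob *tail;
};

/* Recycle the oldest blob: move @head behind @tail, then make the
 * second blob the new head.  Requires at least two blobs, as in the
 * code path under discussion where the head is being reused. */
static void
recycle_head(struct list *l)
{
	struct blob *head = l->head;
	struct blob *new_head = head->next;

	assert(new_head != NULL);

	/* Step 1: move current head after tail. */
	head->next = NULL;
	head->prev = l->tail;
	l->tail->next = head;
	l->tail = head;

	/* Step 2: make the second blob the new head. */
	new_head->prev = NULL;
	l->head = new_head;
}
```

For a list a-b-c, the result is b-c-a, with b as the new head and the recycled blob a appended behind the old tail.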
That is a good point. We plan to land a small patch that reserves a buffer for space emergencies in 2.8; then we do not need to consider handling existing pools without such an extension when upgrading from 2.8 to 2.8.x/master. As for pools from 2.6, there are too many pool layout differences; for example, a 2.6 pool does not have the whole
So this patch will be split into two parts; the small part is #18137.
The small part has already landed on master. The other part is #18141. This one will be closed.