
Conversation


@sundb sundb commented Feb 14, 2025

based on #372

yveslb and others added 5 commits February 13, 2025 08:42
…ext (redis#13794)

Help text modified for -i, --count, --pattern.
…cond (redis#13793)

```
if (server.aof_fsync == AOF_FSYNC_EVERYSEC &&
    server.aof_last_incr_fsync_offset != server.aof_last_incr_size &&
    server.mstime - server.aof_last_fsync >= 1000 &&
    !(sync_in_progress = aofFsyncInProgress())) {
    goto try_fsync;
```
In redis#12622, when appendfsync=everysec, if Redis has written some data to
the AOF but not yet `fsync`ed it, and less than 1 second has passed since the
last `fsync`, Redis won't fsync the AOF, but it will still update
`fsynced_reploff_pending`, which causes `WAITAOF` to return prematurely.

This bug was introduced in redis#12622, starting from Redis 7.4.

The bug fix,
redis@1bd6688,
is as follows:
```diff
diff --git a/src/aof.c b/src/aof.c
index 8ccd8d8..521b304 100644
--- a/src/aof.c
+++ b/src/aof.c
@@ -1096,8 +1096,11 @@ void flushAppendOnlyFile(int force) {
              * in which case master_repl_offset will increase but fsynced_reploff_pending won't be updated
              * (because there's no reason, from the AOF POV, to call fsync) and then WAITAOF may wait on
              * the higher offset (which contains data that was only propagated to replicas, and not to AOF) */
-            if (!sync_in_progress && server.aof_fsync != AOF_FSYNC_NO)
+            if (server.aof_last_incr_fsync_offset == server.aof_last_incr_size &&
+                !(sync_in_progress = aofFsyncInProgress()))
+            {
                 atomicSet(server.fsynced_reploff_pending, server.master_repl_offset);
+            }
             return;
```
Additionally, we slightly refactored the AOF fsync logic to make it simpler, as in
redis@584f008
The reason the master sends PING is to keep the connection with the replica
active, so the master need not send PING to replicas if it has already sent
replication stream within the past `repl_ping_slave_period` time.

Now the master only sends PINGs and increases `master_repl_offset` when there
is no traffic, so this PR also reduces the impact of the issue in
redis#13773, though of course it does not resolve it completely.
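A hedged sketch of the intended check (not the actual replication.c code; the `last_stream_write_time` field is a hypothetical name used only for illustration):

```c
/* Only a master sends the periodic PING, and only when no replication stream
 * was propagated to its replicas within the last repl_ping_slave_period
 * seconds. Sketch only; field names are illustrative. */
int masterShouldSendPeriodicPingSketch(void) {
    if (server.masterhost != NULL) return 0;        /* replicas don't ping */
    if (listLength(server.slaves) == 0) return 0;   /* nobody to keep alive */
    if (server.unixtime - last_stream_write_time < server.repl_ping_slave_period)
        return 0;                                   /* link is already active */
    return 1; /* idle link: send PING, which also advances master_repl_offset */
}
```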
> Fix ping causing inconsistency between AOF size and replication offset
in the future PR. Because we increment the replication offset when
sending PING/REPLCONF to the replica but do not write data to the AOF
file, this might cause the starting offset of the AOF file plus its size
to be inconsistent with the actual replication offset.
### Background 
AOF is often used as an effective data recovery method, but if we
have two AOFs from different nodes, it is currently hard to tell which one has
the latest data. Generally, we determine whose data is more up-to-date by
reading the latest modification time of the AOF file, but because of
replication delay, even if both master and replica write to the AOF at
the same time, the data in the master is more up-to-date (there may be
commands that haven't arrived at the replica yet, or a large number of
commands may have accumulated on the replica side), so we may make the
wrong decision.

### Solution
The replication offset always increments when AOF is enabled, even if
there is no replica, so we think the replication offset is a better way to
determine which AOF has more up-to-date data: whichever has the larger
offset has the newer data. We therefore add the start replication offset info
to the AOF manifest, as below.
```
file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.2.incr.aof seq 2 type i startoffset 224
```
And if the AOF file is closed gracefully rather than by a crash, such as on
`shutdown`, `kill` signal 15 (SIGTERM) or `config set appendonly no`, we also
add the end replication offset, as below.
```
file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.2.incr.aof seq 2 type i startoffset 224 endoffset 532
```

#### Things to pay attention to
- For the BASE AOF, we do not add `startoffset` and `endoffset` info, since
we cannot know the start replication offset of its data, and it would not
help us determine which AOF has more up-to-date data.
- For AOFs from older versions, we also don't add `startoffset` and
`endoffset` info, since we don't know their start replication offset either.
If we assumed a start offset of 0, we might make the judgment even less
accurate. For example, if the master has just rewritten the AOF, its INCR AOF
will inevitably be very small, whereas if the replica has not rewritten its
AOF for a long time, its INCR AOF might be much larger. Comparing offsets in
that case could lead to incorrect decisions, so for such files we still just
check the timestamp instead of adding offset info.
- If the last INCR AOF has `startoffset` or `endoffset`, we need to
restore `server.master_repl_offset` from them to avoid a rollback of the
`startoffset` of the next INCR AOF. If it has `endoffset`, we just use that
value as `server.master_repl_offset`, and, importantly, we remove this
information from the manifest file so that the next load does not see a stale
`endoffset`. If it only has `startoffset`, we calculate
`server.master_repl_offset` as `startoffset` plus the file size.

### How to determine which one has more up-to-date data
If AOF has a larger replication offset, it will have more up-to-date
data. The following is how to get AOF offset:

Read the AOF manifest file to obtain information about **the last INCR
AOF**
1. If the last INCR AOF has an `endoffset` field, we can directly use
`endoffset` as the replication offset of the AOF
2. If there is no `endoffset` (e.g. Redis crashed abnormally), but the last
INCR AOF has a `startoffset` field, we can get the replication offset of the
AOF from `startoffset` plus the file size
3. Finally, if the AOF has neither `startoffset` nor `endoffset`, perhaps
because it comes from an old version and the new version of Redis has not
rewritten the AOF yet, we still need to check the modification timestamp of
the last INCR AOF (see the sketch below)
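A minimal C sketch of that decision procedure (not the actual aof.c code); `startoffset`/`endoffset` come from the manifest line of the last INCR AOF, with `-1` standing in for an absent field:

```c
/* Return the replication offset represented by an AOF set, or -1 when there is
 * no offset info and we must fall back to comparing modification timestamps. */
long long aofReplOffsetSketch(long long startoffset, long long endoffset,
                              long long incr_file_size) {
    if (endoffset != -1) return endoffset;                       /* step 1 */
    if (startoffset != -1) return startoffset + incr_file_size;  /* step 2 */
    return -1;                                                   /* step 3 */
}
```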

### TODO
Fix, in a future PR, the PING-caused inconsistency between AOF size and
replication offset. Because we increment the replication offset when sending
PING/REPLCONF to the replica but do not write that data to the AOF file, the
starting offset of the AOF file plus its size may be inconsistent with the
actual replication offset.
MEMORY USAGE on a List samples quicklist nodes, but does not account
for how many elements are in each sampled node. This can skew the
calculation when the sampled nodes are not balanced.
The fix calculates the average element size across the sampled nodes instead
of the average node size.
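A minimal sketch of the corrected averaging, assuming the standard quicklist fields (`head`, `count`, and per-node `sz`/`count`); it is illustrative rather than the exact object.c patch:

```c
/* Extrapolate list memory from the average *element* size observed in the
 * sampled nodes, so unbalanced nodes do not skew the estimate. Sketch only. */
size_t listMemUsageSketch(quicklist *ql, size_t nodes_to_sample) {
    size_t sampled_bytes = 0, sampled_elements = 0, sampled_nodes = 0;
    quicklistNode *node = ql->head;
    while (node && sampled_nodes < nodes_to_sample) {
        sampled_bytes += node->sz;        /* listpack bytes held by this node */
        sampled_elements += node->count;  /* elements stored in this node */
        sampled_nodes++;
        node = node->next;
    }
    if (sampled_elements == 0) return 0;
    return (size_t)((double)sampled_bytes / sampled_elements * ql->count);
}
```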

codecov bot commented Feb 14, 2025

Codecov Report

Attention: Patch coverage is 89.48546% with 47 lines in your changes missing coverage. Please review.

Project coverage is 69.04%. Comparing base (dd40ed1) to head (6cd97f8).
Report is 12 commits behind head on incrementail_defrag_module.

Files with missing lines Patch % Lines
src/module.c 2.77% 35 Missing ⚠️
src/t_hash.c 97.09% 10 Missing ⚠️
src/aof.c 96.22% 2 Missing ⚠️
Additional details and impacted files
@@                      Coverage Diff                       @@
##           incrementail_defrag_module     #373      +/-   ##
==============================================================
+ Coverage                       69.01%   69.04%   +0.03%     
==============================================================
  Files                             118      118              
  Lines                           68048    68401     +353     
==============================================================
+ Hits                            46962    47228     +266     
- Misses                          21086    21173      +87     
Files with missing lines Coverage Δ
src/commands.def 100.00% <ø> (ø)
src/defrag.c 86.58% <100.00%> (ø)
src/evict.c 96.62% <100.00%> (ø)
src/object.c 78.91% <100.00%> (+0.02%) ⬆️
src/redis-cli.c 44.36% <ø> (+0.06%) ⬆️
src/replication.c 86.99% <100.00%> (+0.13%) ⬆️
src/server.c 88.91% <100.00%> (+0.01%) ⬆️
src/server.h 100.00% <ø> (ø)
src/aof.c 81.67% <96.22%> (+0.61%) ⬆️
src/t_hash.c 95.41% <97.09%> (+0.15%) ⬆️
... and 1 more

... and 11 files with indirect coverage changes


tezc added 2 commits February 14, 2025 17:13
This PR adds three new hash commands: HGETDEL, HGETEX and HSETEX. These
commands enable users to perform multiple operations in one step atomically,
e.g. set a hash field and update its TTL with a single command.
Previously, this was only possible by calling the HSET and HEXPIRE commands
consecutively.

- **HGETDEL command**

  ```
  HGETDEL <key> FIELDS <numfields> field [field ...]
  ```
  
  **Description**  
  Get and delete the value of one or more fields of a given hash key
  
  **Reply**  
Array reply: list of the values associated with each field, or nil if the
field doesn't exist.

- **HGETEX command**

  ```
   HGETEX <key>  
[EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT
unix-time-milliseconds | PERSIST]
     FIELDS <numfields> field [field ...]
  ```

  **Description**
Get the value of one or more fields of a given hash key, and optionally
set their expiration

  **Options:**
  EX seconds: Set the specified expiration time, in seconds.
  PX milliseconds: Set the specified expiration time, in milliseconds.
EXAT timestamp-seconds: Set the specified Unix time at which the field
will expire, in seconds.
PXAT timestamp-milliseconds: Set the specified Unix time at which the
field will expire, in milliseconds.
  PERSIST: Remove the time to live associated with the field.

  **Reply** 
Array reply: list of the values associated with each field, or nil if the
field doesn't exist.

- **HSETEX command**

  ```
  HSETEX <key>
     [FNX | FXX]
[EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT
unix-time-milliseconds | KEEPTTL]
     FIELDS <numfields> field value [field value...]
  ```
  **Description**
Set the value of one or more fields of a given hash key, and optionally
set their expiration

  **Options:**
  FNX: Only set the fields if none of them already exist.
  FXX: Only set the fields if all of them already exist.

  EX seconds: Set the specified expiration time, in seconds.
  PX milliseconds: Set the specified expiration time, in milliseconds.
EXAT timestamp-seconds: Set the specified Unix time at which the field
will expire, in seconds.
PXAT timestamp-milliseconds: Set the specified Unix time at which the
field will expire, in milliseconds.
  KEEPTTL: Retain the time to live associated with the field.

  
Note: If no option is provided, any associated expiration time will be
discarded, similar to how the SET command behaves.

  **Reply**
  Integer reply: 0 if no fields were set
  Integer reply: 1 if all the fields were set
Remove DENYOOM flag from hexpire / hexpireat / hpexpire / hpexpireat
commands.

h(p)expire(at) commands may allocate some memory, but not much.
Similarly, we don't have the DENYOOM flag for the EXPIRE command. This change
aligns the EXPIRE and HEXPIRE commands in this respect.
src/module.c Outdated
Comment on lines 13932 to 13933
/* Defrag a RedisModuleDict previously allocated by RM_CreateDict. */
RedisModuleDict *RM_DefragRedisModuleDict(RedisModuleDefragCtx *ctx, RedisModuleDict *dict) {


so we only defrag the dict (rax) nodes' memory, but we don't use this opportunity to give the user callbacks and defrag the data he has there.
i.e. he may need to iterate on the dict himself, something they may already be doing.
in that sense, it's a little bit less efficient, but maybe easier to integrate into existing versions that defrag the values in the dict and are missing a way to defrag the dict memory itself.
on the other hand, since modules didn't really use that defrag feature widely, when they start using our new API they can adapt and abandon the other iteration.

regardless, we may also need to be able to defrag the dict incrementally (imagine one with 2 million elements). even if we defrag just the rax itself, we may need to allow defragging it in steps, and i'm unsure if we can use the long cursor to keep track of the position (it's a key name that can be moved / removed, so we can't keep a pointer, we need to copy the string).


@oranagra oranagra Feb 17, 2025


/* callback called per key in the dict, return value is non-NULL if the input value pointer was moved */
typedef void *(*DictDefragValueCB)(void* value, void *key, size_t keylen);

/* calls the CB per value in the dict.
starts from the seekTo key, or beginning when NULL.
returns the next key to scan, or sets to NULL when done.
returns a new dict if it was re-allocated. (will only be done when seekTo is NULL) */
RedisModuleDict *RM_DefragRedisModuleDict(RedisModuleDefragCtx *ctx, RedisModuleDict *dict,
                                          DictDefragValueCB valueCB,
                                          RedisModuleString *seekTo, RedisModuleString **nextToSeek);


to handle large dictionaries, internally RM_DefragRedisModuleDict will run a loop and call RM_DefragShouldStop to know when to stop.
But since RM_DefragCursorSet only supports an integer cursor, and on the other hand we don't run several defrags in parallel, the module will have to store the last used cursor (key name) in a global, and pass 0 or 1 to RM_DefragCursorSet, just as an indication of whether we're starting a new sweep or resuming a previous one.

i.e. when it gets a non-zero cursor from redis, it uses the previous key in the global for seekTo, and when done returns 0 or 1 depending on the value from nextToSeek.

it's rather ugly, as long as we don't wanna abandon the numeric cursors. and even if we do move to non-numeric cursors, the module may have more state to save other than a single dict's nextToSeek key. e.g. if the module is defragging several dicts (either as globals, or inside a keyspace key), it'll need to remember which dict to resume from as an additional integer besides the key name.

@ephraimfeldblum @MeirShpilraien please take a look and comment.


> it's rather ugly, as long as we don't wanna abandon the numeric cursors. and even if we do move to non-numeric cursors, the module may have more state to save other than a single dict's nextToSeek key. e.g. if the module is defragging several dicts (either as globals, or inside a keyspace key), it'll need to remember which dict to resume from as an additional integer besides the key name.

I believe that the cursor should be changed (maybe via a new API) to a void*, which is basically module private data that will be given back to the module on the next defrag call.

sundb (Owner, Author)


@MeirShpilraien If it's void* then the caller has no way of knowing how to free this cursor, right?


I believe that if a void* cursor was given then Redis needs to call the defrag again. When the module is done it can free the cursor and indicate to Redis that it is done. Am I missing something?


so in our test, we need both an index (dict index / stage) and a string (key name for the rax).
we'll need to allocate a struct on the heap to hold both, give redis a pointer to that struct, and a callback so that it can be freed.
considering we don't need to maintain more than one "cursor", using globals isn't too bad.
it's just the numeric cursor that's ugly (if it were a boolean, it would look nicer).
do you have any other thoughts, or suggestions?


@MeirShpilraien I think your suggestion makes sense, but since we do not really need it now, maybe we can go with the simpler approach. We can always extend it in the future to be more generic. WDYT?


i suppose it'll mean adding another variant of RM_DefragCursorSet, one that takes a string or a pointer with a free callback.
but as far as i can tell, this won't contradict the design of the RM_DefragRedisModuleDict API, it'll remain the same (future compatible 😉).
is that right?

@sundb sundb force-pushed the DefragRedisModuleDictDict branch from 951cefa to 2b2e4e2 Compare February 17, 2025 14:27
sundb and others added 3 commits February 19, 2025 09:31
closes redis#13797, just fixes a syntax
issue in comments rather than real code.
redis#13804)

the `dictGetSignedIntegerVal` function should be used here,
because in some cases (especially on 32-bit systems) long may
be 4 bytes, while the TTL saved in expires is a unix timestamp
in milliseconds, which needs more than 4 bytes. In this case, we may
not be able to get the correct idle time, which may cause eviction
disorder; in other words, keys that should be evicted later may be
evicted earlier.
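For illustration only (not the actual evict.c diff), this is the shape of the access the commit argues for; the surrounding lookup differs in the real code:

```c
/* Read a key's stored millisecond expire time without squeezing it through a
 * 4-byte long, which would truncate the unix timestamp on 32-bit builds. */
static long long getExpireMsSketch(redisDb *db, sds key) {
    dictEntry *de = dictFind(db->expires, key);
    if (de == NULL) return -1;
    return dictGetSignedIntegerVal(de); /* 64-bit even where long is 32-bit */
}
```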
guybe7 and others added 3 commits February 19, 2025 15:04
In case a replica connection was closed mid-RDB, we should not send a \n
to that replica; otherwise, it may reach the replica before it realizes
that the RDB transfer failed, causing it to treat the \n as if it were
read from the RDB stream.
typedef void (*RedisModuleDefragFunc)(RedisModuleDefragCtx *ctx);
typedef int (*RedisModuleDefragFunc2)(RedisModuleDefragCtx *ctx);
typedef void (*RedisModuleUserChangedFunc) (uint64_t client_id, void *privdata);
typedef void *(*RedisModuleDefragDictValueCallback)(void *data, unsigned char *key, size_t keylen);


I'm not sure that we need to pass the key and keylen as parameters to the callback. But there's no harm in them existing for other users.

However I am fairly certain that we do need to add a RedisModuleDefragCtx *ctx argument. Without which, it is impossible to call other defrag functions from within the CB, eg RedisModule_DefragAlloc

sundb (Owner, Author)


> However I am fairly certain that we do need to add a RedisModuleDefragCtx *ctx argument. Without which, it is impossible to call other defrag functions from within the CB, eg RedisModule_DefragAlloc

sure, it's reasonable.

return newdict;
}
}
*seekTo = NULL;


shouldn't *seekTo be freed here?

sundb (Owner, Author)


good catch, not sure why sanitizer wouldn't find it.

sundb and others added 5 commits February 20, 2025 00:05
This PR is based on: valkey-io/valkey#1462

## Issue/Problems

Duty Cycle: Active Defrag has configuration values which determine the
intended percentage of CPU to be used based on a gradient of the
fragmentation percentage. However, Active Defrag performs its work on
the 100ms serverCron timer. It then computes a duty cycle and performs a
single long cycle. For example, if the intended CPU is computed to be
10%, Active Defrag will perform 10ms of work on this 100ms timer cron.

* This type of cycle introduces large latencies on the client (up to
25ms with default configurations)
* This mechanism is subject to starvation when slow commands delay the
serverCron

Maintainability: The current Active Defrag code is difficult to read &
maintain. Refactoring of the high level control mechanisms and functions
will allow us to more seamlessly adapt to new defragmentation needs.
Specific examples include:

* A single function (activeDefragCycle) includes the logic to
start/stop/modify the defragmentation as well as performing one "step"
of the defragmentation. This should be separated out, so that the actual
defrag activity can be performed on an independent timer (see duty cycle
above).
* The code is focused on kvstores, with other actions just thrown in at
the end (defragOtherGlobals). There's no mechanism to break this up to
reduce latencies.
* For the main dictionary (only), there is a mechanism to set aside
large keys to be processed in a later step. However this code creates a
separate list in each kvstore (main dict or not), bleeding/exposing
internal defrag logic. We only need 1 list - inside defrag. This logic
should be more contained for the main key store.
* The structure is not well suited to other non-main-dictionary
items. For example, pub-sub and pub-sub-shard were added, but in
such a way that in CMD mode, with multiple DBs, we will defrag
pub-sub repeatedly after each DB.

## Description of the feature

Primarily, this feature will split activeDefragCycle into 2 functions.

1. One function will be called from serverCron to determine if a defrag
cycle (a complete scan) needs to be started. It will also determine if
the CPU expenditure needs to be adjusted.
2. The 2nd function will be a timer proc dedicated to performing defrag.
This will be invoked independently from serverCron.

Once the functions are split, there is more control over the latency
created by the defrag process. A new configuration will be used to
determine the running time for the defrag timer proc. The default for
this will be 500us (one-half of the current minimum time). Then the
timer will be adjusted to achieve the desired CPU. As an example, 5% of
CPU will run the defrag process for 500us every 10ms. This is much
better than running for 5ms every 100ms.
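Under those assumptions (a fixed ~500us work quantum), the timer period falls out of simple arithmetic; a sketch, not the actual defrag.c code:

```c
/* Choose the defrag timer period so that quantum / period equals the target
 * CPU percentage, e.g. 5% CPU -> 500us of work every 10000us (10ms). */
#define DEFRAG_QUANTUM_US 500
static long long defragTimerPeriodUsSketch(int target_cpu_percent) {
    if (target_cpu_percent <= 0) return -1;   /* defrag effectively disabled */
    if (target_cpu_percent >= 100) return 0;  /* run quanta back to back */
    return DEFRAG_QUANTUM_US * 100LL / target_cpu_percent;
}
```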

The timer function will also adjust to compensate for starvation. If a
slow command delays the timer, the process will run proportionately
longer to ensure that the configured CPU is achieved. Given the presence
of slow commands, the proportional extra time is insignificant to
latency. This also addresses the overload case. At 100% CPU, if the
event loop slows, defrag will run proportionately longer to achieve the
configured CPU utilization.

Optionally, in low CPU situations, there would be little impact in
utilizing more than the configured CPU. We could allow the
timer to pop more often (even with a 0ms delay) and the (tail) latency
impact would not change.

And we add a time limit for the defrag duty cycle to prevent excessive
latency. When latency is already high (indicated by a long time between
calls), we don't want to make it worse by running defrag for too long.

Addressing maintainability:

* The basic code structure can more clearly be organized around a
"cycle".
* Have clear begin/end functions and a set of "stages" to be executed.
* Rather than stages being limited to "kvstore" type data, a cycle
should be more flexible, incorporating the ability to incrementally
perform arbitrary work. This will likely be necessary in the future for
certain module types. It can be used today to address oddballs like
defragOtherGlobals.
* We reduced some of the globals and some of the coupling.
defrag_later should be removed from serverDb.
* Each stage should begin on a fresh cycle. So if there are
non-time-bounded operations like kvstoreDictLUTDefrag, these would be
less likely to introduce additional latency.


---------

Signed-off-by: Jim Brunner [email protected]
Signed-off-by: Madelyn Olson [email protected]
Co-authored-by: Madelyn Olson [email protected]
Co-authored-by: ShooterIT <[email protected]>
…dis#13815)

## Description

Currently, when performing defragmentation on non-key data within the
module, we cannot process the defragmentation incrementally. This
limitation affects the efficiency and flexibility of defragmentation in
certain scenarios.
The primary goal of this PR is to introduce support for incremental
defragmentation of global module data.

## Interface Change
New module API `RegisterDefragFunc2`

This is a more advanced version of `RM_RegisterDefragFunc`, in that it
takes a new callback type (`RedisModuleDefragFunc2`) that has a return value:
the callback can use RM_DefragShouldStop and indicate that it should be called
again later, or that it is done (by returning 0).
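A hedged sketch of how a module might use the new registration; the global `g_blob` and the single-pointer defrag step are illustrative only:

```c
#include "redismodule.h"

static void *g_blob = NULL; /* some hypothetical global allocation */

/* Return non-zero to be called again on a later defrag cycle, 0 when done. */
static int DefragGlobalsStep(RedisModuleDefragCtx *ctx) {
    if (g_blob) {
        void *moved = RedisModule_DefragAlloc(ctx, g_blob);
        if (moved) g_blob = moved;
    }
    if (RedisModule_DefragShouldStop(ctx))
        return 1; /* out of time: resume on the next defrag call */
    return 0;     /* finished this sweep */
}

int RedisModule_OnLoad(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
    REDISMODULE_NOT_USED(argv);
    REDISMODULE_NOT_USED(argc);
    if (RedisModule_Init(ctx, "defragdemo", 1, REDISMODULE_APIVER_1) == REDISMODULE_ERR)
        return REDISMODULE_ERR;
    if (RedisModule_RegisterDefragFunc2(ctx, DefragGlobalsStep) == REDISMODULE_ERR)
        return REDISMODULE_ERR;
    return REDISMODULE_OK;
}
```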

## Note
The `RegisterDefragFunc` API remains available.

---------

Co-authored-by: ShooterIT <[email protected]>
Co-authored-by: oranagra <[email protected]>
Comment on lines 13987 to 13988
if (newdata)
raxSetData(ri.node, ri.data=newdata);


maybe we'd better change the test to actually store data in the dict and defrag it?
it would have meant that we'd realize we're missing a defrag context in the callback.
and maybe we'd discover other issues (changing the rax during iteration) 🤷

sundb (Owner, Author)


> maybe we'd better change the test to actually store data in the dict and defrag it?

In the current test we've filled each dictionary with 10 strings, not sure if that's what you mean?

> and maybe we'd discover other issues (changing the rax during iteration)

do you mean deleting the rax, or removing elements from the rax during a break in defragmentation?

btw, I was wondering if we need to add privdata to the callback: is there any possibility that the data in the rax is also referenced elsewhere in the module, so that when we update it we also need to update the reference elsewhere?


@ephraimfeldblum ephraimfeldblum Feb 24, 2025


> I was wondering if we need to add privdata to the callback

I don't see why not. privdata is not currently needed for the dicts used by TS. But there's no harm in future-proofing this against modules that do need it.


not arguing against it, just suggesting to keep it consistent with other similar APIs (whether they do or don't have privdata).
technically, since this is single threaded, and the CB is called from an API the module calls (not a hook CB that's invoked later), the module can always just store that info in a global.

}

while (raxNext(&ri)) {
void *newdata = valueCB(ctx, ri.data, ri.key, ri.key_len);


Can the CB be made optional?

Some dicts in use are actually used as sets, meaning they have no values. In such a case, there is no need to provide a valueCB to defrag them.

void *newdata = valueCB(ctx, ri.data, ri.key, ri.key_len);
if (newdata)
raxSetData(ri.node, ri.data=newdata);
if (RM_DefragShouldStop(ctx)) {
sundb (Owner, Author)


Suggested change
if (RM_DefragShouldStop(ctx)) {
if (++iterator > 128 && RM_DefragShouldStop(ctx)) {

@oranagra @ephraimfeldblum should we call RM_DefragShouldStop() every N times? i think it's reasonable, reported by @ShooterIT


yes. certainly. don't want to excessively call the ustime function.


Unfortunately, I don't think that will work for me. On the TS side we have dicts of dicts. If a leaf dict should stop, then we cannot afford to let its branch dict continue beyond that point.


so maybe in addition to that, we need to raise some flag (return value) to cause the recursion to exit?
or maybe we should aim for users to call these methods often, but add some mechanism to throttle the rate at which we call mstime, and at the same time cache an exit flag to make it sticky.

maybe it can look at increments of stat_active_defrag_misses and stat_active_defrag_hits, and check the time only once per some 512 pointers?

sundb (Owner, Author)


@oranagra Do you mean that if we reach a leaf and it asks to stop, we just skip all the remaining leaves and stop the entire rax defragmentation?
@ephraimfeldblum Is this what you want? What I understand is that your leaf is also a dict, and you don't want to quit in the middle of defragging that leaf.


no, i mean something like this:

int RM_DefragShouldStop(RedisModuleDefragCtx *ctx) {
    if (ctx->stopping)
        return 1;
    if (!ctx->endtime)
        return 0;
    if (server.stat_active_defrag_hits - ctx->last_stop_check_hits < 512 &&
        server.stat_active_defrag_misses - ctx->last_stop_check_misses < 1024) {
        return 0;
    }
    ctx->last_stop_check_hits = server.stat_active_defrag_hits;
    ctx->last_stop_check_misses = server.stat_active_defrag_misses;
    if (ctx->endtime < ustime())
        ctx->stopping = 1;
    return ctx->stopping;
}


@ephraimfeldblum ephraimfeldblum Feb 25, 2025


@sundb I can pause in the middle of defragging a leaf (though it's unlikely, since the leaves are sets). But if I do, then the parent dict cannot be allowed to continue, it must also pause immediately at that same point.
@oranagra's solution looks good to me.


sundb commented Feb 25, 2025

@oranagra @ephraimfeldblum should we change the key+keylen to a RedisModuleKey, which is more in sync with other APIs?
like

typedef void (*RedisModuleScanCB)(RedisModuleCtx *ctx, RedisModuleString *keyname, RedisModuleKey *key, void *privdata);
typedef void *(*RedisModuleDefragDictValueCallback)(RedisModuleDefragCtx *ctx, void *data, unsigned char *keyname, size_t keylen);

to

typedef void *(*RedisModuleDefragDictValueCallback)(RedisModuleDefragCtx *ctx, void *data, RedisModuleKey *keyname);

@oranagra

@sundb it can't be a RedisModuleKey, this is the key name for the dict.
the one from:

int RM_DictSetC(RedisModuleDict *d, void *key, size_t keylen, void *ptr) {

if we wanted, we could have wrapped it with a RedisModuleString, but considering the majority of the cases won't need that key name, it would be wasteful.


sundb commented Feb 25, 2025

> @sundb it can't be a RedisModuleKey, this is the key name for the dict. the one from:

sorry, copy mistake, let's keep it.

@sundb sundb closed this Mar 28, 2025