Conversation

@AStepanov25 AStepanov25 commented Jun 24, 2025

Profiling results: time is the average over n iterations; memory is the cumulative garbage generated by calling the corresponding method.

More details: https://github.com/research-ag/canister-profiling/tree/list

n = 100000

Time:

| method | List | Refactored |
|---|---:|---:|
| get | 205 | 205 |
| getOpt | 253 | 246 |
| put | 253 | 225 |
| forEach | 106 | 105 |
| reverseForEach | 133 | 112 |
| find | 196 | 127 |
| findIndex | 163 | 127 |
| findLastIndex | 203 | 134 |
| all | 175 | 122 |
| any | 163 | 127 |
| repeat | 15 | 15 |
| addRepeat | 16 | 16 |
| fromArray | 164 | 156 |
| fromVarArray | 164 | 156 |
| toArray | 155 | 155 |
| toVarArray | 223 | 167 |
| toText | 446 | 322 |
| map | 169 | 152 |
| clone | 188 | 116 |
| min | 183 | 166 |
| max | 183 | 166 |
| size | 127 | 127 |

Memory:

| method | List | Refactored |
|---|---:|---:|
| get | 0 | 0 |
| getOpt | 0 | 0 |
| put | 0 | 0 |
| forEach | 8 | 0 |
| reverseForEach | 0 | 0 |
| find | 172 | 16 |
| findIndex | 8 | 0 |
| findLastIndex | 0 | 0 |
| all | 24 | 0 |
| any | 8 | 0 |
| repeat | 408688 | 408688 |
| addRepeat | 406544 | 406544 |
| fromArray | 408716 | 408688 |
| fromVarArray | 408716 | 408688 |
| toArray | 400180 | 400084 |
| toVarArray | 400180 | 400008 |
| toText | 3200164 | 3199992 |
| map | 425036 | 408688 |
| clone | 425032 | 409672 |
| min | 36 | 0 |
| max | 36 | 0 |
| size | 0 | 0 |

@AStepanov25 AStepanov25 requested a review from a team as a code owner June 24, 2025 14:57
```motoko
while (i < blocksCount) {
  let oldBlock = list.blocks[i];
  let blockSize = oldBlock.size();
  let newBlock = VarArray.repeat<?R>(null, blockSize);
```

How would it perform if we used the VarArray.map function on the data blocks?

Contributor Author

First, we have an array of options, and if we encounter `null` we return early; this early return can speed the whole method up a little.

Second, `map` uses `tabulate`, which uses closures, which can be less efficient and can generate garbage.
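
A minimal sketch of the loop shape described here (the helper name and exact structure are illustrative, not the PR's code): copy elements one by one and stop at the first `null`, since a data block is filled from the left.

```motoko
import VarArray "mo:core/VarArray";

// Hypothetical helper illustrating the early-return copy described above.
// A data block is filled from the left, so the first `null` means the
// rest of the block is empty and copying can stop.
func copyBlock<R>(oldBlock : [var ?R]) : [var ?R] {
  let blockSize = oldBlock.size();
  let newBlock = VarArray.repeat<?R>(null, blockSize);
  var j = 0;
  label scan while (j < blockSize) {
    switch (oldBlock[j]) {
      case null { break scan }; // early exit instead of copying nulls
      case (?x) { newBlock[j] := ?x }
    };
    j += 1
  };
  newBlock
};
```

In contrast, `VarArray.map` would visit every slot and allocate a closure for the mapping function, which is the garbage referred to above.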

Comment on lines +635 to +644
```motoko
let (a, b) = do {
  let i = Nat32.fromNat(index);
  let lz = Nat32.bitcountLeadingZero(i);
  let lz2 = lz >> 1;
  if (lz & 1 == 0) {
    (Nat32.toNat(((i << lz2) >> 16) ^ (0x10000 >> lz2)), Nat32.toNat(i & (0xFFFF >> lz2)))
  } else {
    (Nat32.toNat(((i << lz2) >> 15) ^ (0x18000 >> lz2)), Nat32.toNat(i & (0x7FFF >> lz2)))
  }
};
```

Not sure if this inlining is worth it. We're only saving a function call, right?

If the gain is so small then readability is probably worth more.

Contributor Author

`put` and `getOpt`, along with `get`, are the most used methods of List, so they should be optimized at any cost. In any case, there is already a lot of code duplication there.
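
For reference, the factored-out form of the computation being discussed would look roughly like this (a sketch; the body is the same bit trick as in the diff):

```motoko
import Nat32 "mo:core/Nat32";

// Sketch of a standalone `locate` helper, i.e. the non-inlined version of
// the bit trick in the diff: it maps an element index to a pair
// (data block index, index within that block).
func locate(index : Nat) : (Nat, Nat) {
  let i = Nat32.fromNat(index);
  let lz = Nat32.bitcountLeadingZero(i);
  let lz2 = lz >> 1;
  if (lz & 1 == 0) {
    (Nat32.toNat(((i << lz2) >> 16) ^ (0x10000 >> lz2)), Nat32.toNat(i & (0xFFFF >> lz2)))
  } else {
    (Nat32.toNat(((i << lz2) >> 15) ^ (0x18000 >> lz2)), Nat32.toNat(i & (0x7FFF >> lz2)))
  }
};
```

For example, this maps index 0 to (1, 0), index 3 to (3, 1), and index 5 to (4, 1). Inlining the body into `get`, `getOpt`, and `put` saves one function call per access, at the cost of duplicating these lines.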

src/List.mo Outdated
Comment on lines 663 to 672
```motoko
let (a, b) = do {
  let i = Nat32.fromNat(index);
  let lz = Nat32.bitcountLeadingZero(i);
  let lz2 = lz >> 1;
  if (lz & 1 == 0) {
    (Nat32.toNat(((i << lz2) >> 16) ^ (0x10000 >> lz2)), Nat32.toNat(i & (0xFFFF >> lz2)))
  } else {
    (Nat32.toNat(((i << lz2) >> 15) ^ (0x18000 >> lz2)), Nat32.toNat(i & (0x7FFF >> lz2)))
  }
};
```

same as above

src/List.mo Outdated
Comment on lines 917 to 921
```motoko
if (predicate(x)) return ?size<T>({
  var blocks = [var];
  var blockIndex = blockIndex;
  var elementIndex = elementIndex
})
```

Is it worth defining a size_ (internal) function that takes blockIndex, elementIndex as arguments and that the public size() can use?

Contributor Author

Done.

Comment on lines +1861 to +1863
```motoko
let blocks1 = list1.blocks;
let blocks2 = list2.blocks;
let blockCount = Nat.min(blocks1.size(), blocks2.size());
```

Is this part worth it? The size calculation only happens once for the whole list. Cost does not depend on length of inputs.

Contributor Author

It's not clear what you mean. Index block sizes can differ even when the sizes of the lists are equal.

Comment on lines +63 to +67
```motoko
public func singleton<T>(element : T) : List<T> = {
  var blockIndex = 2;
  var blocks = [var [var], [var ?element]];
  var elementIndex = 0
};
```

Not sure if worth the optimization. Does not depend on length of list.

Contributor Author

There can be initializations in a loop, for example if the data structure is `List<List<T>>`.
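
A sketch of that situation (names taken from the core `List` API; the example itself is hypothetical, not code from the PR):

```motoko
import List "mo:core/List";

// Hypothetical example: building a List of Lists calls `singleton` once
// per inner list, so its constant cost is paid n times overall.
func buildNested(n : Nat) : List.List<List.List<Nat>> {
  let outer = List.empty<List.List<Nat>>();
  var k = 0;
  while (k < n) {
    List.add(outer, List.singleton<Nat>(k));
    k += 1
  };
  outer
};
```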


@timohanke timohanke left a comment

Left some questions.

Great PR which offers some substantial optimizations.

Generally, I think optimizations are worth it whenever we walk through an entire list, which is the case for most of the changed functions. However, in some (few) cases we are only optimizing something that does not get called repeatedly; that is, the optimization does not depend on the length of the list. In those cases I would not do it if it makes the code harder to read.

I found some cases of this where the locate() or size() function was inlined, and also in the singleton function. Maybe others that I overlooked?

@timohanke

Not all improved functions are visible in the profiling table. For those, it is hard to tell from the code diff alone how big the improvement is and whether it is worth the inlining or not.

@AStepanov25
Contributor Author

> Not all improved functions are visible in the profiling table. For those, it is hard to tell from the code diff alone how big the improvement is and whether it is worth the inlining or not.

I profiled only the required functions; for the others the performance increase is obvious, as the code is similar to the other functions' implementations.

Which functions are you interested in? I'll add profiling if needed.

@timohanke

Ready to merge now from my perspective.

timohanke previously approved these changes Sep 29, 2025
@github-actions github-actions bot dismissed timohanke’s stale review October 4, 2025 14:07

Review dismissed by automation script.

@Andrei1998
Contributor

Very cool to see this change, thank you! I just ran the benchmarks for the new PriorityQueue (which heavily depends on List), and the present change brings consistent improvements in instruction counts in the -4% to -8% range (the benchmarks from bench/PriorityQueues.bench.mo are below). There is only a very small regression in Garbage Collection.

Instructions (Operations)

| benchmark | A) PriorityQueue (Old List) | A) PriorityQueue (New List) | Δ% | B) PriorityQueueSet |
|---|---:|---:|---:|---:|
| 1.) 100000 operations (push:pop = 1:1) | 597_528_283 | 568_913_057 | -4.8% | 522_729_861 |
| 2.) 100000 operations (push:pop = 2:1) | 742_952_999 | 707_495_424 | -4.8% | 809_693_415 |
| 3.) 100000 operations (push:pop = 10:1) | 357_911_737 | 336_409_578 | -6.0% | 873_181_028 |
| 4.) 100000 operations (only push) | 192_422_882 | 176_982_954 | -8.0% | 886_824_792 |
| 5.) 50000 pushes, then 50000 pops | 776_632_572 | 745_226_615 | -4.0% | 961_776_534 |
| 6.) 50000 pushes, then 25000 "pop;push"es | 529_475_053 | 504_254_228 | -4.8% | 922_137_111 |

Heap (likely broken at the moment)

| benchmark | A) PriorityQueue (Old List) | A) PriorityQueue (New List) | Δ% | B) PriorityQueueSet |
|---|---:|---:|---:|---:|
| 1.) 100000 operations (push:pop = 1:1) | 272 B | 272 B | 0% | 272 B |
| 2.) 100000 operations (push:pop = 2:1) | 272 B | 272 B | 0% | 272 B |
| 3.) 100000 operations (push:pop = 10:1) | 272 B | 272 B | 0% | 272 B |
| 4.) 100000 operations (only push) | 272 B | 272 B | 0% | 272 B |
| 5.) 50000 pushes, then 50000 pops | 272 B | 272 B | 0% | 272 B |
| 6.) 50000 pushes, then 25000 "pop;push"es | 272 B | 272 B | 0% | 272 B |

Garbage Collection

| benchmark | A) PriorityQueue (Old List) | A) PriorityQueue (New List) | Δ% | B) PriorityQueueSet |
|---|---:|---:|---:|---:|
| 1.) 100000 operations (push:pop = 1:1) | 15.03 MiB | 15.07 MiB | +0.3% | 17.43 MiB |
| 2.) 100000 operations (push:pop = 2:1) | 19.73 MiB | 19.73 MiB | 0% | 19.32 MiB |
| 3.) 100000 operations (push:pop = 10:1) | 8.67 MiB | 8.67 MiB | 0% | 12.64 MiB |
| 4.) 100000 operations (only push) | 3.87 MiB | 3.87 MiB | 0% | 9.96 MiB |
| 5.) 50000 pushes, then 50000 pops | 22.03 MiB | 22.03 MiB | 0% | 26.20 MiB |
| 6.) 50000 pushes, then 25000 "pop;push"es | 14.22 MiB | 14.22 MiB | 0% | 18.44 MiB |

@timohanke

Which List operations are used in the test for which you observed the garbage increase?

@Andrei1998
Contributor

Andrei1998 commented Oct 6, 2025

We use the same operations in all benchmarks except number 4.) [which only uses PriorityQueue.push, all others do PriorityQueue.pop as well]:

  • PriorityQueue.push uses List.add, List.size, List.put, List.at.
  • PriorityQueue.pop uses List.removeLast, List.size, List.isEmpty, List.put, List.at, List.get.

The list always starts empty as follows:

```motoko
let priorityQueue = PriorityQueue.empty<Nat>();
```

Yet interestingly, the anomaly is for number 1.) [not number 4.)].

Number 1.) basically consists of a random sequence of PriorityQueue.push and PriorityQueue.pop, with an equal probability of each entry being a push or a pop. The same anomaly does not show for 2.), where push is twice as likely as pop. The length of the underlying List in 1.) and 2.) is a $\pm 1$ random walk, with the $+1$ and $-1$ probabilities being $0.5/0.5$ or $0.66/0.33$, respectively. Number 3.) does the same thing, just with an even higher probability of increasing the length of the list. My current hypothesis is that what is special about 1.) is that it grows and shrinks the List many times, in contrast to the others, which mostly grow it, or grow and shrink it only once. Hence, this could point to something like a memory leak (maybe not exactly, but something in that direction, i.e., reallocation related).

@timohanke

The garbage increased by 0.4 bytes per operation. I wonder if the garbage increase can actually mean that the code got better (for example, freeing more after a pop)?

How often do you think the random walk hits length 0? I also wonder if it can be related to how much is freed when reaching the empty list again.

@Andrei1998
Contributor

Andrei1998 commented Oct 6, 2025

> The garbage increased by 0.4 bytes per operation. I wonder if the garbage increase can actually mean that the code got better (for example, freeing more after a pop)?

This could also be the case, indeed. It could be that we used to have extra data lying around for no good reason and we are now deallocating it more eagerly. Alternatively, it could be that we now also allocate more data, and hence also need to free more. It might also be a bit of both. Technically, if we knew after a pop that the queue was going to grow again (which in this benchmark happens a lot), it would be better not to deallocate memory eagerly, because the queue will grow again later. So, in general, it's a tradeoff: deallocate too eagerly and then reallocate later when the queue grows again, or deallocate less aggressively and risk that no more operations come in the future (and hence waste space long-term). So, a good working hypothesis is that we now deallocate more, and hence also need to reallocate more when the queue grows back.

> How often do you think the random walk hits length 0? I also wonder if it can be related to how much is freed when reaching the empty list again.

Digging into this further might take more effort, but at least a quick and dirty experiment, written with the help of LLMs, shows that we should hit 0 around 500 times (the LLM computed around 504 when asked to do the math directly). Some of those 500 were probably already close to 0, but the expected maximum length of the list during the test should grow roughly with the square root of the number of operations. Of course, I could have just run the test with the fixed seed from the benchmark, but this should at least shed some preliminary light on the behavior.
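
As a back-of-the-envelope check (my own estimate, assuming a symmetric $\pm 1$ walk where a pop on an empty queue leaves the length at 0), the expected number of returns to the origin over $n$ steps is

$$\sum_{k=1}^{n/2} P(S_{2k} = 0) = \sum_{k=1}^{n/2} \binom{2k}{k} 2^{-2k} \approx \sum_{k=1}^{n/2} \frac{1}{\sqrt{\pi k}} \approx \sqrt{\frac{2n}{\pi}} \approx 252 \quad \text{for } n = 100000.$$

Since the walk is lazy at 0 (from length 0, a pop keeps it at 0 with probability $0.5$), each return stretches to about 2 consecutive zero-steps on average, giving roughly $2 \cdot 252 \approx 504$ hits, consistent with the number above.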

rvanasa previously approved these changes Oct 13, 2025
Collaborator

@rvanasa rvanasa left a comment

Is there any remaining work for this PR? Happy to merge once ready.

@github-actions github-actions bot dismissed rvanasa’s stale review October 18, 2025 14:10

Review dismissed by automation script.
