.NET 9 Preview 7 includes several new runtime features. We focused on the following areas:
- ARM64 SVE Support
- Post-Indexed Addressing on ARM64
- Strength Reduction in Loops
- Object Stack Allocation for Boxes
- GC Dynamic Adaptation To Application Sizes
Runtime updates in .NET 9 Preview 7:
- Release notes
- What's new in the .NET Runtime in .NET 9 documentation
.NET 9 introduces experimental support for the Scalable Vector Extension (SVE), a SIMD instruction set for ARM64 CPUs. .NET already supports the NEON instruction set, so on NEON-capable hardware, your applications can leverage 128-bit vector registers. SVE supports flexible vector lengths all the way up to 2048 bits, unlocking more data processing per instruction. In .NET 9, System.Numerics.Vector<T> is 128 bits wide when targeting SVE, and future work will enable its width to scale to match the target machine's vector register size. You can accelerate your .NET applications on SVE-capable hardware using the new System.Runtime.Intrinsics.Arm.Sve APIs. To learn more about SVE in .NET, check out dotnet/runtime #93095, and to track which APIs are complete, check out dotnet/runtime #99957.
Note that SVE support in .NET 9 is experimental. The APIs under System.Runtime.Intrinsics.Arm.Sve are marked with ExperimentalAttribute, which means they are subject to change in future releases. Additionally, debugger stepping and breakpoints through SVE-generated code may not function properly, and may result in application crashes or data corruption.
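To give a flavor of the new APIs, here's a sketch of a vectorized sum using the Sve intrinsics, based on the API shape tracked in dotnet/runtime #99957. Since the APIs are experimental, exact names and signatures may change, and real code should guard any use with Sve.IsSupported:

using System.Numerics;
using System.Runtime.Intrinsics.Arm;

// A sketch of summing an array with SVE's predicated loop idiom.
// Assumes the experimental API shape from dotnet/runtime #99957;
// callers should first check Sve.IsSupported.
static unsafe int SveSum(int* src, int length)
{
    Vector<int> total = Vector<int>.Zero;
    int step = (int)Sve.Count32BitElements();
    int i = 0;

    // The predicate mask keeps lanes at or beyond 'length' inactive,
    // so no scalar tail loop is needed.
    Vector<int> mask = Sve.CreateWhileLessThanMask32Bit(i, length);
    while (Sve.TestFirstTrue(Sve.CreateTrueMaskInt32(), mask))
    {
        // Inactive lanes load as zero, so they don't affect the sum.
        total += Sve.LoadVector(mask, src + i);
        i += step;
        mask = Sve.CreateWhileLessThanMask32Bit(i, length);
    }

    // Horizontal add across all lanes.
    return (int)Sve.AddAcross(total)[0];
}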
In .NET code, we frequently use index variables to read sequential regions of memory.
Consider the idiomatic for loop:
static int Sum(int[] nums)
{
    int sum = 0;
    for (int i = 0; i < nums.Length; i++)
    {
        sum += nums[i];
    }
    return sum;
}
For each iteration of the loop, we use i to read an integer in nums, and then increment i. In ARM64 assembly, these two operations look like this:
ldr w0, [x1]
add x1, x1, #4
ldr w0, [x1] loads the integer at the memory address in x1 into w0; this corresponds to the access of nums[i] in the above example. Then, add x1, x1, #4 increases the address in x1 by four bytes, moving to the next integer in nums. This corresponds to the i++ operation executed at the end of each iteration.
ARM64 supports post-indexed addressing, where the "index" register is automatically incremented after its address is used. This means we can combine the above instructions into one, making the loop more efficient: The CPU only needs to decode one instruction instead of two, and the loop's code is now more cache-friendly.
Here's what the updated assembly looks like:
ldr w0, [x1], #0x04
The #0x04 at the end means the address in x1 will be incremented by four bytes after it is used to load an integer into w0. In Preview 7, RyuJIT now uses post-indexed addressing when generating ARM64 code. While x64 does not support post-indexed addressing, Preview 2 introduced induction variable widening, which is similarly useful for optimizing memory accesses with loop index variables.
To learn more about post-indexed addressing support in RyuJIT, check out dotnet/runtime #105181.
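As an aside, if you'd like to reproduce assembly listings like the ones above for methods in your own applications, one option is the DOTNET_JitDisasm environment variable, which tells the runtime to print RyuJIT's generated code for the named methods. For example, something like:

DOTNET_JitDisasm=Sum DOTNET_TieredCompilation=0 dotnet run -c Release

Disabling tiered compilation here ensures you see the fully optimized code right away, rather than the initial unoptimized tier.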
Strength reduction is a compiler optimization where an operation is replaced with a faster, logically equivalent operation. This technique is especially useful for optimizing loops.
Let's revisit the example from above:
static int Sum(int[] nums)
{
    int sum = 0;
    for (int i = 0; i < nums.Length; i++)
    {
        sum += nums[i];
    }
    return sum;
}
Here is a snippet of the code generated for the loop's body, in x64 assembly:
add ecx, dword ptr [rax+4*rdx+0x10]
inc edx
These instructions correspond to the expressions sum += nums[i] and i++, respectively. rcx contains the value of sum, rax contains the base address of nums, and rdx contains the value of i. To compute the address of nums[i], we multiply the index in rdx by four (the size of an integer), and then add this offset to the base address in rax, plus some padding. After we've read the integer at nums[i] and added it to rcx, we increment the index in rdx.
In other words, imagine an array with a dozen integers. Given an index in the range [0, 11], the above algorithm multiplies it by four to determine the integer's offset from the beginning of the array. So to get the first integer, we compute its address with base_address + (4 * 0), and to get the last integer, we compute base_address + (4 * 11). With this approach, each array access requires a multiplication and an addition operation.
Multiplication is more expensive than addition, and replacing the former with the latter is a classic motivation for strength reduction. Imagine if we rewrote our example to access the integers in nums using a pointer, rather than an index variable. With this approach, each memory access wouldn't require us to compute the element's address.
Here's the updated example:
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static int Sum(Span<int> nums)
{
    int sum = 0;
    ref int p = ref MemoryMarshal.GetReference(nums);
    ref int end = ref Unsafe.Add(ref p, nums.Length);
    while (Unsafe.IsAddressLessThan(ref p, ref end))
    {
        sum += p;
        p = ref Unsafe.Add(ref p, 1);
    }
    return sum;
}
The source code is quite a bit more complicated, but it is logically equivalent to our initial implementation, and the assembly looks better:
add ecx, dword ptr [rdx]
add rdx, 4
rcx still holds the value of sum, but rdx now holds the address pointed to by p, so accessing elements in nums just requires us to dereference rdx. All the multiplication and addition from the first example have been replaced by a single add instruction that moves the pointer forward.
In Preview 7, RyuJIT now transforms the first indexing pattern into the second, without requiring you to rewrite any code. To learn more about strength reduction in RyuJIT, check out dotnet/runtime #100913.
Value types, such as int and user-defined structs, are typically allocated on the stack instead of the heap. However, we frequently need to "box" these value types by wrapping them in objects to enable various code patterns.
Consider the following snippet:
static bool Compare(object? x, object? y)
{
    if ((x == null) || (y == null))
    {
        return x == y;
    }
    return x.Equals(y);
}

public static int Main()
{
    bool result = Compare(3, 4);
    return result ? 0 : 100;
}
Compare is conveniently written such that if we wanted to compare other types, like string or double, we could reuse the same implementation. But in this example, it has the performance drawback of requiring us to box any value types we pass to it.
Let's look at the code generated for Main, in x64 assembly:
push rbx
sub rsp, 32
mov rcx, 0x7FFB9F8074D0 ; System.Int32
call CORINFO_HELP_NEWSFAST
mov rbx, rax
mov dword ptr [rbx+0x08], 3
mov rcx, 0x7FFB9F8074D0 ; System.Int32
call CORINFO_HELP_NEWSFAST
mov dword ptr [rax+0x08], 4
add rbx, 8
mov ecx, dword ptr [rbx]
cmp ecx, dword ptr [rax+0x08]
sete al
movzx rax, al
xor ecx, ecx
mov edx, 100
test eax, eax
mov eax, edx
cmovne eax, ecx
add rsp, 32
pop rbx
ret
Notice the calls to CORINFO_HELP_NEWSFAST; those are the heap allocations for the boxed integer arguments. Also notice that there isn't any call to Compare: RyuJIT decided to inline it into Main. This means the boxes never "escape." In other words, throughout the execution of Compare, we know x and y are actually integers, and we can safely unbox them without affecting the comparison logic (System.Int32:Equals just compares the raw integers under the hood, anyway). In Preview 7, RyuJIT now allocates unescaped boxes on the stack, unlocking several other optimizations. In our example, not only does RyuJIT avoid the heap allocations, but it also evaluates the expressions x.Equals(y) and result ? 0 : 100 at compile time.
Here's the updated assembly:
mov eax, 100
ret
To learn more about object stack allocation in RyuJIT, check out dotnet/runtime #104936.
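Of course, you don't have to rely on the JIT proving that the boxes don't escape; a generic overload sidesteps the boxing at the source level entirely. A minimal sketch (the null handling from the original example is omitted for brevity):

// Hypothetical generic alternative: the runtime specializes
// Compare<int> for the value type, and the constrained call to
// Equals means x and y are never boxed.
static bool Compare<T>(T x, T y) where T : IEquatable<T>
{
    return x.Equals(y);
}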
Dynamic Adaptation To Application Sizes (DATAS) is now enabled by default. It aims to adapt to the application's memory requirements, meaning the application heap size should be roughly proportional to the long-lived data size. DATAS was introduced as an opt-in feature in .NET 8 and has been significantly updated and improved in .NET 9. Any .NET 8 DATAS users should move to .NET 9.
The existing Server GC mode takes a different approach than DATAS. It aims to improve throughput and treats the process as the dominant one on the machine. The amount of allocations it allows before triggering the next GC is based on throughput, not application size. So it can grow the heap very aggressively if it needs to and there’s memory available. This can result in very different heap sizes when you run the process on machines with different hardware specs. What some folks have observed is the heap can grow much bigger when they move their process to a machine with many more cores and more memory. Server GC also doesn’t necessarily adjust the heap down aggressively if the workload becomes much lighter.
Because DATAS aims to adapt to the application's memory requirements, if your application is doing the same work, the heap size should be the same or similar when you run your app on machines with different specs. And if your workload becomes lighter or heavier, the heap size is adjusted accordingly.
DATAS helps most with bursty workloads where the heap size should be adjusted according to how demanding the workload is, particularly as demand decreases. This is especially important in memory-constrained environments where it’s important to fit more processes when some processes’ workloads lighten. It also helps with capacity planning.
To achieve this adaptation and still maintain reasonable performance, DATAS does the following:
- It sets the maximum amount of allocations allowed before the next GC is triggered based on the long-lived data size. This helps with constraining the heap size.
- It sets the actual amount of allocations allowed based on throughput.
- It adjusts the number of heaps when needed. It starts with one heap, which means if there are many threads allocating, some will need to wait, and that negatively affects throughput. DATAS will grow and reduce the number of heaps as needed. In this way, it is a hybrid between the existing GC modes, capable of using as few as one heap (like workstation GC) and as many as the machine's core count (like server GC).
- When needed, it does full compacting GCs to prevent fragmentation from getting too high, which also helps with constraining the heap size.
Here are some benchmark results for the TechEmpower JSON and Fortunes benchmarks. Notice the significant reduction in working set when running the benchmarks on a 48-core Linux machine. Max throughput (measured in requests per second) shows about a 2-3% reduction, but with a working set improvement of 80%+.
With DATAS enabled, the number of Gen0 and Gen1 GCs is significantly higher.
Full .NET TechEmpower benchmark results are published here.
If you notice a reduction in throughput, DATAS can be disabled in any of the following ways:
- Setting the DOTNET_GCDynamicAdaptationMode environment variable to 0
- Setting System.GC.DynamicAdaptationMode to 0 in runtimeconfig.json
- Setting the GarbageCollectionAdaptationMode MSBuild property to 0