
Expose a malloca API that either stackallocs or creates an array. #52065

Open
tannergooding opened this issue Apr 29, 2021 · 78 comments
Labels: api-suggestion (Early API idea and discussion, it is NOT ready for implementation), area-System.Memory
Milestone: none

Comments

@tannergooding
Member

tannergooding commented Apr 29, 2021

Background and Motivation

It is not uncommon, in performance-oriented code, to want to stackalloc for small/short-lived collections. However, the exact size is not always well known, in which case you want to fall back to creating an array instead.

Proposed API

namespace System.Runtime.CompilerServices
{
    public static unsafe partial class Unsafe
    {
        public static Span<T> Stackalloc<T>(int length);
        public static Span<T> StackallocOrCreateArray<T>(int length);
        public static Span<T> StackallocOrCreateArray<T>(int length, int maxStackallocLength);
    }
}

These APIs would be intrinsic to the JIT and would effectively be implemented as the following, except specially inlined into the function so the localloc scope is that of the calling method:

public static Span<T> StackallocOrCreateArray<T>(int length, int maxStackallocLength)
{
    return ((sizeof(T) * length) < maxStackallocLength) ? stackalloc T[length] : new T[length];
}

The variant that doesn't take maxStackallocLength would use some implementation-defined default. Windows currently uses 1024 bytes.

Any T would be allowed and the JIT would simply do new T[length] for any types that cannot be stack allocated (reference types).
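For illustration, a call site under this proposal might look like the following sketch (API names as proposed above; nothing here is final):

static string ReverseToString(ReadOnlySpan<char> text)
{
    // Stack allocated when the input is small, heap-backed otherwise; the caller
    // just sees a Span<char> and never has to write the threshold check itself.
    Span<char> buffer = Unsafe.StackallocOrCreateArray<char>(text.Length);

    for (int i = 0; i < text.Length; i++)
    {
        buffer[i] = text[text.Length - 1 - i];
    }

    return buffer.ToString();
}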

@tannergooding added the api-suggestion (Early API idea and discussion, it is NOT ready for implementation) label Apr 29, 2021
@dotnet-issue-labeler bot added the area-System.Memory and untriaged (New issue has not been triaged by the area owner) labels Apr 29, 2021
@ghost

ghost commented Apr 29, 2021

Tagging subscribers to this area: @GrabYourPitchforks, @carlossanlop
See info in area-owners.md if you want to be subscribed.


@tannergooding
Member Author

This issue came up on Twitter again (https://twitter.com/jaredpar/status/1387798562117873678?s=20) and we have valid use cases in the framework and compiler.

This has been somewhat stuck in limbo, with the runtime/framework saying "we need language support first" and the language saying "we need the runtime/framework to commit to doing this first".

We should review and approve this to unblock the language from committing to their work; we can then do all the appropriate implementation/prep work on the runtime side without actually making the API public until the language feature is available.

CC. @jkotas, @jaredpar, @GrabYourPitchforks

@tannergooding
Member Author

There would not be an API which opts to use the ArrayPool today. We don't use the ArrayPool for any type, but rather only for specific sets of types. An API which does use some ArrayPool would likely need to return some other type which indicates whether the buffer needs to be returned to the pool.

Stackalloc was added at the request of Jared who gave the following reasoning:

stackalloc

  • returns a pointer hence no var
  • works only where T : unmanaged

Yes, we could use a runtime feature flag to let the compiler remove the restriction on where T : unmanaged. But that doesn't fix the var issue. At the point we're adding new APIs for stack alloc, it seems better to simplify and use them for all forms of stack alloc. It makes the code more consistent and lets you have the same type of call sites (you can flip between forms without also having to flip var).

@GrabYourPitchforks
Member

GrabYourPitchforks commented Apr 29, 2021

Related to #25423. That proposal is a bit light on concrete APIs, but it suggests behaviors / analyzers / other ecosystem goodness we'd likely want to have around this construct.

@jaredpar
Member

This does require language changes to work correctly, but the implementation is very straightforward. The compiler will just treat all of these calls as if they are not safe to escape from the calling method. Effectively it would have the same lifetime limitation as calling stackalloc today.

I think the best approach is to just have the compiler trigger on the FQN of the method. Essentially any API with this signature in any assembly would be treated this way. That would make it easier to write code that multi-targets between .NET Core and .NET Framework as the framework side of this could be implemented as new T[length] in all cases.
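For example, the downlevel side could be as trivial as this sketch (helper name hypothetical; per the FQN idea above, a real shim would need to carry the matching fully-qualified signature):

internal static class StackallocShim
{
    // Always heap-allocates on downlevel targets, but keeps call sites identical
    // to the .NET Core intrinsic version when multi-targeting.
    public static Span<T> StackallocOrCreateArray<T>(int length) => new T[length];

    public static Span<T> StackallocOrCreateArray<T>(int length, int maxStackallocLength) => new T[length];
}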

The other advantage of this API is that we can once again var all the things.

var local1 = stackalloc int[42]; // int*
var local2 = Unsafe.Stackalloc<int>(42); // Span<int>

@jkotas
Member

jkotas commented Apr 29, 2021

This is one of those features that requires a joint work from all runtimes/JIT/language/libraries. Our .NET 6 budget for features in this space was taken by the generic numerics. We should include this proposal next time we do planning in this area.

Approving this API without the resource commitment won't achieve much.

@tannergooding
Member Author

tannergooding commented Apr 29, 2021

Approving this API without the resource commitment won't achieve much.

It gives us a surface on which this can be implemented given "free time" and can be prioritized appropriately.

The library work is approving the API and exposing the surface area.
The language work is recognizing these methods and, based on Jared's comment, is relatively trivial.

The JIT work should just be implementing it as a recursive named intrinsic and then creating the relevant nodes for:

if ((sizeof(T) * length) < maxStackallocLength)
{
    var x = stackalloc T[length];
    return new Span<T>(x, length);
}
else   
{
    var x = new T[length];
    return new Span<T>(x);
}

This is fairly straightforward, except for the newobj calls, which involve getting the relevant method tokens.

@jkotas
Member

jkotas commented Apr 29, 2021

The JIT work should just be implementing it as a recursive named intrinsic and then creating the relevant nodes for:

I do not think we would want to do a naive implementation like this. I think we would want to do explicit lifetime tracking even when the length is over the threshold.

@tannergooding
Member Author

I think we would want to do explicit lifetime tracking even when the length is over the threshold.

What's the scenario where the JIT needs to do additional tracking that isn't already covered by the language rules and by the existing tracking for Span<T>?

Users can and already do write the above today, just manually inlined. We are looking at doing exactly this already in one of the BigInteger PRs.
This is an API on Unsafe that exists to help simplify the logic and behave like alloca does in C/C++; it can immensely simplify/unblock scenarios even with the trivial implementation.
It then becomes a drop-in replacement for what users are already doing.

@jkotas
Member

jkotas commented Apr 29, 2021

What's the scenario where the JIT needs to do additional tracking that isn't already covered by the language rules and by the existing tracking for Span?

We would be leaving performance on the table.

The majority of the existing stackalloc uses are using ArrayPool as the fallback. If the new API is not using pooled memory as the fallback, the majority of the existing stackalloc sites won't be able to use it.

@jaredpar
Member

It also could be a common pattern to use stackalloc or ArrayPool, e.g.:

That requires a different level of language support. Supporting the non-arraypool case is very straightforward. It's just generalizing the existing lifetime restrictions we associate with stackalloc to instead apply to a collection of API calls. It's closer to a bug fix level of work than a new feature.

The ArrayPool case is very doable, but it's definitely in the "new feature" category because we have to do the work to handle Free and design some cases around it. Example: do we make locals initialized with these APIs unassignable? If we allow reassignment then we have to consider how that impacts calling Free with the resulting value. Solvable items, but definitely a bit of design work that needs to go into it.
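For example, a sketch of the reassignment question (Unsafe.StackallocOrRent and otherSpan are hypothetical placeholders):

Span<byte> buffer = Unsafe.StackallocOrRent<byte>(len);  // hypothetical pooled variant
// ... use buffer ...
buffer = otherSpan;  // if reassignment is allowed, which value should the
                     // compiler-generated Free/Return see at the end of the scope?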

@tannergooding
Member Author

That really sounds like an additional ask, and one that isn't strictly needed at the same time.

Pooling has a lot of different considerations and we ourselves largely only use it with a few primitive types (namely byte), not any T. It likewise requires:

  • a way to get the array from a Span<T>
  • knowing that Span<T> definitely points to the start of an Array
  • may involve custom pooling and not just ArrayPool
  • etc

I think it's doable, but we could also unblock many scenarios with the above today and with minimal work.

@jkotas
Member

jkotas commented Apr 29, 2021

The ArrayPool case is very doable, but it's definitely in the "new feature" category because we have to do the work to handle Free and design some cases around it.

I do not think we would necessarily want to deal with the pooling in Roslyn, nor have it backed by the ArrayPool as it exists today.

@jkotas
Member

jkotas commented Apr 29, 2021

we could also unblock many scenarios with the above today and with minimal work.

I do not see those scenarios. The minimal work just lets you do the same thing as what you can do with stackalloc today, just maybe saves you a few characters.

@tannergooding
Member Author

I do not see those scenarios

They exist everywhere that alloca is used in native code. They exist in 3rd-party libraries like ImageSharp.
They exist for types where array pooling isn't desirable because pooling has its own overhead and costs (and generally a cost per type).

None of the existing proposals or discussions around this, including #25423 which has been around for 3 years, have really covered pooling as that is considered a more advanced scenario.

This covers the case of "I want to allocate on the stack for small data and on the heap for larger data", where the limit for that might vary between platforms and architectures. Windows, for example, has a 1MB stack by default and uses a 1024-byte threshold. Linux uses a 4MB stack and might want a different limit.

Encountering large lengths is typically expected to be rare, but not impossible. It's not unreasonable to simply new up an unpooled array in that scenario.

@tannergooding
Member Author

Pooling, for example, is likely only beneficial for types like byte, char, or int, which are (in the vast majority of cases) the only usages in the BCL: https://source.dot.net/#System.Private.CoreLib/ArrayPool.cs,704a680ba600a2a4,references

@EgorBo
Member

EgorBo commented Apr 29, 2021

stackalloc + new:

    Span<byte> span = Unsafe.StackallocOrCreateArray<byte>(len, 1024);
    // vs 
    Span<byte> span = len > 1024 ? new byte[len] : stackalloc byte[1024];

Indeed just saves a few characters (but nice to have).

But stackalloc + arraypool should save a lot 🙂 :

    byte[] arrayFromPool = null;
    Span<byte> span = len > 1024 ? (arrayFromPool = ArrayPool<byte>.Shared.Rent(len)) : stackalloc byte[1024];
    try
    {
    }
    finally
    {
        if (arrayFromPool != null)
            ArrayPool<byte>.Shared.Return(arrayFromPool);
    }

    // vs 
    Span<byte> span = Unsafe.StackallocOrPool<byte>(len, 1024);

@jaredpar
Member

Indeed just saves a few characters (but nice to have).

Has a couple of other benefits:

  1. Path forward for supporting unmanaged types in stackalloc, particularly for code that needs to cross compile between .NET Core and Framework
  2. Supports var

@jaredpar
Member

But stackalloc + arraypool should save a lot

I'm now seeing conflicting advice on whether or not arrays should be returned to the pool in a finally. Others have suggested that the finally is too much overhead and it's best to just let the array leak in the case of an exception.

@gfoidl
Member

gfoidl commented Apr 29, 2021

@EgorBo your example with the pool would save even more when the Span is sliced to the desired length (as is often needed when the length is given as an argument).
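Concretely, the pooled example with the slice added would look something like this sketch (the 1024 threshold is illustrative):

    byte[] arrayFromPool = null;
    Span<byte> span = len > 1024
        ? (arrayFromPool = ArrayPool<byte>.Shared.Rent(len))
        : stackalloc byte[1024];
    span = span.Slice(0, len);  // callees only ever see exactly 'len' elements
    try
    {
        // ... use span ...
    }
    finally
    {
        if (arrayFromPool != null)
            ArrayPool<byte>.Shared.Return(arrayFromPool);
    }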

@jkotas
Member

jkotas commented Apr 29, 2021

They exist for types where array pooling isn't desirable because pooling has its own overhead and costs (and generally cost per type).

This is due to current array pool design limitations. This is fixable by treating management of explicit-lifetime memory as a core runtime feature.

I'm now seeing conflicting advice on whether or not arrays should be returned to the pool in a finally

This depends on how performance sensitive your code is and how frequently you expect exceptions to occur inside the scope. If your code is perf critical (e.g. number formatting) and you do not expect exceptions to ever occur inside the scope (e.g. the only exception you ever expect is out of memory), it is better to avoid the finally; that is the common approach in the dotnet/runtime libraries.

@tannergooding
Member Author

This is due to current array pool design limitations. This is fixable by treating management of explicit-lifetime memory as a core runtime feature.

That also sounds like a feature that is potentially several releases out and which is going to require users and the compiler to review where it is good/correct to use.

Something like proposed here is usable in the interim, including for cases like params Span<T>. It represents something that many languages do provide and which is part of the "standard" set of memory allocation APIs commonly exposed by languages.
It likewise fills a gap for languages that don't have unsafe support or which don't have an implicit conversion to span, such as F#: fsharp/fslang-suggestions#720

Having to write Span<byte> span = len > 1024 ? new byte[len] : stackalloc byte[1024]; everywhere, and then go and find/update every callsite if you end up changing the behavior or size limit, isn't great.
Having an API is good for the same reason all helper/utility methods are good: it helps with refactorings, maintainability, finding usages of the pattern, etc. It also allows the behavior to easily be customized for different stack sizes, at least for Unix vs Windows and possibly macOS or ARM or 32-bit vs 64-bit.
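As a sketch of what gets hand-rolled today in the absence of such an API (the helper and threshold values below are hypothetical and illustrative):

internal static class SpanHelpers
{
    // Centralized, platform-tunable cutoff; the numbers are illustrative only.
    internal static int StackallocThreshold => OperatingSystem.IsWindows() ? 1024 : 2048;
}

// Call site: the stackalloc still has to be written by the caller so the memory lives
// in the caller's frame, which is exactly the friction the proposed API would remove.
Span<byte> buffer = len <= SpanHelpers.StackallocThreshold
    ? stackalloc byte[len]
    : new byte[len];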

@xoofx
Member

xoofx commented Apr 29, 2021

What about making the allocator not necessarily bound to new byte[len] or ArrayPool<byte>.Shared.Rent(len) (e.g. it could come from an unmanaged memory pool)?

namespace System.Runtime.CompilerServices
{
    public static unsafe partial class Unsafe
    {
        public static Span<T> Stackalloc<TAllocator, T>(int length, TAllocator allocator) 
                                                 where TAllocator: ISpanAllocator<T>
        // ...
    }
    
    public interface ISpanAllocator<T> {
         Span<T> Allocate(int length);   
    }
}

@benaadams
Member

Any T would be allowed and the JIT would simply do new T[length] for any types that cannot be stack allocated (reference types).

Could also allocate a series of ref fields (all null) and then allow indexing them via Span.

@tannergooding
Member Author

tannergooding commented Apr 29, 2021

What about making the allocator not necessarily bound to new byte[len] or ArrayPool.Shared.Rent(len)

I think any API that isn't tracking either T* or T[] would need to return something like DisposableSpan<T> so the appropriate free could occur (or would need language support for the relevant TAllocator.Free to be called).

Otherwise, I think it falls into the general camp of what it seems @jkotas is proposing with runtime supported lifetime tracking.

@xoofx
Member

xoofx commented Apr 29, 2021

I think any API that isn't tracking either T* or T[] would need to return something like DisposableSpan<T> so the appropriate free could occur (or would need language support for the relevant TAllocator.Free to be called).

Oh, yeah true, Let's do it! 😅

@xoofx
Member

xoofx commented Apr 29, 2021

That starts to be as painful as implementing IEnumerable<T> 🙃

namespace System.Runtime.CompilerServices
{
    public static unsafe partial class Unsafe
    {
        public static TState Stackalloc<TAllocator, TState, T>(int length, TAllocator allocator, out Span<T> span) 
                                                 where TAllocator: ISpanAllocator<T, TState>
        // ...
    }
    
    public interface ISpanAllocator<T, TState> {
        Span<T> Allocate(int length, out TState state);   
        void Release(TState state);
    }
}

[Edit] Removed where TState : IDisposable as we already have ISpanAllocator.Release
[Edit2] Argh, actually, maybe it was better with the IDisposable. I told you, it's more painful than IEnumerable

@colejohnson66

Except a language-level translation won't suffice because stackallocs don't always have a compile-time size. stackalloc T[count] is a common thing. Why can't the runtime just remove that restriction?

@tannergooding
Member Author

stackalloc T[count] is a common thing

It's not that common, in part because it is very expensive and often slower than simply new T[count]. stackalloc comes with many additional considerations: it needs stack overflow checks, more expensive zeroing, buffer overrun protection, and more, due to the potential security issues involved.

It is then "best practice" to keep stack allocations small (all stackallocs for a single method should typically add up to not more than 1024 bytes) and to never make them "dynamic" in length (instead rounding up to the largest buffer size).

This guidance is true even in native code (C, C++, assembly, etc) and not following it can in some cases interfere with or break internal CPU optimizations (such as mirroring stack spills to the register file).
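Concretely, the recommended shape today looks something like this sketch (the 256-element budget is illustrative):

const int MaxStackChars = 256;  // fixed upper bound, not a dynamic length

// A constant-size stackalloc keeps the frame size predictable; larger inputs fall
// back to the heap, and the final slice hides the difference from callees.
Span<char> buffer = count <= MaxStackChars
    ? stackalloc char[MaxStackChars]
    : new char[count];
buffer = buffer.Slice(0, count);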

@Aniobodo

It is then "best practice" to keep stack allocations small (all stackallocs for a single method should typically add up to not more than 1024 bytes) and to never make them "dynamic" in length (instead rounding up to the largest buffer size).

Dynamic length works well if you reliably know your data source.

@tannergooding
Member Author

Dynamic lengths function as intended in many scenarios. However, they can lead to various issues including hurting performance and potentially opening yourself up to security problems (even if the data source is known).

There are multiple recommendations in this space that are effectively industry standard and they allow you to achieve the same overall thing without introducing the same risks. Those industry standards and recommendations should be considered alongside any API exposed here or future work done by the runtime to enable new scenarios.

@EgorBo
Member

EgorBo commented Feb 5, 2025

The example from #112178 adds a motivation to introduce the API 🙂
[image attachment omitted]

@tannergooding
Member Author

tannergooding commented Feb 5, 2025

@jkotas how would we feel about the following shape (possibly with a better name than StackallocOrRentedBuffer):

namespace System.Runtime.CompilerServices
{
    public static unsafe partial class Unsafe
    {
        public static int DefaultStackallocThreshold { get; }

        // TODO: Determine which compiler attributes or keywords need to be added for correct lifetime tracking.

        [Intrinsic]
        public static Span<T> StackallocOrCreateArray<T>(int length) => StackallocOrCreateArray<T>(length, DefaultStackallocThreshold);

        [Intrinsic]
        public static Span<T> StackallocOrCreateArray<T>(int length, int maxStackallocLength) => new T[length];

        [Intrinsic]
        public static StackallocOrRentedBuffer<T> StackallocOrRentArray<T>(int length) => StackallocOrRentArray<T>(length, DefaultStackallocThreshold);

        [Intrinsic]
        public static StackallocOrRentedBuffer<T> StackallocOrRentArray<T>(int length, int maxStackallocLength)
        {
            T[] rentedArray = ArrayPool<T>.Shared.Rent(length);
            return new StackallocOrRentedBuffer<T>(rentedArray.AsSpan(0, length), rentedArray);
        }
    }

    public ref struct StackallocOrRentedBuffer<T>
    {
        private Span<T> _buffer;
        private T[]? _rentedArray;

        public StackallocOrRentedBuffer(Span<T> buffer, T[]? rentedArray = null)
        {
            _buffer = buffer;
            _rentedArray = rentedArray;
        }

        public static implicit operator Span<T>(StackallocOrRentedBuffer<T> value) => value._buffer;

        public static implicit operator ReadOnlySpan<T>(StackallocOrRentedBuffer<T> value) => value._buffer;

        public void Dispose()
        {
            if (_rentedArray is not null)
            {
                ArrayPool<T>.Shared.Return(_rentedArray);
                _rentedArray = null;
            }
        }
    }
}

This:

  • covers the desire to allow for renting or for allocating a new array
  • gives a way to centrally manage a default threshold while also giving users control to customize it for special scenarios
  • allows users who rent to utilize using ... so that a try/finally return of the rented buffer occurs -or- to call Dispose themselves, without worrying about whether it was actually rented
  • has a safe managed implementation that never stackallocates, avoiding the issue of needing language support
  • can be handled intrinsically by the JIT, such that we logically emit the (length <= threshold) ? stackalloc[] : new T[]

The biggest complexity here is getting the JIT to "call" ArrayPool<T>.Shared.Rent for the StackallocOrRent scenario. However, there are a few ways that can be handled. For example, we could avoid the complexity by recognizing the call and effectively having the JIT transform it from GT_CALL into (length <= threshold) ? stackalloc[] : GT_CALL, and simply allowing the inliner to inline the actual ArrayPool<T>.Shared.Rent call itself. This should be relatively easy for the importer to do and wouldn't require more significant JIT/VM changes to support.

@tannergooding
Member Author

Such a shape would simplify the example @EgorBo just shared down to:

StackallocOrRentedBuffer<uint> powersOf1e9Buffer = Unsafe.StackallocOrRentArray<uint>(powersOf1e9BufferLength);

// ...

powersOf1e9Buffer.Dispose();

@tannergooding
Member Author

tannergooding commented Feb 5, 2025

Another benefit of such APIs would be that the Roslyn compiler could use them instead of us exposing InlineArray2/3/4/5<T>: #111973

The JIT could ensure that the GC tracking is correct for a stack-allocated reference type T and thus avoid the need to expose concrete types for what is an internal implementation detail of the code around params Span<T>

@hamarb123
Contributor

hamarb123 commented Feb 5, 2025

Such a shape would simplify the example @EgorBo just shared down to:

StackallocOrRentedBuffer<uint> powersOf1e9Buffer = Unsafe.StackallocOrRentArray<uint>(powersOf1e9BufferLength);

// ...

powersOf1e9Buffer.Dispose();

Non-critical dispose would additionally allow it to be simplified down to this :)

using StackallocOrRentedBuffer<uint> powersOf1e9Buffer = Unsafe.StackallocOrRentArray<uint>(powersOf1e9BufferLength);

I generally agree with the idea that they shouldn't be returned in the case of an exception (as you might have it stored somewhere; although with this API design, you could write the lifetimes such that it's not possible, I think; it may already be that way, but I haven't checked for sure), and there's the benefit of no try/finally for what most developers are likely to type (the using), if we're being honest.

Note: this is just a nice-to-have for the most part imo, as opposed to something that I think should block this proposal.

ArrayPool<T>.Shared.Return(_rentedArray);

Are we sure we don't want to clear ever? I think at least an overload of dispose with clear as an option would be good, and perhaps the default should be true for managed types also. (e.g., with the current design as written, you might store a reference into it, have an exception thrown before you got around to clearing it, return the array without clearing it in the finally, and then potentially leak the reference as a result)

Another benefit of such APIs would be that the Roslyn compiler could use them instead of us exposing InlineArray2/3/4/5<T>: #111973

The JIT could ensure that the GC tracking is correct for a stack-allocated reference type T and thus avoid the need to expose concrete types for what is an internal implementation detail of the code around params Span<T>

I don't agree with this usage personally, for 2 reasons: 1. (the trivially solvable one) if we do this, we should also expose an Unsafe.Stackalloc<T> imo; 2. my understanding is that there are benefits to using a proper normal local over dynamically stack-allocated memory, so unless the suggestion is for Roslyn to just pre-emptively stackalloc & store in a local in advance for the whole duration of the method (which has its own issues, as this introduces hidden allocations or cost for using the pool when you didn't want/need it), then my understanding is that it would potentially regress some scenarios.

If we were to do something like this, imo it should be reserved for only high item count / total size scenarios, or dynamic scenarios (which would be questionable to do implicitly regardless).

Beyond those comments, I personally think the proposal is good :)

@EgorBo
Member

EgorBo commented Feb 5, 2025

I like the proposed API shape just to simplify what we already have. The question is: can we make it safer? Can it avoid using the array pool under the hood if it sees that the pooled array escapes outside of the scope (using inter-procedural analysis)? Or at least help developers diagnose issues where it does?

@tannergooding
Member Author

Are we sure we don't want to clear ever? I think at least an overload of dispose with clear as an option would be good, and perhaps the default should be true for managed types also. (e.g., with the current design as written, you might store a reference into it, have an exception thrown before you got around to clearing it, return the array without clearing it in the finally, and then potentially leak the reference as a result)

Clearing would require tracking extra state. It could be an option, but it's also not something the BCL normally does, and it's not the default for ArrayPool<T>.

If you need to clear, it's not difficult to do buffer.Clear() or similar (same as you might opt to do on allocation). But we also aren't blocked from adding such an overload in the future if enough requests come in.
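e.g. a sketch against the shape proposed above (names not final):

using StackallocOrRentedBuffer<object> buffer = Unsafe.StackallocOrRentArray<object>(length);
Span<object> span = buffer;

// ... use span ...

// Explicitly drop any stored references before Dispose hands a rented array back
// to the pool (note this line is skipped if an exception is thrown earlier, which
// is the concern raised above).
span.Clear();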

If we were to do something like this, imo it should be reserved for only high item count / total size scenarios, or dynamic scenarios (which would be questionable to do implicitly regardless).

The JIT can optimize constant sizes to not emit the fallback path, because it statically knows it's under the threshold. This makes it zero cost for anything that isn't dynamic. If it is dynamic, then you want the branch to avoid the potential danger of stack overflow.

Thus, there is no need for just an Unsafe.Stackalloc<T>: you can either use stackalloc, because it's "safe" for unknown lengths of unmanaged types, or you use the StackallocOr* API with reference types, because it's not safe for dynamic lengths.

@hamarb123
Contributor

hamarb123 commented Feb 5, 2025

Are we sure we don't want to clear ever? I think at least an overload of dispose with clear as an option would be good, and perhaps the default should be true for managed types also. (e.g., with the current design as written, you might store a reference into it, have an exception thrown before you got around to clearing it, return the array without clearing it in the finally, and then potentially leak the reference as a result)

Clearing would require tracking extra state. It could be an option, but it's also not something the BCL normally does, and it's not the default for ArrayPool<T>.

If you need to clear, it's not difficult to do buffer.Clear() or similar (same as you might opt to do on allocation). But we also aren't blocked from adding such an overload in the future if enough requests come in.

It could simply pass RuntimeHelpers.IsReferenceOrContainsReferences<T>() by default (this doesn't require extra state) (this is what I meant by "and perhaps the default should be true for managed types also" 😁), and have an overload of Dispose (or something named similarly, if we don't want to confuse people by making it the same sig as the Dispose(bool disposing) pattern) that allows the user to specify if they want different behaviour.

The JIT can optimize constant sizes to not emit the fallback path, because it statically knows it's under the threshold. This makes it zero cost for anything that isn't dynamic. If it is dynamic, then you want the branch to avoid the potential danger of stack overflow.

Thus, there is no need for just an Unsafe.Stackalloc<T>: you can either use stackalloc, because it's "safe" for unknown lengths of unmanaged types, or you use the StackallocOr* API with reference types, because it's not safe for dynamic lengths.

And this doesn't run into issues with causing pessimisation over the existing solution in other dynamic scenarios, like calling a function with params ROS<object/T> in a loop or in a branch (or both)? My understanding was that this is part of the reason (but not the whole reason, obviously) that we added InlineArray in the first place, as opposed to just having an API to stackalloc any type that the C# compiler could call. If so, then that sounds fine :)

@tannergooding
Member Author

And this doesn't run into issues with causing pessimisation over the existing solution in other dynamic scenarios, ...?

It should not. For static lengths and/or for the intrinsic API in particular there should be nothing preventing us from hoisting such a case since we know the scoping of it due to the lifetime. There's also nothing preventing Roslyn from doing said hoisting itself since it would already be something implicitly done behind the scenes.

There are tradeoffs; it's just a suggestion for how this could also benefit that scenario.

It could simply pass RuntimeHelpers.IsReferenceOrContainsReferences<T>() by default (this doesn't require extra state)

Yes, but it's tradeoffs, and it's largely irrelevant to the basic shape getting a nod of approval so we can finally take it to API review. We can discuss additional concepts like clearing in API review and decide if a third overload or parameter is warranted.

@jkotas
Member

jkotas commented Feb 5, 2025

@jkotas how would we feel about the following shape

It is still an unsafe API with similar problems as ValueStringBuilder. I think we should be shooting for a construct that makes it impossible to introduce a memory safety bug.

@MitchRazga

It is still an unsafe API with similar problems as ValueStringBuilder. I think we should be shooting for a construct that makes it impossible to introduce a memory safety bug.

Perhaps something along the lines of a ref struct destructor?

@tannergooding
Member Author

It is still an unsafe API with similar problems as ValueStringBuilder. I think we should be shooting for a construct that makes it impossible to introduce a memory safety bug.

@jkotas could you elaborate a bit on what you'd be expecting the long-term solution to be, possibly with some pseudo-code?

Notably, I don't see the potential issue with StackallocOrCreateArray<T>(). That is, we only need to provide an analyzer encouraging users to write scoped Span<T> buffer = Unsafe.StackallocOrCreateArray<T>(...) themselves. Because we know it stackallocs or creates an array, there is no need for a disposal pattern, and the scoped keeps it from escaping. A language feature that allows us to declare the signature as scoped Span<T> StackallocOrCreateArray<T>(int length) would be even better and make this implicit, but it isn't "required". So this API seems completely safe to expose and doesn't necessarily have to be on the Unsafe class; it's rather just a power API and this is a convenient place for it.
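For example, a sketch relying only on today's scoped rules:

scoped Span<int> buffer = Unsafe.StackallocOrCreateArray<int>(count);

// ... fill and use buffer locally ...

// With 'scoped', the compiler already rejects attempts to let the span escape,
// e.g. 'return buffer;' is an error because 'buffer' cannot escape the current method.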

With the StackallocOrRentedBuffer you do have the issue of:

scoped StackallocOrRentedBuffer<T> buffer = Unsafe.StackallocOrRentArray<T>(...);
scoped Span<T> span = buffer;

buffer.Dispose();
span[0] = ...; // mutating a buffer that's been returned to the array pool

This issue exists because of how Dispose works on structs in general, and there isn't really a "safe" option for it without the language exposing a much more expensive language feature around ownership/lifetimes and move-only semantics (you need to ensure prior instances become dead on copy and need to ensure that the language understands Dispose "ends" the scope early, so any derived buffers also end at that point in time).

While this danger does exist on the latter, the same danger notably already exists in the paths that would use it, just hidden behind a much more convoluted API surface. So it seems like exposing such an API is still improving safety and would be worth it for power users, given the other feature may still be years out.

Short of a language feature, I could see the signature being Span<T> StackallocOrRentArray<T>(int length) to avoid this issue (with the same note about an analyzer and requiring the assignment involve scoped), but it would require the JIT to internally emit a T[] local and clean up such values in the function epilogue. This seems doable, just a bit more expensive to achieve.

@jaredpar
Member

jaredpar commented Feb 5, 2025

That is, we only need to provide an analyzer encouraging users to write scoped Span<T> buffer = Unsafe.StackallocOrCreateArray<T>(...) themselves.

Rather than an analyzer, I think we'd want to encode this into the compiler. This has come up a few times in the past. Essentially, how can the runtime mark an API that is ref struct returning as "cannot escape the calling method"? Once that is in place, APIs like this could be tagged with the attribute and it would just fall into our existing ref safety logic. You don't even need an explicit scoped because the compiler will infer the lifetime correctly to be scoped. It's a fairly low-cost item for the compiler; we've just never had the runtime motivation to go through with it.

Short of a language feature, ...

I'm somewhat wary of taking this entire idea and encoding it as a language feature. Basically, I'm hesitant about having Span<byte> b = malloc(...). That is putting too much logic into the compiler that feels like it should really be in the runtime. I'm definitely supportive, though, of ideas like I mentioned above where we can help give you the tools to enforce APIs like this.

@rickbrew
Contributor

rickbrew commented Feb 5, 2025

Yeah I'd love to have something like this but it would probably need even more compiler support

using StackAllocOrCustomAllocBuffer<T> buffer = Unsafe.StackAllocOrCustomAlloc<T>(length, static len => MyCustomWin32HeapAllocator(len));

@EgorBo
Member

EgorBo commented Feb 5, 2025

I presume the perfect solution is something like this:

byte[] buffer = Buffer.AllocateTempBuffer(len);

and the JIT/C# either uses stackalloc (if it's not inside a loop and the size is not too big), ArrayPool (if it can guarantee that the buffer never escapes the current scope and the Return is correctly placed; this is tricky as it most likely requires inter-procedural analysis), or heap alloc otherwise.

Maybe even just byte[] buffer = new byte[len]; Although, in this case it might be tricky/against the spec to use ArrayPool internally.

Meanwhile, Tanner's impl looks good as an internal helper to replace all the ugly code we have in the BCL, IMO.

@jaredpar
Member

jaredpar commented Feb 5, 2025

byte[] buffer = Buffer.AllocateTempBuffer(len);

If the issue is around determining escape, why a byte[] vs. a Span<byte>? The latter with compiler help should be much easier to guarantee about escape.

@EgorBo
Member

EgorBo commented Feb 5, 2025

byte[] buffer = Buffer.AllocateTempBuffer(len);

If the issue is around determining escape, why a byte[] vs. a Span<byte>? The latter with compiler help should be much easier to guarantee about escape.

Good point. Although, say I have:

void Test(bool cond)
{
    char tmp = 'x';
    ref char c = ref tmp;
    if (cond)
    {
        Span<char> span = Buffer.AllocateTempBuffer(100000);
        c = ref span[0];  // escapes the current scope?
        // ArrayPool.Return is no longer legal here?
    }
    c = 'x'; // out of scope
}

@tannergooding
Member Author

I presume the perfect solution is something like this:

At the end of the day, there's a balance between fine tuning perf by hand and relying on implicit optimizations. Within implicit optimizations there is then a further balance between things that can be statically relied on and things which require heuristics.

For things we can statically guarantee, I think it's completely acceptable to rely on the implicit optimizations. This is especially true for things like standard for loops allowing bounds check elimination as the code stays idiomatic, safe, and achieves the desired performance. I likewise imagine there are some scenarios where the JIT can better understand lifetimes by knowing things are ref structs and similar and so there are some viable optimizations that can be done here statically.

However, for things that rely on heuristics, like deciding whether a T[] can be stack allocated based on escape analysis, I don't think this is a viable solution for users writing perf-critical code; it is instead simply a "nice to have" for all other regular, non-perf-critical code. That is, things like BigInteger or other framework APIs are intentionally choosing to stackalloc within a given threshold, and the places this buffer is used are then often in the context of complex code where it is passed through non-inlined or large callsites, so the escape analysis will fail. We could of course try to track additional metadata per method and build up a form of whole-program analysis as we get into Tier 1 or even some Tier 2, but this adds a lot of overhead and expense that I expect is unlikely to pay off. I expect it's especially unlikely to pay off if we're already hand tuning known perf-sensitive scenarios with the relevant helpers.

There's also the consideration that by trying to optimize all code to the extreme, we can actually reduce the performance of the whole application. The stack is intentionally small/limited, CPUs have hardware level features that assume call depth is smallish and that assume the number of stack slots used per call is relatively small. They have features like tracking the last 12-15 return addresses so that it is cheap to leave a method. They have features like mirroring stack spills to the larger internal register file to avoid the penalties with memory accesses or polluting L1 cache for something that is only ever going to be CPU local. If you start stack allocating or start renting from the pool implicitly for all cases of new T[], then you actually risk reducing performance for the application and leaving nothing available for the bits of code that were hand tuned to utilize those features instead. -- For example, code that was expecting to rent from the pool for perf may find it drained because code that didn't actually benefit from using the pool went and got the available allocations first. It's all a balance and sometimes it will win and sometimes it won't, but the things that were explicitly tuned are the ones we actually care about being fast.

So at the end of the day, I don't think we're ever going to be getting rid of features like the one proposed here; they will always exist and always have a place for use. I think the right thing, then, is trying to ensure that when they are exposed we strike the right balance of ensuring they are safe and understandable, and help users write more idiomatic-looking code while still being explicit that they want a particular optimization to kick in.

@jaredpar
Member

jaredpar commented Feb 6, 2025

Good point. Although, say I have:

void Test(bool cond)
{
    char tmp = 'x';
    ref char c = ref tmp;
    if (cond)
    {
        // With compiler integration this would be implicitly scoped. 
        Span<char> span = Buffer.AllocateTempBuffer(100000);
        
        // This is now illegal because you're attempting to escape `span` to a broader
        // scope
        c = ref span[0];  // escapes the current scope?
  
    }
    c = 'x'; // never get here 
}

Can see the impact by making span explicitly scoped (sharplab)

Note: my assumption here is that AllocateTempBuffer would want to have the lifetime of the current scope. If the intent is that it's always the current method, then that is equally doable and the sample would start compiling. That does complicate a bit when the free happens. But in terms of lifetime, it's not much different to add both options if we decide to go this route.

@EgorBo
Member

EgorBo commented Feb 6, 2025

Can see the impact by making span explicitly scoped

The question is: can you enforce it? Or is the safety going to be based on an analyzer that will suggest the user put scoped there?

Ah, the answer is here #52065 (comment)

@jaredpar
Member

jaredpar commented Feb 6, 2025

The question is: can you enforce it? Or is the safety going to be based on an analyzer that will suggest the user put scoped there?

Ah, the answer is here #52065 (comment)

Correct. My suggestion was that the language provide a mechanism to let the runtime dictate that a ref struct returning method has a specific lifetime. Imagine, for example, there was an attribute that said [LifetimeCurrentMethod] or [LifetimeCurrentScope] that the compiler would recognize as part of its lifetime rules.

Span<byte> Use1() => M1(); // Error: M1 is safe to escape to current method
Span<byte> Use2() => M2(); // okay

[LifetimeCurrentMethod]
Span<byte> M1() => ...
Span<byte> M2() => ...;

The attributes would only be allowed to restrict returned lifetimes, not increase them.

@jkotas
Member

jkotas commented Feb 10, 2025

@jkotas could you elaborate a bit on what you'd be expecting the long-term solution to be, possibly with some pseudo-code?

The discussion between Jared and Egor above is a good example of the kind of solutions we should be thinking about.

it seems like exposing such an API is still improving safety and would be worth it for power users, given the other feature may still be years out.

Replacing unsafe APIs with slightly safer unsafe APIs is not very appealing to me. We can do it occasionally if there is strong motivation, but it won't get us to where we need to be. Memory safety is not like performance, where an incremental approach works fine.

feature may still be years out

I expect that we will need to be prioritizing memory safety related features higher compared to where we were in the past. Message from our CEO: Prioritizing security above all else.
