-
-
Couldn't load subscription status.
- Fork 23.5k
Metal: Stable argument buffers; GPU rendering crashes; visionOS exports #111976
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metal: Stable argument buffers; GPU rendering crashes; visionOS exports #111976
Conversation
752e821 to
130c7c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Device Profile is now keyed by platform (macOS, iOS, etc), GPU and minimum OS version. This ensures that when generating or baking the shader, it selects the correct features based on the target OS also.
| /*! Track resource and ensure they are resident prior to dispatch or draw commands. | ||
| * | ||
| * The primary purpose of this data structure is to track all the resources that must be made resident prior | ||
| * to issuing the next dispatch or draw command. It aggregates all resources used from argument buffers. | ||
| * | ||
| * As an optimization, this data structure also tracks previous usage for resources, so that | ||
| * it may avoid binding them again in later commands if the resource is already resident and its usage flagged. | ||
| */ | ||
| struct API_AVAILABLE(macos(11.0), ios(14.0), tvos(14.0)) ResourceTracker { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixes GPU corruption / crashes by tracking resource usage and ensuring they are resident prior to each command (draw, dispatch, etc)
| void resolve_texture(RDD::TextureID p_src_texture, RDD::TextureLayout p_src_texture_layout, uint32_t p_src_layer, uint32_t p_src_mipmap, RDD::TextureID p_dst_texture, RDD::TextureLayout p_dst_texture_layout, uint32_t p_dst_layer, uint32_t p_dst_mipmap); | ||
| void clear_color_texture(RDD::TextureID p_texture, RDD::TextureLayout p_texture_layout, const Color &p_color, const RDD::TextureSubresourceRange &p_subresources); | ||
| void clear_buffer(RDD::BufferID p_buffer, uint64_t p_offset, uint64_t p_size); | ||
| void copy_buffer(RDD::BufferID p_src_buffer, RDD::BufferID p_dst_buffer, VectorView<RDD::BufferCopyRegion> p_regions); | ||
| void copy_texture(RDD::TextureID p_src_texture, RDD::TextureID p_dst_texture, VectorView<RDD::TextureCopyRegion> p_regions); | ||
| void copy_buffer_to_texture(RDD::BufferID p_src_buffer, RDD::TextureID p_dst_texture, VectorView<RDD::BufferTextureCopyRegion> p_regions); | ||
| void copy_texture_to_buffer(RDD::TextureID p_src_texture, RDD::BufferID p_dst_buffer, VectorView<RDD::BufferTextureCopyRegion> p_regions); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the implementation of these from the RenderingDeviceDriverMetal into MDCommandBuffer, for consistency
|
|
||
| public: | ||
| uint32_t index; | ||
| id<MTLBuffer> arg_buffer = nil; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we have a single argument buffer per uniform set vs 100s or more
| return blit.encoder; | ||
| } | ||
|
|
||
| _FORCE_INLINE_ static MTLSize mipmapLevelSizeFromTexture(id<MTLTexture> p_tex, NSUInteger p_level) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following block was moved from the RenderingDeviceDriverMetal into here, to be consistent with the other functions.
| switch (device_profile->platform) { | ||
| case MetalDeviceProfile::Platform::macOS: { | ||
| parts.push_back("-mtargetos=macos" + device_profile->min_os_version.to_compiler_os_version()); | ||
| break; | ||
| } | ||
| case MetalDeviceProfile::Platform::iOS: { | ||
| parts.push_back("-mtargetos=ios" + device_profile->min_os_version.to_compiler_os_version()); | ||
| break; | ||
| } | ||
| case MetalDeviceProfile::Platform::visionOS: { | ||
| parts.push_back("-mtargetos=xros" + device_profile->min_os_version.to_compiler_os_version()); | ||
| break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to account for visionOS when generating Metal binaries
|
|
||
| typedef LocalVector<ReflectUniform> ReflectDescriptorSet; | ||
|
|
||
| struct ReflectShader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We define the reflect objects in the Shader Container, so that data flows outwards from Shader Container. It allows us to evolve what we reflect that is passed to the driver-specific shader containers.
Further, the ReflectShader type is passed to the driver-specific implementations to inspect the reflected SPIR-V.
Previously we were traversing the reflected SPIR-V and constructing RDD::ShaderReflection, which is used by the drivers and RenderingDriver. We were also using ShaderReflection to construct the internal state of the RenderingShaderContainer and also constructing the ShaderReflection from the internal state. We wanted to add more metadata to ShaderReflection, so Metal could build stable bindings, but that would mean changing ShaderReflection.
| } else if (os_name == U"visionOS") { | ||
| min_os_version = (String)p_preset->get("application/min_visionos_version"); | ||
| profile = MetalDeviceProfile::get_profile(MetalDeviceProfile::Platform::visionOS, MetalDeviceProfile::GPU::Apple8, min_os_version); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure we can bake shaders for visionOS
|
❤️ Thanks for the feedback, @AThousandShips – will incorporate all your changes! |
02afa1b to
14fa0a2
Compare
14fa0a2 to
7660797
Compare
|
Thanks @AThousandShips – all your feedback has been incorporated |
7660797 to
1f183b1
Compare
| MDRenderPass(Vector<MDAttachment> &p_attachments, Vector<MDSubpass> &p_subpasses); | ||
| }; | ||
|
|
||
| struct BindingCache { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The BindingCache is used to avoid redundant binding calls to a MTLCommandEncoder
| class API_AVAILABLE(macos(11.0), ios(14.0), tvos(14.0)) DynamicOffsets { | ||
| uint32_t data; | ||
|
|
||
| public: | ||
| _FORCE_INLINE_ uint32_t get_frame_index(const DynamicOffsetLayout &p_layout) const { | ||
| return data; | ||
| } | ||
| }; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed dead code from #111183
| // A type used to encode resources directly to a MTLCommandEncoder | ||
| struct DirectEncoder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows us to greatly simplify the direct binding code, but unifying MTLRenderCommandEncoder and MTLComputeCommandEncoder binding and caching
1f183b1 to
efb8003
Compare
efb8003 to
97c17ae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go ahead with this.
I still have a bit of reservation about the duplication between RDC and ShaderContainer that this introduced. But I understand your rationale for it and can't think of a better option. I don't want to block this work due to my hesitation since it is most likely a result of my lack of familiarity with the ShaderContainer code.
So to move this forward I suggest that we merge this as-is. Then, when Dario is back from vacation, I will ask him to take a look as well and point out if there are any potential issues, or perhaps a better way to avoid the duplication that neither of us are seeing.
|
Thanks! |
Thanks @clayjohn – and I agree. I will spend some time looking at how this could be improved as a more targeted PR that doesn't have as many broad changes. I realise this turned into a large change, which wasn't my intention or at all ideal, but there were many stones unturned… |
Supersedes #110683
Note
Some of the changes were moving code from Metal 3-specific files into a common file, to allow for reuse when adding Metal 4 support.
Summary
The PR addresses the following bugs and regressions:
useResources:APIs #110683The PR adds the following improvements and optimisations:
MTLEventrather than callbacks to handle frame synchronisation on supported OSsdebugPrintfEXTin Metal – which is propagated through. See this for more infoDetails
FIX: Rendering artefacts and GPU crashes
The correct usage of
useResources:count:usage:stages:anduseResources:count:usage:was previously misunderstood, assuming that all resources must be made resident before callingendEncodingon theMTLCommandEncoder. The documentation is clear that resources used by subsequent draw calls must be made resident before encoding the draw or dispatch command:Note
Stable argument buffers reduced the complexity and CPU resources required to manage this data.
FIX: Unable to export and bake shaders for visionOS
visionOS was omitted from the shader baking export, so no shaders were baked and Godot would generate errors.
FIX: Incorrect Metal Shader Language and OS feature targeting
When Metal shaders are generated from SPIR-V and available features determined, only two variables were considered:
However, the minimum OS target version must also be considered, as certain APIs and Metal language features may be unavailable. Improved the Metal shader container to capture all three to determine the available features and what shader features should be generated.
The shader features and then passed to the
RenderingDeviceDriverMetalto ensure it only uses the features specified in the generated shader.Note
In a future PR, we will add support for baking multiple shader versions, so that the target system can choose the best available based on the OS and GPU.
FIX: Performance regression using UMA buffers
UMA buffers for Metal does not use argument buffers when using a UMA buffer, which is all canvas 2D rendering. With the previous implementation, all slots were updated every time each time a uniform set changed. For 2D rendering, when a texture changes frequently, this resulted in costly calls to the Metal command encoder to encode all slots, even if it was only the texture, and possibly the sampler, had changed. This update caches the slots that have changed, so only the minimal Metal binding calls are executed. This should improve performance across the board for all devices using direct / slot binding in Metal
IMPROVEMENT: Reduce memory and CPU usage
The changes to use stable argument buffer bindings means that Metal shaders generated from SPIR-V now produce consistent argument buffer layouts across shader versions and pipeline stages, by using the information from the
RenderingShaderContainer. This class has had some improvements to include additional reflected data that is passed to the device-specific shader containers.By ensuring argument buffer layout is consistent, we no longer have to generate an argument buffer per shader version and stage, which reduces the calculation and layout of 100s per shader variant, in some cases! This was happening for every material in the Bistro demo, which had 100s of materials. That resulted in unique argument buffers for every shader material.
Important
These changes are also preparation for adding Metal 4 support in the future
These changes had small improvements across the board for the Godot reflection benchmark.
FPS
GPU times
Memory improvements
Savings of about 1MB with fewer argument buffer allocations