@stuartcarnie stuartcarnie commented Oct 24, 2025

Supersedes #110683

Note

Some of the changes move code from Metal 3-specific files into a common file, so that it can be reused when Metal 4 support is added.

Summary

The PR addresses the following bugs and regressions:

  • Rendering artefacts and GPU crashes for all Apple Silicon
  • Unable to export and bake shaders for visionOS
  • Incorrect Metal Shader Language and OS feature targeting in shader baker
  • Performance regression using UMA buffers

The PR adds the following improvements and optimisations:

  • Reduces memory and CPU usage by generating stable argument buffer bindings for shaders
  • Uses lodBias on 26.0+ OSs and removes the warning "Metal does not support LOD bias for samplers."
  • Uses GPU-encoded MTLEvent rather than callbacks to handle frame synchronisation on supported OSs
  • Adds support for debugPrintfEXT in Metal – which is propagated through. See this for more info

Details

FIX: Rendering artefacts and GPU crashes

The correct usage of useResources:count:usage:stages: and useResources:count:usage: was previously misunderstood: the implementation assumed that resources only had to be made resident at some point before calling endEncoding on the MTLCommandEncoder. The documentation is clear that resources used by a draw or dispatch command must be made resident before that command is encoded:

You can make multiple resources resident (available in GPU memory) for the remaining duration of the render pass by calling this method. Call the method before encoding draw calls that may access the elements of resources through an argument buffer. The method ensures each resource is in a format that’s compatible with the shaders that depend on it.

Note

Stable argument buffers reduced the complexity and CPU resources required to manage this data.


FIX: Unable to export and bake shaders for visionOS

visionOS was omitted from the shader baking export, so no shaders were baked and Godot would generate errors.


FIX: Incorrect Metal Shader Language and OS feature targeting

When Metal shaders are generated from SPIR-V and the available features are determined, only two variables were considered:

  • Minimum GPU
  • Metal language version

However, the minimum OS target version must also be considered, as certain APIs and Metal language features may be unavailable. The Metal shader container now captures all three to determine the available features and which shader features should be generated.

The shader features are then passed to the RenderingDeviceDriverMetal to ensure it only uses the features specified in the generated shader.
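As a rough illustration of why the minimum OS has to be part of the key, the sketch below derives a feature set from a profile made of platform, GPU and minimum OS version. All type and function names here (DeviceProfile, ShaderFeatures, features_for) are hypothetical and do not mirror the PR's actual classes; the only rule encoded is the one stated in the summary, that lodBias is used on 26.0+ OSs.

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical stand-ins for illustration; not the types used in the PR.
enum class Platform { macOS, iOS, visionOS };
enum class GPUFamily { Apple7, Apple8, Apple9 };

struct OSVersion {
    uint32_t major_version;
    uint32_t minor_version;

    // Lexicographic comparison: major first, then minor.
    bool operator>=(const OSVersion &other) const {
        return std::tie(major_version, minor_version) >=
               std::tie(other.major_version, other.minor_version);
    }
};

// The profile is keyed by all three values, not just GPU and language version.
struct DeviceProfile {
    Platform platform;
    GPUFamily gpu;
    OSVersion min_os;
};

// Features recorded alongside the generated shader, so the driver can
// restrict itself to what the shader was actually built for.
struct ShaderFeatures {
    bool sampler_lod_bias;
};

ShaderFeatures features_for(const DeviceProfile &profile) {
    ShaderFeatures features{};
    // Assumption taken from the PR summary: lodBias is available on 26.0+ OSs.
    features.sampler_lod_bias = profile.min_os >= OSVersion{26, 0};
    return features;
}
```

With this shape, two profiles that share a GPU but differ in minimum OS version can yield different feature sets, which is the case the previous two-variable approach could not express.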

Note

In a future PR, we will add support for baking multiple shader versions, so that the target system can choose the best available based on the OS and GPU.

FIX: Performance regression using UMA buffers

The Metal driver does not use argument buffers when a UMA buffer is used, which is the case for all canvas 2D rendering. With the previous implementation, all slots were updated each time a uniform set changed. For 2D rendering, where a texture changes frequently, this resulted in costly calls to the Metal command encoder to encode all slots, even if only the texture, and possibly the sampler, had changed. This update caches the slots that have changed, so only the minimal Metal binding calls are executed. This should improve performance across the board for all devices using direct / slot binding in Metal.
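The caching idea described above can be sketched without any Metal types. The example below is a minimal illustration under stated assumptions: FakeEncoder is a stand-in that only counts bind calls, SlotBindingCache is a hypothetical name, and resource handles are plain integers. It is not the BindingCache implementation from the PR.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Stand-in for a command encoder: it only counts how often it is called.
struct FakeEncoder {
    int texture_binds = 0;
    int sampler_binds = 0;
    void set_texture(uint64_t /*handle*/, std::size_t /*slot*/) { ++texture_binds; }
    void set_sampler(uint64_t /*handle*/, std::size_t /*slot*/) { ++sampler_binds; }
};

// Remembers what is bound at each slot and skips redundant encoder calls.
class SlotBindingCache {
    static constexpr std::size_t kSlots = 16;
    std::array<uint64_t, kSlots> textures_{}; // 0 means "nothing bound"
    std::array<uint64_t, kSlots> samplers_{};

public:
    void bind_texture(FakeEncoder &encoder, uint64_t handle, std::size_t slot) {
        if (slot < kSlots) {
            if (textures_[slot] == handle) {
                return; // Already bound at this slot; nothing to encode.
            }
            textures_[slot] = handle;
        }
        encoder.set_texture(handle, slot);
    }

    void bind_sampler(FakeEncoder &encoder, uint64_t handle, std::size_t slot) {
        if (slot < kSlots) {
            if (samplers_[slot] == handle) {
                return;
            }
            samplers_[slot] = handle;
        }
        encoder.set_sampler(handle, slot);
    }

    // Cached state is only valid for one encoder; clear it when a new one starts.
    void reset() {
        textures_.fill(0);
        samplers_.fill(0);
    }
};
```

In the 2D case described above, switching the texture at slot 0 while keeping the same sampler results in one texture call and no sampler call, rather than re-encoding every slot of the uniform set.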


IMPROVEMENT: Reduce memory and CPU usage

The changes to use stable argument buffer bindings mean that Metal shaders generated from SPIR-V now produce consistent argument buffer layouts across shader versions and pipeline stages, by using the information from the RenderingShaderContainer. This class has had some improvements to include additional reflected data that is passed to the device-specific shader containers.

By ensuring the argument buffer layout is consistent, we no longer have to generate an argument buffer per shader version and stage, which in some cases avoids calculating and laying out 100s of argument buffers per shader variant. This was happening for every material in the Bistro demo, which has 100s of materials, and resulted in unique argument buffers for every shader material.
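One way to picture a stable layout is to derive argument buffer indices from the complete reflected uniform set, rather than from whatever subset a given shader version or stage happens to use. The sketch below does that by sorting on binding number. The names (ReflectedUniform, ArgumentSlot, stable_layout) and the indexing scheme are assumptions made for illustration; the PR does not describe its exact scheme in this text.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical reflection record: one uniform in a uniform set.
struct ReflectedUniform {
    uint32_t binding;    // Binding number from the SPIR-V reflection.
    uint32_t slot_count; // How many argument buffer entries it occupies.
};

// Where a binding starts inside the argument buffer.
struct ArgumentSlot {
    uint32_t binding;
    uint32_t first_index;
};

// Builds the layout from the whole set, ordered by binding number, so the
// result does not depend on reflection order or on which stage uses what.
std::vector<ArgumentSlot> stable_layout(std::vector<ReflectedUniform> set_uniforms) {
    std::sort(set_uniforms.begin(), set_uniforms.end(),
              [](const ReflectedUniform &a, const ReflectedUniform &b) {
                  return a.binding < b.binding;
              });

    std::vector<ArgumentSlot> layout;
    layout.reserve(set_uniforms.size());
    uint32_t next_index = 0;
    for (const ReflectedUniform &uniform : set_uniforms) {
        layout.push_back({uniform.binding, next_index});
        next_index += uniform.slot_count;
    }
    return layout;
}
```

Because every shader that shares the uniform set computes the same layout, a single encoded argument buffer per uniform set can serve all of them.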

Important

These changes are also preparation for adding Metal 4 support in the future.

These changes produced small improvements across the board in the Godot reflection benchmark.

  • 4.5 gm is the current 4.5.1 version
  • 4.6 dev args disabled is when argument buffers are disabled, and slot or direct binding is used
  • 4.6 dev args enabled is when argument buffers are enabled

FPS

| Godot Version | Description | FPS Mean | FPS Median | FPS 5% Low | FPS 99% High |
|---|---|---|---|---|---|
| 4.5 | gm | 195 | 185 | 128 | 380 |
| 4.6 dev | args disabled | 197 | 186 | 130 | 383 |
| 4.6 dev | args enabled | 199 | 189 | 132 | 392 |

GPU times

| Godot Version | Description | Frames | GPU Time Mean (ms) | GPU Time Median (ms) | GPU Time 99% (ms) |
|---|---|---|---|---|---|
| 4.5 | gm | 4995 | 1.95 | 1.89 | 3.77 |
| 4.6 dev | args disabled | 4995 | 1.94 | 1.91 | 3.20 |
| 4.6 dev | args enabled | 4996 | 1.87 | 1.79 | 3.02 |

Memory improvements

Savings of about 1MB with fewer argument buffer allocations

| Godot Version | Description | GPU Memory Mean (MB) | Process Memory Mean (MB) | Process Memory Max (MB) |
|---|---|---|---|---|
| 4.5 | gm | 788 | 1,585 | 1,590 |
| 4.6 dev | args disabled | 787 | 1,529 | 1,536 |
| 4.6 dev | args enabled | 787 | 1,521 | 1,526 |

@stuartcarnie stuartcarnie force-pushed the metal_stable_bindings branch 9 times, most recently from 752e821 to 130c7c5 Compare October 24, 2025 04:27

The device profile is now keyed by platform (macOS, iOS, etc.), GPU and minimum OS version. This ensures that, when generating or baking the shader, the correct features are also selected for the target OS.

Comment on lines +109 to +117
/*! Track resources and ensure they are resident prior to dispatch or draw commands.
*
* The primary purpose of this data structure is to track all the resources that must be made resident prior
* to issuing the next dispatch or draw command. It aggregates all resources used from argument buffers.
*
* As an optimization, this data structure also tracks previous usage for resources, so that
* it may avoid binding them again in later commands if the resource is already resident and its usage flagged.
*/
struct API_AVAILABLE(macos(11.0), ios(14.0), tvos(14.0)) ResourceTracker {

Fixes GPU corruption / crashes by tracking resource usage and ensuring resources are resident prior to each command (draw, dispatch, etc.).
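The behaviour described in the doc comment above can be approximated with plain containers. The sketch below is an assumption-laden illustration, not the PR's ResourceTracker: ResidencyTracker, ResidencyRequest and the usage flags are hypothetical names, resources are integer handles, and flush() returns the list that a real driver would hand to a useResources call before encoding the next draw or dispatch.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical usage flags, loosely modelled on read/write resource usage.
enum Usage : uint32_t { USAGE_READ = 1, USAGE_WRITE = 2 };

struct ResidencyRequest {
    uint64_t resource;
    uint32_t usage;
};

class ResidencyTracker {
    // Resources that must be declared before the next draw or dispatch.
    std::unordered_map<uint64_t, uint32_t> pending_;
    // Resources already declared in this encoder, with their flagged usage.
    std::unordered_map<uint64_t, uint32_t> resident_;

public:
    // Called when an argument buffer referencing `resource` gets bound.
    void add(uint64_t resource, uint32_t usage) {
        auto it = resident_.find(resource);
        uint32_t known_usage = (it == resident_.end()) ? 0 : it->second;
        if ((known_usage & usage) == usage) {
            return; // Already resident with this usage; skip it.
        }
        pending_[resource] |= usage;
    }

    // Called immediately before encoding a draw or dispatch command.
    std::vector<ResidencyRequest> flush() {
        std::vector<ResidencyRequest> requests;
        requests.reserve(pending_.size());
        for (const auto &entry : pending_) {
            requests.push_back({entry.first, entry.second});
            resident_[entry.first] |= entry.second;
        }
        pending_.clear();
        return requests;
    }

    // Residency lasts for one encoder; forget everything when it ends.
    void reset() {
        pending_.clear();
        resident_.clear();
    }
};
```

The key point is the call order: flush() runs before every draw or dispatch, not once before the encoder ends, and the resident map keeps repeated bindings of the same resource from producing repeated requests.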

Comment on lines +582 to +700
void resolve_texture(RDD::TextureID p_src_texture, RDD::TextureLayout p_src_texture_layout, uint32_t p_src_layer, uint32_t p_src_mipmap, RDD::TextureID p_dst_texture, RDD::TextureLayout p_dst_texture_layout, uint32_t p_dst_layer, uint32_t p_dst_mipmap);
void clear_color_texture(RDD::TextureID p_texture, RDD::TextureLayout p_texture_layout, const Color &p_color, const RDD::TextureSubresourceRange &p_subresources);
void clear_buffer(RDD::BufferID p_buffer, uint64_t p_offset, uint64_t p_size);
void copy_buffer(RDD::BufferID p_src_buffer, RDD::BufferID p_dst_buffer, VectorView<RDD::BufferCopyRegion> p_regions);
void copy_texture(RDD::TextureID p_src_texture, RDD::TextureID p_dst_texture, VectorView<RDD::TextureCopyRegion> p_regions);
void copy_buffer_to_texture(RDD::BufferID p_src_buffer, RDD::TextureID p_dst_texture, VectorView<RDD::BufferTextureCopyRegion> p_regions);
void copy_texture_to_buffer(RDD::TextureID p_src_texture, RDD::BufferID p_dst_buffer, VectorView<RDD::BufferTextureCopyRegion> p_regions);

Moved the implementation of these from RenderingDeviceDriverMetal into MDCommandBuffer, for consistency.


public:
uint32_t index;
id<MTLBuffer> arg_buffer = nil;

Now we have a single argument buffer per uniform set vs 100s or more

return blit.encoder;
}

_FORCE_INLINE_ static MTLSize mipmapLevelSizeFromTexture(id<MTLTexture> p_tex, NSUInteger p_level) {

The following block was moved from the RenderingDeviceDriverMetal into here, to be consistent with the other functions.

Comment on lines +67 to +78
switch (device_profile->platform) {
case MetalDeviceProfile::Platform::macOS: {
parts.push_back("-mtargetos=macos" + device_profile->min_os_version.to_compiler_os_version());
break;
}
case MetalDeviceProfile::Platform::iOS: {
parts.push_back("-mtargetos=ios" + device_profile->min_os_version.to_compiler_os_version());
break;
}
case MetalDeviceProfile::Platform::visionOS: {
parts.push_back("-mtargetos=xros" + device_profile->min_os_version.to_compiler_os_version());
break;

We need to account for visionOS when generating Metal binaries


typedef LocalVector<ReflectUniform> ReflectDescriptorSet;

struct ReflectShader {

We define the reflect objects in the Shader Container, so that data flows outwards from Shader Container. It allows us to evolve what we reflect that is passed to the driver-specific shader containers.

Further, the ReflectShader type is passed to the driver-specific implementations to inspect the reflected SPIR-V.

Previously we were traversing the reflected SPIR-V and constructing RDD::ShaderReflection, which is used by the drivers and RenderingDriver. We were also using ShaderReflection to construct the internal state of the RenderingShaderContainer and also constructing the ShaderReflection from the internal state. We wanted to add more metadata to ShaderReflection, so Metal could build stable bindings, but that would mean changing ShaderReflection.

Comment on lines +47 to +49
} else if (os_name == U"visionOS") {
min_os_version = (String)p_preset->get("application/min_visionos_version");
profile = MetalDeviceProfile::get_profile(MetalDeviceProfile::Platform::visionOS, MetalDeviceProfile::GPU::Apple8, min_os_version);

Ensure we can bake shaders for visionOS

@stuartcarnie stuartcarnie marked this pull request as ready for review October 24, 2025 09:28
@stuartcarnie stuartcarnie requested review from a team as code owners October 24, 2025 09:28
@stuartcarnie

❤️ Thanks for the feedback, @AThousandShips – will incorporate all your changes!

@stuartcarnie stuartcarnie force-pushed the metal_stable_bindings branch from 14fa0a2 to 7660797 Compare October 24, 2025 21:35
@stuartcarnie

Thanks @AThousandShips – all your feedback has been incorporated

@stuartcarnie stuartcarnie force-pushed the metal_stable_bindings branch from 7660797 to 1f183b1 Compare October 26, 2025 20:12
MDRenderPass(Vector<MDAttachment> &p_attachments, Vector<MDSubpass> &p_subpasses);
};

struct BindingCache {

The BindingCache is used to avoid redundant binding calls to a MTLCommandEncoder

Comment on lines -763 to -771
class API_AVAILABLE(macos(11.0), ios(14.0), tvos(14.0)) DynamicOffsets {
uint32_t data;

public:
_FORCE_INLINE_ uint32_t get_frame_index(const DynamicOffsetLayout &p_layout) const {
return data;
}
};

Removed dead code from #111183

Comment on lines +929 to +939
// A type used to encode resources directly to a MTLCommandEncoder
struct DirectEncoder {

This allows us to greatly simplify the direct binding code, by unifying MTLRenderCommandEncoder and MTLComputeCommandEncoder binding and caching.

@stuartcarnie stuartcarnie force-pushed the metal_stable_bindings branch from 1f183b1 to efb8003 Compare October 27, 2025 00:14
@stuartcarnie stuartcarnie force-pushed the metal_stable_bindings branch from efb8003 to 97c17ae Compare October 27, 2025 21:45

@clayjohn clayjohn left a comment

Let's go ahead with this.

I still have a bit of reservation about the duplication between RDC and ShaderContainer that this introduced. But I understand your rationale for it and can't think of a better option. I don't want to block this work due to my hesitation since it is most likely a result of my lack of familiarity with the ShaderContainer code.

So to move this forward I suggest that we merge this as-is. Then, when Dario is back from vacation, I will ask him to take a look as well and point out if there are any potential issues, or perhaps a better way to avoid the duplication that neither of us are seeing.

@clayjohn clayjohn added bug and removed enhancement labels Oct 28, 2025
@clayjohn clayjohn modified the milestones: 4.x, 4.6 Oct 28, 2025
@Repiteo Repiteo merged commit 8bae34a into godotengine:master Oct 28, 2025
20 checks passed

Repiteo commented Oct 28, 2025

Thanks!

@stuartcarnie stuartcarnie deleted the metal_stable_bindings branch October 28, 2025 18:36
@stuartcarnie

> I still have a bit of reservation about the duplication between RDC and ShaderContainer that this introduced. But I understand your rationale for it and can't think of a better option. I don't want to block this work due to my hesitation since it is most likely a result of my lack of familiarity with the ShaderContainer code.

Thanks @clayjohn – and I agree.

I will spend some time looking at how this could be improved as a more targeted PR that doesn't have as many broad changes. I realise this turned into a large change, which wasn't my intention or at all ideal, but there were many stones unturned…
