M1/M2: Large matrix multiplications can contain NaNs #381

chengchingwen · 2024-07-04T05:15:39Z

MWE:

julia> a = Metal.randn(10000, 10000);

julia> b = Metal.randn(10000, 10000);

julia> c = a * b';

julia> for i in 1:10
           C = Metal.zeros(Float32, size(a))
           mul!(C, a, b')
           @assert C ≈ c "$i"
       end
ERROR: AssertionError: 1
Stacktrace:
 [1] top-level scope
   @ ./REPL[58]:4

julia> for i in 1:10
           C = Metal.zeros(Float32, size(a))
           mul!(C, a, b')
           @assert C ≈ c "$i"
       end
ERROR: AssertionError: 8
Stacktrace:
 [1] top-level scope
   @ ./REPL[58]:4

julia> for i in 1:10
           @assert a * b' ≈ c "$i"
       end
ERROR: AssertionError: 3
Stacktrace:
 [1] top-level scope
   @ ./REPL[59]:2

julia> for i in 1:10
           @assert a * b' ≈ c "$i"
       end
ERROR: AssertionError: 8
Stacktrace:
 [1] top-level scope
   @ ./REPL[59]:2

chengchingwen · 2024-07-04T06:23:13Z

Adding wait_completed on matmul!'s command buffer does not help.

christiangnrd · 2024-07-04T12:41:46Z

Adding Metal.@sync to the mul! call also does not help. ~~However, I cannot reproduce when calling MPS.matmul! directly.~~

maleadt · 2024-07-05T13:49:33Z

I cannot reproduce at all on Metal.jl#master using an M3 Pro, but it does seem reproducible on an M1 Pro.

I wonder if this is a problem with mapreduce, since you're calling isapprox on GPU arrays. Can you test if calling @assert Array(C) ≈ Array(c) makes things pass? It does here, at least.

tgymnich · 2024-07-05T13:52:16Z

I can reproduce the issue on M1 master. It also looks like all the tasks run on the same queue.

chengchingwen · 2024-07-05T14:10:03Z

The issue was found on an M2 Max. The failure in the MWE only occurs if the arrays are large enough. It seems the subsequent kernel is launched before the matmul has finished. Is it possible that mapreduce is not checking the availability of its input arrays?

p.s. I'm about to board the plane to JuliaCon so I won't be able to test it soon.

maleadt · 2024-07-05T14:21:05Z

> I wonder if this is a problem with mapreduce, since you're calling isapprox on GPU arrays. Can you test if calling @assert Array(C) ≈ Array(c) makes things pass? It does here, at least.

It also reproduces when comparing on the CPU, just much less likely, so this isn't a mapreduce issue.

maleadt · 2024-07-05T14:24:25Z

Looks like a bunch of NaNs in the second matrix.
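A minimal sketch of how the NaN positions could be summarized once a result has been copied back to the host (for example with `Array(C)`). The helper name `nan_summary` is hypothetical and not part of Metal.jl; the demonstration runs on the CPU with NaNs injected by hand.

```julia
# Hypothetical helper: summarize where NaNs occur in a host-side matrix.
function nan_summary(A::AbstractMatrix{<:AbstractFloat})
    idx = findall(isnan, A)              # CartesianIndex of every NaN entry
    isempty(idx) && return (count = 0, rows = nothing, cols = nothing)
    rows = extrema(i[1] for i in idx)    # (first, last) affected row
    cols = extrema(i[2] for i in idx)    # (first, last) affected column
    return (count = length(idx), rows = rows, cols = cols)
end

# CPU-only demonstration: inject a small block of NaNs by hand.
A = ones(Float32, 8, 8)
A[3:4, 5:6] .= NaN32
s = nan_summary(A)
println(s)
```

On a GPU result this would be called as `nan_summary(Array(C))`; whether the NaNs cluster in contiguous rows or columns is the kind of pattern this is meant to expose.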

christiangnrd · 2024-07-05T14:24:43Z

My current MWE is:

using Metal, LinearAlgebra; begin
    n = 10000
    a = mtl(randn(Float32,n,n))
    b = mtl(randn(Float32,n,n))
    C = Metal.zeros(Float32, size(a))
    for i in 1:10
        C = Metal.zeros(Float32, size(a))
        mul!(C,a,b)
        @assert !any(isnan.(C)) "$i"
    end
end

I define C outside the loop so that I can access it afterwards. When I had C .= ... in the loop instead of C = ..., the failure only ever happened at iteration 1. I suspect it has to do with the location of the array in memory.
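For readers unfamiliar with the distinction: `C = ...` rebinds the name to a freshly allocated array, while `C .= ...` writes into the existing one. A CPU-only sketch with plain `Array`s follows; the assumption is that Metal arrays behave the same way at the language level, so a rebinding may place the result in a different buffer on each iteration.

```julia
C = zeros(Float32, 4, 4)
old = C                          # keep a second reference to the original array

C .= 1f0                         # in-place broadcast: writes into the same array
same_after_broadcast = C === old

C = zeros(Float32, 4, 4)         # rebinding: C now names a new allocation
same_after_rebind = C === old

println((same_after_broadcast, same_after_rebind))
```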

maleadt · 2024-07-05T14:44:41Z

> I cannot reproduce when calling MPS.matmul! directly

I can:

using Metal, LinearAlgebra

function main(T=Float32, N=10000)
    a = Metal.rand(T, N, N)
    b = Metal.rand(T, N, N)
    c = a * b'
    synchronize()

    for i in 1:100
        println("Iteration $i")
        d = Metal.zeros(T, size(a))
        MPS.matmul!(d, a, b, #=alpha=#true, #=beta=#false,
                    #=transpose_a=#false, #=transpose_b=#true)
        @assert !any(isnan.(Array(d))) "NaN in iteration $i"

        # XXX: this redundant check is needed, or the failure never occurs
        @assert !any(isnan.(d))
    end
end

isinteractive() || main()

The need for a secondary kernel is very weird.

tgymnich · 2024-07-05T14:58:43Z

It is ~~not~~ MPS related:

 for i in 1:10
       C = Metal.zeros(Float32, size(a))
       GPUArrays.generic_matmatmul!(C, a, b, MulAddMul())
       @assert C ≈ c "$i"
end

maleadt · 2024-07-05T15:13:53Z

> GPUArrays.generic_matmatmul!(C, a, b, MulAddMul())

I don't see how that's related; it's an entirely different kernel. Does it contain NaNs in similar places?
The generic matmatmul kernel, while being extraordinarily slow, doesn't introduce NaNs here.
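One way to approach the "similar places" question is to compare the NaN masks of two host-side copies. A minimal sketch, assuming both results have already been copied back with `Array(...)`; the helper `nan_overlap` is hypothetical.

```julia
# Hypothetical helper: fraction of NaN positions shared by two results
# (intersection over union of the NaN masks; 1.0 if neither has NaNs).
function nan_overlap(A::AbstractArray, B::AbstractArray)
    size(A) == size(B) || throw(DimensionMismatch("results must have the same size"))
    a, b = isnan.(A), isnan.(B)
    either = count(a .| b)           # positions that are NaN in at least one result
    either == 0 && return 1.0
    return count(a .& b) / either    # positions that are NaN in both
end

# CPU-only demonstration with hand-placed NaNs.
A = zeros(Float32, 3, 3); A[1, 1] = NaN32; A[2, 2] = NaN32
B = zeros(Float32, 3, 3); B[2, 2] = NaN32; B[3, 3] = NaN32
overlap = nan_overlap(A, B)
println(overlap)
```

A value near 1 would suggest the two kernels fail in the same locations; a value near 0 would suggest unrelated failure patterns.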

tgymnich · 2024-07-05T15:17:20Z

Just wanted to confirm that it's MPS rather than the synchronisation between kernel launches.

tgymnich · 2024-07-05T15:18:54Z

I've been seeing the NaN issues with large arrays for a long time in #145.

MLX seems fine:

import mlx.core as mx

a = mx.random.normal((10000, 10000))
b = mx.random.normal((10000, 10000))
c = a @ b.T


for i in range(0,10):
    C = a @ b.T
    assert(mx.allclose(C,c))

christiangnrd · 2024-07-12T21:07:04Z

I would love for someone to review my code because I'm not a Swift expert by any means, but I was able to reproduce this in the Swift REPL.

Swift MWE


import Metal 
import MetalPerformanceShaders
 
func main(T: Float.Type = Float32.self, N: Int = 10000) { 
    guard let device = MTLCreateSystemDefaultDevice(), 
          let commandQueue = device.makeCommandQueue() else { 
        fatalError("Metal device or command queue could not be created") 
          } 
     
    print("Initializing a & b") 
    // Fill NxN matrices with ones
    var a = [Float](repeating: 1, count: N * N) 
    var b = [Float](repeating: 1, count: N * N) 
 
    print("a and b created\n") 
    // Metal buffers for matrices 
    let aBuffer = device.makeBuffer(bytes: &a, length: MemoryLayout<Float>.size * N * N, options: []) 
    let bBuffer = device.makeBuffer(bytes: &b, length: MemoryLayout<Float>.size * N * N, options: []) 
 
    print("Starting matmul\n") 
    for i in 1...10 { 
        print(i) 
        print("\n") 
        // Create MPSMatrices 
        let aMatrixDescriptor = MPSMatrixDescriptor(rows: N, columns: N, rowBytes: MemoryLayout<Float>.size * N, dataType: .float32) 
        let bMatrixDescriptor = MPSMatrixDescriptor(rows: N, columns: N, rowBytes: MemoryLayout<Float>.size * N, dataType: .float32) 
 
 
        let aMatrix = MPSMatrix(buffer: aBuffer!, descriptor: aMatrixDescriptor) 
        let bMatrix = MPSMatrix(buffer: bBuffer!, descriptor: bMatrixDescriptor) 
 
        // Matrix multiplication using MPSMatrixMultiplication 
        let matrixMultiplication = MPSMatrixMultiplication(device: device, 
        transposeLeft: false, 
        transposeRight: false, 
        resultRows: N, 
        resultColumns: N, 
        interiorColumns: N, 
        alpha: 1.0, 
        beta: 0.0) 
        let cBuffer = device.makeBuffer(length: MemoryLayout<Float>.size * N * N, options: []) 
        let cMatrixDescriptor = MPSMatrixDescriptor(rows: N, columns: N, rowBytes: MemoryLayout<Float>.size * N, dataType: .float32) 
        let cMatrix = MPSMatrix(buffer: cBuffer!, descriptor: cMatrixDescriptor) 
         
        let commandBuffer = commandQueue.makeCommandBuffer()! 
        matrixMultiplication.encode(commandBuffer: commandBuffer, 
        leftMatrix: aMatrix, 
        rightMatrix: bMatrix, 
        resultMatrix: cMatrix) 
        commandBuffer.commit() 
        commandBuffer.waitUntilCompleted() 

        // Check for NaNs in the result matrix 
        let cPointer = cBuffer!.contents().bindMemory(to: Float.self, capacity: N * N) 
        var j = 0
        while j < N*N {
            if cPointer[j].isNaN {
                fatalError("NaN in iteration \(i)")
            }
            j += 1
        }
    } 
}
 
Output:
Initializing a & b
a and b created

Starting matmul

1


2


3


4


__lldb_expr_3/repl.swift:56: Fatal error: NaN in iteration 4
2024-07-12 17:58:38.583349-0300 repl_swift[1500:21665] __lldb_expr_3/repl.swift:56: Fatal error: NaN in iteration 4
Execution interrupted. Enter code to recover and continue.
Enter LLDB commands to investigate (type :help for assistance.)

tgymnich · 2024-07-12T21:17:42Z

@christiangnrd Your Swift code looks good to me. It turns out MLX doesn’t even use MPS.

maleadt · 2024-07-13T10:24:36Z

Haven't been able to look into this, but here's the ObjC version:

#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#import <MetalPerformanceShaders/MetalPerformanceShaders.h>

void performMatrixMultiplication(NSInteger N) {
    if (N == 0) {
        N = 10000;
    }
    
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    id<MTLCommandQueue> commandQueue = [device newCommandQueue];
    
    if (!device || !commandQueue) {
        NSLog(@"Metal device or command queue could not be created");
        return;
    }
    
    NSLog(@"Initializing a & b");
    // Allocate NxN matrices and fill them with ones
    float *a = calloc(N * N, sizeof(float));
    float *b = calloc(N * N, sizeof(float));
    
    for (NSInteger i = 0; i < N * N; i++) {
        a[i] = 1.0f;
        b[i] = 1.0f;
    }
    
    NSLog(@"a and b created\n");
    // Metal buffers for matrices
    id<MTLBuffer> aBuffer = [device newBufferWithBytes:a length:sizeof(float) * N * N options:MTLResourceStorageModeShared];
    id<MTLBuffer> bBuffer = [device newBufferWithBytes:b length:sizeof(float) * N * N options:MTLResourceStorageModeShared];
    
    NSLog(@"Starting matmul\n");
    for (NSInteger i = 1; i <= 10; i++) {
        NSLog(@"%ld\n", (long)i);
        
        // Create MPSMatrices
        MPSMatrixDescriptor *aMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
                                                                                       columns:N
                                                                                      rowBytes:sizeof(float) * N
                                                                                      dataType:MPSDataTypeFloat32];
        MPSMatrixDescriptor *bMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
                                                                                       columns:N
                                                                                      rowBytes:sizeof(float) * N
                                                                                      dataType:MPSDataTypeFloat32];
        
        MPSMatrix *aMatrix = [[MPSMatrix alloc] initWithBuffer:aBuffer descriptor:aMatrixDescriptor];
        MPSMatrix *bMatrix = [[MPSMatrix alloc] initWithBuffer:bBuffer descriptor:bMatrixDescriptor];
        
        // Matrix multiplication using MPSMatrixMultiplication
        MPSMatrixMultiplication *matrixMultiplication = [[MPSMatrixMultiplication alloc] initWithDevice:device
                                                                                          transposeLeft:NO
                                                                                         transposeRight:NO
                                                                                             resultRows:N
                                                                                          resultColumns:N
                                                                                       interiorColumns:N
                                                                                                 alpha:1.0
                                                                                                  beta:0.0];
        
        id<MTLBuffer> cBuffer = [device newBufferWithLength:sizeof(float) * N * N options:MTLResourceStorageModeShared];
        MPSMatrixDescriptor *cMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
                                                                                       columns:N
                                                                                      rowBytes:sizeof(float) * N
                                                                                      dataType:MPSDataTypeFloat32];
        MPSMatrix *cMatrix = [[MPSMatrix alloc] initWithBuffer:cBuffer descriptor:cMatrixDescriptor];
        
        id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
        [matrixMultiplication encodeToCommandBuffer:commandBuffer
                                         leftMatrix:aMatrix
                                        rightMatrix:bMatrix
                                       resultMatrix:cMatrix];
        [commandBuffer commit];
        [commandBuffer waitUntilCompleted];
        
        // Check for NaNs in the result matrix
        float *cPointer = cBuffer.contents;
        for (NSInteger j = 0; j < N * N; j++) {
            if (isnan(cPointer[j])) {
                NSLog(@"NaN in iteration %ld", (long)i);
                free(a);
                free(b);
                return;
            }
        }
    }
    
    free(a);
    free(b);
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        NSInteger N = 10000;
        if (argc > 1) {
            N = atoi(argv[1]);
        }
        performMatrixMultiplication(N);
    }
    return 0;
}

❯ clang mps.m -o mps -framework Foundation -framework Metal -framework MetalPerformanceShaders -fobjc-arc -mmacosx-version-min=10.13

❯ ./mps
2024-07-13 12:23:11.771 mps[54256:2493528] Initializing a & b
2024-07-13 12:23:11.931 mps[54256:2493528] a and b created
2024-07-13 12:23:12.001 mps[54256:2493528] Starting matmul
2024-07-13 12:23:12.001 mps[54256:2493528] 1
2024-07-13 12:23:13.933 mps[54256:2493528] 2
2024-07-13 12:23:15.477 mps[54256:2493528] 3
2024-07-13 12:23:16.997 mps[54256:2493528] 4
2024-07-13 12:23:18.440 mps[54256:2493528] NaN in iteration 4

tgymnich · 2024-07-13T12:46:51Z

Should we just file a radar / feedback?

maleadt · 2024-07-13T12:59:37Z

I'll have a better look first and forward it to our Apple contact.

maleadt · 2024-08-28T08:22:06Z

Apparently this looks like an ARC bug. Curiously, the ObjC reproducer is "fixed" by adding an @autoreleasepool around the for loop body, but the same doesn't hold in Julia (in fact, the original issue was calling into mul! which is already marked @autoreleasepool).

Of course, the Julia MWE is more complex, as the @assert !any(isnan.(d)) involves two additional kernels...

Still broken Julia MWE

using Metal, LinearAlgebra
using ObjectiveC, .Foundation

function main(T=Float32, N=10000)
    a = Metal.rand(T, N, N)
    b = Metal.rand(T, N, N)
    synchronize()

    for i in 1:100
        @autoreleasepool begin
            println("Iteration $i")
            d = Metal.zeros(T, size(a))
            MPS.matmul!(d, a, b, #=alpha=#true, #=beta=#false,
                        #=transpose_a=#false, #=transpose_b=#false)
            @assert !any(isnan.(Array(d))) "NaN in iteration $i"

            # XXX: this redundant check is needed, or the failure never occurs
            @assert !any(isnan.(d))
        end
    end
end

isinteractive() || main()

"Fixed" ObjC MWE

#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#import <MetalPerformanceShaders/MetalPerformanceShaders.h>

void performMatrixMultiplication(NSInteger N) {
    if (N == 0) {
        N = 10000;
    }

    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    id<MTLCommandQueue> commandQueue = [device newCommandQueue];

    if (!device || !commandQueue) {
        NSLog(@"Metal device or command queue could not be created");
        return;
    }

    NSLog(@"Initializing a & b");
    // Allocate NxN matrices and fill them with ones
    float *a = calloc(N * N, sizeof(float));
    float *b = calloc(N * N, sizeof(float));

    for (NSInteger i = 0; i < N * N; i++) {
        a[i] = 1.0f;
        b[i] = 1.0f;
    }

    NSLog(@"a and b created\n");
    // Metal buffers for matrices
    id<MTLBuffer> aBuffer = [device newBufferWithBytes:a length:sizeof(float) * N * N options:MTLResourceStorageModeShared];
    id<MTLBuffer> bBuffer = [device newBufferWithBytes:b length:sizeof(float) * N * N options:MTLResourceStorageModeShared];

    NSLog(@"Starting matmul\n");
    for (NSInteger i = 1; i <= 100; i++) {
        @autoreleasepool {
            NSLog(@"Iteration %ld\n", (long)i);

            // Create MPSMatrices
            MPSMatrixDescriptor *aMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
                                                                                           columns:N
                                                                                          rowBytes:sizeof(float) * N
                                                                                          dataType:MPSDataTypeFloat32];
            MPSMatrixDescriptor *bMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
                                                                                           columns:N
                                                                                          rowBytes:sizeof(float) * N
                                                                                          dataType:MPSDataTypeFloat32];

            MPSMatrix *aMatrix = [[MPSMatrix alloc] initWithBuffer:aBuffer descriptor:aMatrixDescriptor];
            MPSMatrix *bMatrix = [[MPSMatrix alloc] initWithBuffer:bBuffer descriptor:bMatrixDescriptor];

            // Matrix multiplication using MPSMatrixMultiplication
            MPSMatrixMultiplication *matrixMultiplication = [[MPSMatrixMultiplication alloc] initWithDevice:device
                                                                                              transposeLeft:NO
                                                                                             transposeRight:NO
                                                                                                 resultRows:N
                                                                                              resultColumns:N
                                                                                           interiorColumns:N
                                                                                                     alpha:1.0
                                                                                                      beta:0.0];

            id<MTLBuffer> cBuffer = [device newBufferWithLength:sizeof(float) * N * N options:MTLResourceStorageModeShared];
            MPSMatrixDescriptor *cMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
                                                                                           columns:N
                                                                                          rowBytes:sizeof(float) * N
                                                                                          dataType:MPSDataTypeFloat32];
            MPSMatrix *cMatrix = [[MPSMatrix alloc] initWithBuffer:cBuffer descriptor:cMatrixDescriptor];

            id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
            [matrixMultiplication encodeToCommandBuffer:commandBuffer
                                             leftMatrix:aMatrix
                                            rightMatrix:bMatrix
                                           resultMatrix:cMatrix];
            [commandBuffer commit];
            [commandBuffer waitUntilCompleted];

            // Check for NaNs in the result matrix
            float *cPointer = cBuffer.contents;
            for (NSInteger j = 0; j < N * N; j++) {
                if (isnan(cPointer[j])) {
                    NSLog(@"NaN in iteration %ld", (long)i);
                    free(a);
                    free(b);
                    return;
                }
            }
        }
    }

    free(a);
    free(b);
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        NSInteger N = 10000;
        if (argc > 1) {
            N = atoi(argv[1]);
        }
        performMatrixMultiplication(N);
    }
    return 0;
}

tgymnich · 2024-08-28T15:42:35Z

Couldn't reproduce the ObjectiveC case today, with or without autoreleasepool.
Swift and Julia were still reproducible.

christiangnrd · 2024-08-28T18:11:25Z

I can reproduce the error in both Swift and ObjectiveC and it goes away when surrounded by an autoreleasepool block in both languages.

tgymnich · 2024-08-28T18:17:20Z

Oops. I just overlooked the second autoreleasepool. The first one is actually not necessary (at least to hide our bug.)

christiangnrd · 2024-08-28T18:21:43Z

By "the first one" do you mean the autoreleasepool in main?

christiangnrd · 2024-08-28T18:51:46Z

I'm able to reproduce this without the second redundant check.

Still broken simpler Julia MWE

using Metal, LinearAlgebra
using ObjectiveC, .Foundation

function main(T=Float32, N=10000)
    a = Metal.rand(T, N, N)
    b = Metal.rand(T, N, N)
    synchronize()

    for i in 1:100
        # @autoreleasepool begin
        begin
            println("Iteration $i")
            d = Metal.zeros(T, size(a))
            MPS.matmul!(d, a, b, #=alpha=#true, #=beta=#false,
                        #=transpose_a=#false, #=transpose_b=#false)
            @assert !any(isnan.(Array(d))) "NaN in iteration $i"
        end
    end
end

isinteractive() || main()

tgymnich · 2024-09-25T18:39:14Z

Our NSAutoreleasePool seems to contain roughly the same objects before the NaN check as the objc version from above. The most obvious difference is that the correct objc version has a CaptureMTLDevice and an AGXG13XFamilyComputeContext, whereas we have an AGXG13XFamilyCommandBuffer (could be debug / Xcode related).

iteration 1
objc[6905]: ##############
objc[6905]: AUTORELEASE POOLS for thread 0x203b9b240
objc[6905]: 77 releases pending.
objc[6905]: [0x14300d000]  ................  PAGE  (hot) (cold)
objc[6905]: [0x14300d038]  ################  POOL 0x14300d038
objc[6905]: [0x14300d040]    0x6000004c4860  __NSSingleEntryDictionaryI
objc[6905]: [0x14300d048]    0x6000027ccd20  NSBundle  autorelease count 2
objc[6905]: [0x14300d050]    0x6000004cfd40  __NSDictionaryM  autorelease count 2
objc[6905]: [0x14300d058]    0x600002fc8690  MTLCommandQueueDescriptorInternal
objc[6905]: [0x14300d060]    0x600000ac0090  NSUserDefaults  autorelease count 4
objc[6905]: [0x14300d068]    0x6000004c4b20  __NSSingleEntryDictionaryI
objc[6905]: [0x14300d070]    0x6000004d4660  __NSSingleEntryDictionaryI
objc[6905]: [0x14300d078]    0x6000004d4220  __NSSingleEntryDictionaryI
objc[6905]: [0x14300d080]    0x6000004d46a0  __NSSingleEntryDictionaryI
objc[6905]: [0x14300d088]  ################  POOL 0x14300d088
objc[6905]: [0x14300d090]    0x6000011ce180  MPSMatrixDescriptor
objc[6905]: [0x14300d098]    0x6000011cde00  MPSMatrixDescriptor
objc[6905]: [0x14300d0a0]       0x145809000  AGXG13XDevice  autorelease count 15
objc[6905]: [0x14300d0a8]       0x144105550  CaptureMTLDevice  autorelease count 4
objc[6905]: [0x14300d0b0]    0x6000011cc540  __NSCFString
objc[6905]: [0x14300d0b8]    0x600002ac8d80  __NSCFString
objc[6905]: [0x14300d0c0]    0x600003dcc540  NSPathStore2
objc[6905]: [0x14300d0c8]    0x6000011cc600  __NSBundleTables  autorelease count 3
objc[6905]: [0x14300d0d0]    0x6000027dc140  NSBundle  autorelease count 2
objc[6905]: [0x14300d0d8]    0x6000027cd0e0  NSBundle
objc[6905]: [0x14300d0e0]    0x6000020cc480  NSURL
objc[6905]: [0x14300d0e8]    0x6000035cc500  __NSCFString
objc[6905]: [0x14300d0f0]    0x6000004dd4e0  NSFileManager
objc[6905]: [0x14300d0f8]    0x6000020cc5a0  NSURL
objc[6905]: [0x14300d100]    0x6000035cc280  __NSCFString
objc[6905]: [0x14300d108]    0x6000035cc6e0  __NSCFString  autorelease count 2
objc[6905]: [0x14300d110]    0x6000004df3e0  NSConcreteData
objc[6905]: [0x14300d118]    0x6000027cd810  Swift.__StringStorage
objc[6905]: [0x14300d120]    0x6000027cd860  Swift.__StringStorage
objc[6905]: [0x14300d128]    0x6000027cd8b0  Swift.__StringStorage
objc[6905]: [0x14300d130]    0x6000027cd900  Swift.__StringStorage
objc[6905]: [0x14300d138]    0x6000027cd950  Swift.__StringStorage
objc[6905]: [0x14300d140]    0x6000027cd9a0  Swift.__StringStorage
objc[6905]: [0x14300d148]    0x6000035cc6e0  __NSCFString  autorelease count 6
objc[6905]: [0x14300d150]    0x6000011d4980  MPSMatrixDescriptor
objc[6905]: [0x14300d158]       0x144105550  CaptureMTLDevice  autorelease count 2
objc[6905]: [0x14300d160]    0x6000036ceeb0  AGXG13XFamilyComputeContext
objc[6905]: [0x14300d168]    0x6000011d4b80  __NSCFString
objc[6905]: [0x14300d170]    0x600002acaf80  __NSCFString
objc[6905]: [0x14300d178]    0x600003dcc2a0  NSPathStore2
objc[6905]: [0x14300d180]    0x6000011cc600  __NSBundleTables  autorelease count 3
objc[6905]: [0x14300d188]    0x6000027cd0e0  NSBundle
objc[6905]: [0x14300d190]    0x6000027dc140  NSBundle
objc[6905]: [0x14300d198]    0x6000027cda90  NSBundle  autorelease count 2
objc[6905]: [0x14300d1a0]    0x6000020cc600  NSURL
objc[6905]: [0x14300d1a8]    0x6000035cc8c0  __NSCFString
objc[6905]: [0x14300d1b0]    0x6000004df980  NSFileManager
objc[6905]: [0x14300d1b8]    0x6000020cc6c0  NSURL
objc[6905]: [0x14300d1c0]    0x6000035ccb40  __NSCFString
objc[6905]: [0x14300d1c8]    0x6000035cc960  __NSCFString  autorelease count 2
objc[6905]: [0x14300d1d0]    0x6000004d1e40  NSConcreteData
objc[6905]: [0x14300d1d8]    0x6000027cdbd0  Swift.__StringStorage
objc[6905]: [0x14300d1e0]    0x6000027cdc20  Swift.__StringStorage
objc[6905]: [0x14300d1e8]    0x6000027cdc70  Swift.__StringStorage
objc[6905]: [0x14300d1f0]    0x6000027cdcc0  Swift.__StringStorage
objc[6905]: [0x14300d1f8]    0x6000027cdd10  Swift.__StringStorage
objc[6905]: [0x14300d200]    0x6000027cdd60  Swift.__StringStorage
objc[6905]: [0x14300d208]    0x6000035cc960  __NSCFString  autorelease count 6
objc[6905]: [0x14300d210]       0x144105550  CaptureMTLDevice  autorelease count 2
objc[6905]: [0x14300d218]    0x600000a80330  __NSArrayM
objc[6905]: [0x14300d220]    0x600000a80360  __NSArrayM
objc[6905]: [0x14300d228]    0x6000004d2f40  __NSCFString
objc[6905]: [0x14300d230]    0x6000004d2ec0  __NSCFString
objc[6905]: [0x14300d238]    0x6000004d2ee0  __NSCFString
objc[6905]: [0x14300d240]    0x6000004d2f00  __NSCFString
objc[6905]: [0x14300d248]    0x6000008e7240  __NSCFString
objc[6905]: [0x14300d250]    0x6000008e7090  __NSCFString
objc[6905]: [0x14300d258]    0x6000008e70c0  __NSCFString
objc[6905]: [0x14300d260]    0x6000008e70f0  __NSCFString
objc[6905]: [0x14300d268]    0x6000004d2f20  __NSCFString
objc[6905]: [0x14300d270]    0x6000004d2ea0  __NSCFString
objc[6905]: [0x14300d278]    0x6000004d2fc0  __NSCFString
objc[6905]: [0x14300d280]    0x6000008e6e20  __NSArrayM
objc[6905]: [0x14300d288]    0x6000004d3140  __NSCFNumber
objc[6905]: [0x14300d290]       0x14304b800  __NSCFString
objc[6905]: [0x14300d298]    0x6000020cc780  MTLComputePipelineReflectionInternal
objc[6905]: ##############
iteration 2
objc[36563]: ##############
objc[36563]: AUTORELEASE POOLS for thread 0x203b9b240
objc[36563]: 16 releases pending.
objc[36563]: [0x14080a000]  ................  PAGE  (hot) (cold)
objc[36563]: [0x14080a038]  ################  POOL 0x14080a038
objc[36563]: [0x14080a040]    0x600001f3c5a0  __NSSingleEntryDictionaryI
objc[36563]: [0x14080a048]    0x600003c202d0  NSBundle  autorelease count 2
objc[36563]: [0x14080a050]    0x600001f2e7a0  __NSDictionaryM  autorelease count 2
objc[36563]: [0x14080a058]    0x60000342c0e0  MTLCommandQueueDescriptorInternal
objc[36563]: [0x14080a060]    0x60000112c2a0  NSUserDefaults  autorelease count 4
objc[36563]: [0x14080a068]    0x600001f3cb00  __NSSingleEntryDictionaryI
objc[36563]: [0x14080a070]    0x600001f3c4e0  __NSSingleEntryDictionaryI
objc[36563]: [0x14080a078]    0x600001f3cac0  __NSSingleEntryDictionaryI
objc[36563]: [0x14080a080]    0x600001f3cae0  __NSSingleEntryDictionaryI
objc[36563]: [0x14080a088]  ################  POOL 0x14080a088
objc[36563]: [0x14080a090]    0x600000aa2740  MPSMatrixDescriptor
objc[36563]: [0x14080a098]    0x600000aa2780  MPSMatrixDescriptor
objc[36563]: [0x14080a0a0]    0x600000a21040  MPSMatrixDescriptor
objc[36563]: [0x14080a0a8]       0x141005410  CaptureMTLDevice  autorelease count 6
objc[36563]: [0x14080a0b0]    0x600002d24510  AGXG13XFamilyComputeContext
objc[36563]: ##############

Iteration 1
objc[6186]: ##############
objc[6186]: AUTORELEASE POOLS for thread 0x203b9b240
objc[6186]: 20 releases pending.
objc[6186]: [0x12e00b000]  ................  PAGE  (hot) (cold)
objc[6186]: [0x12e00b038]       0x12d20c0f0  _NSSwiftProcessInfo
objc[6186]: [0x12e00b040]       0x12d304d20  Swift.__SwiftDeferredNSArray
objc[6186]: [0x12e00b048]       0x12d304f30  __NSCFCharacterSet
objc[6186]: [0x12e00b050]       0x12d3061e0  __NSCFString
objc[6186]: [0x12e00b058]       0x12c64cdf0  __NSCFString
objc[6186]: [0x12e00b060]       0x12c79d6b0  __NSCFString
objc[6186]: [0x12e00b068]  ################  POOL 0x12e00b068
objc[6186]: [0x12e00b070]       0x11c635370  __NSCFString
objc[6186]: [0x12e00b078]       0x141619730  MPSMatrixDescriptor
objc[6186]: [0x12e00b080]       0x1491eabe0  MPSMatrixDescriptor
objc[6186]: [0x12e00b088]       0x14911b550  MPSMatrixDescriptor
objc[6186]: [0x12e00b090]       0x1496f2b40  __NSCFString
objc[6186]: [0x12e00b098]       0x1491a0d80  __NSCFString
objc[6186]: [0x12e00b0a0]       0x13b718b50  __NSBundleTables
objc[6186]: [0x12e00b0a8]       0x12d33d8e0  NSBundle  autorelease count 3
objc[6186]: [0x12e00b0b0]       0x149152250  NSURL
objc[6186]: [0x12e00b0b8]       0x149111be0  __NSCFString
objc[6186]: [0x12e00b0c0]       0x14913f620  AGXG13XFamilyCommandBuffer
objc[6186]: [0x12e00b0c8]       0x14977a970  __NSArrayM
objc[6186]: [0x12e00b0d0]       0x14978b090  __NSArrayM
objc[6186]: ##############
Iteration 2
objc[6186]: ##############
objc[6186]: AUTORELEASE POOLS for thread 0x203b9b240
objc[6186]: 12 releases pending.
objc[6186]: [0x12e00b000]  ................  PAGE  (hot) (cold)
objc[6186]: [0x12e00b038]       0x12d20c0f0  _NSSwiftProcessInfo
objc[6186]: [0x12e00b040]       0x12d304d20  Swift.__SwiftDeferredNSArray
objc[6186]: [0x12e00b048]       0x12d304f30  __NSCFCharacterSet
objc[6186]: [0x12e00b050]       0x12d3061e0  __NSCFString
objc[6186]: [0x12e00b058]       0x12c64cdf0  __NSCFString
objc[6186]: [0x12e00b060]       0x12c79d6b0  __NSCFString
objc[6186]: [0x12e00b068]  ################  POOL 0x12e00b068
objc[6186]: [0x12e00b070]       0x12c7ff7d0  __NSCFString
objc[6186]: [0x12e00b078]       0x13b7c3da0  MPSMatrixDescriptor
objc[6186]: [0x12e00b080]       0x13b7fc3b0  MPSMatrixDescriptor
objc[6186]: [0x12e00b088]       0x13b714930  MPSMatrixDescriptor
objc[6186]: [0x12e00b090]       0x148cb99a0  AGXG13XFamilyCommandBuffer
objc[6186]: ##############

[NSAutoreleasePool showPools]

christiangnrd · 2024-09-26T01:21:34Z

> Apparently this looks like an ARC bug.

Are we using ARC in Julia?

tgymnich · 2024-09-26T07:27:17Z

We don’t use ARC, but the libraries we are using might have been compiled with ARC enabled.

christiangnrd · 2024-09-26T12:03:04Z

When I turned off ARC in Xcode for the objc version, even with the @autoreleasepool blocks the NaNs show up.

maleadt · 2024-09-26T13:08:38Z

> When I turned off ARC in XCode for the objc version, even with the @autoreleasepool blocks the NaNs show up.

AFAIU -fobjc-arc makes the compiler automatically insert release/retain/autorelease calls, and doesn't affect how precompiled libraries like MPS may behave.

christiangnrd · 2024-09-28T03:24:35Z

> > When I turned off ARC in XCode for the objc version, even with the @autoreleasepool blocks the NaNs show up.
>
> AFAIU -fobjc-arc make the compiler automatically insert release/retain/autorelease calls, and doesn't affect how precompiled libraries like MPS may behave.

That's my understanding too. However, from what I understand about our implementation of the @autoreleasepool macro, we're using an NSAutoreleasePool object and a [pool release]; statement at the end, which, according to the documentation, isn't possible with ARC on. By turning ARC off for the objc version, I was trying to reproduce the conditions of the failing Julia code.

The only thing is that I don't know if this information is actually helpful.

christiangnrd added the bug label Jul 4, 2024

maleadt changed the title ~~matrix multiplication not always synchronized~~ M1/M1: Large matrix multiplications can contain NaNs Jul 5, 2024

christiangnrd changed the title ~~M1/M1: Large matrix multiplications can contain NaNs~~ M1/M2: Large matrix multiplications can contain NaNs Jul 5, 2024

christiangnrd added the upstream Out of our hands label Jul 12, 2024

maleadt mentioned this issue Sep 21, 2024

Can't use gemm! methods with Metal #423

Closed

tgymnich removed the bug label Oct 18, 2024
