Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue]: Adding -O0 causes HIP error: the operation cannot be performed in the present state #126

Open
pvelesko opened this issue Jan 15, 2025 · 10 comments

Comments

@pvelesko
Copy link

Problem Description

hipcc -g pureHip.cpp -o pureHip && ./pureHip
Results:

Thread 0: 1000000000
Thread 1: 1000000001
Thread 2: 1000000002
Thread 3: 1000000003
Thread 4: 1000000004
Thread 5: 1000000005
Thread 6: 1000000006
Thread 7: 1000000007
Thread 8: 1000000008
Thread 9: 1000000009
Thread 10: 1000000010
Thread 11: 1000000011
Thread 12: 1000000012
Thread 13: 1000000013
Thread 14: 1000000014
Thread 15: 1000000015
Thread 16: 1000000016
Thread 17: 1000000017
Thread 18: 1000000018
Thread 19: 1000000019
Thread 20: 1000000020
Thread 21: 1000000021
Thread 22: 1000000022
Thread 23: 1000000023
Thread 24: 1000000024
Thread 25: 1000000025
Thread 26: 1000000026
Thread 27: 1000000027
Thread 28: 1000000028
Thread 29: 1000000029
Thread 30: 1000000030
Thread 31: 1000000031
Thread 32: 1000000032
Thread 33: 1000000033
Thread 34: 1000000034
Thread 35: 1000000035
Thread 36: 1000000036
Thread 37: 1000000037
Thread 38: 1000000038
Thread 39: 1000000039
Thread 40: 1000000040
Thread 41: 1000000041
Thread 42: 1000000042
Thread 43: 1000000043
Thread 44: 1000000044
Thread 45: 1000000045
Thread 46: 1000000046
Thread 47: 1000000047
Thread 48: 1000000048
Thread 49: 1000000049
Thread 50: 1000000050
Thread 51: 1000000051
Thread 52: 1000000052
Thread 53: 1000000053
Thread 54: 1000000054
Thread 55: 1000000055
Thread 56: 1000000056
Thread 57: 1000000057
Thread 58: 1000000058
Thread 59: 1000000059
Thread 60: 1000000060
Thread 61: 1000000061
Thread 62: 1000000062
Thread 63: 1000000063
╭─pvelesko@cupcake ~/chipStar/openmm-fixes/testPermute ‹master●› 
╰─$ 
╭─pvelesko@cupcake ~/chipStar/openmm-fixes/testPermute ‹master●› 
╰─$ hipcc -g -O0 pureHip.cpp -o pureHip && ./pureHip                                                                                                                                                                      130 ↵
HIP error: the operation cannot be performed in the present state at pureHip.cpp:54

Source code:

#include <hip/hip_runtime.h>
#include <iostream>
#include <vector>
#include <fstream>

#define HIP_CHECK(cmd) \
    do { \
        hipError_t error = cmd; \
        if (error != hipSuccess) { \
            std::cerr << "HIP error: " << hipGetErrorString(error) << " at " << __FILE__ << ":" << __LINE__ << "\n"; \
            exit(1); \
        } \
    } while(0)

__global__ void hip_bpermute(const int* inData, int* outData) {
    int tid = threadIdx.x;
    int lane = tid & 63;  // Get lane ID within wavefront
    
    // Load the value this thread will share
    int src_data = inData[tid];
    
    // The byte offset is lane * 4 (each int is 4 bytes)
    int src_lane = lane * 4;
    
    // Call the builtin with byte offset and source data
    int result = __builtin_amdgcn_ds_bpermute(src_lane, src_data);
    outData[tid] = result;
}

int main() {
    const int globalSize = 64;

    // Prepare input data
    std::vector<int> inData(globalSize);
    for (int i = 0; i < globalSize; ++i) {
        inData[i] = 1000000000 + i;  // Fill with some pattern
    }

    // Array for results
    std::vector<int> outData(globalSize);

    int *d_inData, *d_outData;
    HIP_CHECK(hipMalloc(&d_inData, globalSize * sizeof(int)));
    HIP_CHECK(hipMalloc(&d_outData, globalSize * sizeof(int)));
    
    HIP_CHECK(hipMemcpy(d_inData, inData.data(), globalSize * sizeof(int), hipMemcpyHostToDevice));
    
    HIP_CHECK(hipDeviceSynchronize());
    hipLaunchKernelGGL(hip_bpermute, 
                       dim3(1), 
                       dim3(globalSize), 
                       0, 0,
                       d_inData, d_outData);
    HIP_CHECK(hipGetLastError());
    HIP_CHECK(hipDeviceSynchronize());
    
    HIP_CHECK(hipMemcpy(outData.data(), d_outData, globalSize * sizeof(int), hipMemcpyDeviceToHost));
    
    HIP_CHECK(hipFree(d_inData));
    HIP_CHECK(hipFree(d_outData));

    // Print results
    std::cout << "Results:\n\n";
    for (int i = 0; i < globalSize; ++i) {
        std::cout << "Thread " << i << ": " << outData[i] << "\n";
    }

    return 0;
}
╰─$ rocminfo                                                                                                                                                                                                              130 ↵
ROCk module version 6.8.5 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    13th Gen Intel(R) Core(TM) i9-13900K
  Uuid:                    CPU-XX                             
  Marketing Name:          13th Gen Intel(R) Core(TM) i9-13900K
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3000                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    65597496(0x3e8f038) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65597496(0x3e8f038) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    65597496(0x3e8f038) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx906                             
  Uuid:                    GPU-a5de794172dc768b               
  Marketing Name:          AMD Radeon VII                     
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 26287(0x66af)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1801                               
  BDFID:                   2560                               
  Internal Node ID:        1                                  
  Compute Unit:            60                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 472                                
  SDMA engine uCode::      145                                
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Operating System

Ubuntu 22.04

CPU

Intel i9

GPU

Vega 20

ROCm Version

ROCm 6.3.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@cjatin
Copy link
Contributor

cjatin commented Jan 15, 2025

I think this might be a compiler issue. But just to be sure, can you share the logs by setting the environment variable:
AMD_LOG_LEVEL=7 ./a.out

@pvelesko
Copy link
Author

╰─$ AMD_LOG_LEVEL=7 ./pureHip                                                                                                                                                                                             130 ↵
:3:rocdevice.cpp            :468 : 14902569084 us: [pid:32899 tid:0x71038c239180] Initializing HSA stack.
:3:rocdevice.cpp            :554 : 14902584448 us: [pid:32899 tid:0x71038c239180] Enumerated GPU agents = 1
:3:rocdevice.cpp            :232 : 14902584485 us: [pid:32899 tid:0x71038c239180] Numa selects cpu agent[0]=0x31c6f450(fine=0x31c6f670,coarse=0x31c702b0) for gpu agent=0x31c71680 CPU<->GPU XGMI=0
:3:rocsettings.cpp          :290 : 14902584494 us: [pid:32899 tid:0x71038c239180] Using dev kernel arg wa = 0
:3:comgrctx.cpp             :33  : 14902584498 us: [pid:32899 tid:0x71038c239180] Loading COMGR library.
:3:comgrctx.cpp             :126 : 14902584531 us: [pid:32899 tid:0x71038c239180] Loaded COMGR library version 2.8.
:3:rocdevice.cpp            :1809: 14902584865 us: [pid:32899 tid:0x71038c239180] Gfx Major/Minor/Stepping: 9/0/6
:3:rocdevice.cpp            :1811: 14902584869 us: [pid:32899 tid:0x71038c239180] HMM support: 1, XNACK: 0, Direct host access: 0
:3:rocdevice.cpp            :1813: 14902584872 us: [pid:32899 tid:0x71038c239180] Max SDMA Read Mask: 0x3, Max SDMA Write Mask: 0x3
:4:rocdevice.cpp            :2221: 14902585122 us: [pid:32899 tid:0x71038c239180] Allocate hsa host memory 0x710280d00000, size 0x101000, numa_node = 0
:4:rocdevice.cpp            :2221: 14902585398 us: [pid:32899 tid:0x71038c239180] Allocate hsa host memory 0x710280b00000, size 0x101000, numa_node = 0
:4:rocdevice.cpp            :2221: 14902586010 us: [pid:32899 tid:0x71038c239180] Allocate hsa host memory 0x710280600000, size 0x400000, numa_node = 0
:4:rocdevice.cpp            :2221: 14902586040 us: [pid:32899 tid:0x71038c239180] Allocate hsa host memory 0x71038dd3c000, size 0x38, numa_node = 0
:4:runtime.cpp              :85  : 14902586046 us: [pid:32899 tid:0x71038c239180] init
:3:hip_context.cpp          :49  : 14902586048 us: [pid:32899 tid:0x71038c239180] Direct Dispatch: 1
:3:hip_memory.cpp           :615 : 14902586820 us: [pid:32899 tid:0x71038c239180]  hipMalloc ( 0x7fff99ac8430, 256 ) 
:4:rocdevice.cpp            :2379: 14902586849 us: [pid:32899 tid:0x71038c239180] Allocate hsa device memory 0x71027bc00000, size 0x100
:3:rocdevice.cpp            :2418: 14902586851 us: [pid:32899 tid:0x71038c239180] Device=0x31c918f0, freeMem_ = 0x3feffff00
:3:hip_memory.cpp           :617 : 14902586859 us: [pid:32899 tid:0x71038c239180] hipMalloc: Returned hipSuccess : 0x71027bc00000: duration: 39 us
:3:hip_memory.cpp           :615 : 14902586863 us: [pid:32899 tid:0x71038c239180]  hipMalloc ( 0x7fff99ac8428, 256 ) 
:4:rocdevice.cpp            :2379: 14902586867 us: [pid:32899 tid:0x71038c239180] Allocate hsa device memory 0x71027bc01000, size 0x100
:3:rocdevice.cpp            :2418: 14902586869 us: [pid:32899 tid:0x71038c239180] Device=0x31c918f0, freeMem_ = 0x3fefffe00
:3:hip_memory.cpp           :617 : 14902586872 us: [pid:32899 tid:0x71038c239180] hipMalloc: Returned hipSuccess : 0x71027bc01000: duration: 9 us
:3:hip_memory.cpp           :690 : 14902586880 us: [pid:32899 tid:0x71038c239180]  hipMemcpy ( 0x71027bc00000, 0x31b8e230, 256, hipMemcpyHostToDevice ) 
:3:rocdevice.cpp            :3026: 14902586886 us: [pid:32899 tid:0x71038c239180] Number of allocated hardware queues with low priority: 0, with normal priority: 0, with high priority: 0, maximum per priority is: 4
:3:rocdevice.cpp            :3104: 14902593357 us: [pid:32899 tid:0x71038c239180] Created SWq=0x71038bf12000 to map on HWq=0x710280300000 with size 16384 with priority 1, cooperative: 0
:3:rocdevice.cpp            :3197: 14902593371 us: [pid:32899 tid:0x71038c239180] acquireQueue refCount: 0x710280300000 (1)
:4:rocdevice.cpp            :2221: 14902593581 us: [pid:32899 tid:0x71038c239180] Allocate hsa host memory 0x710278300000, size 0x100000, numa_node = 0
:3:devprogram.cpp           :2648: 14902821492 us: [pid:32899 tid:0x71038c239180] Using Code Object V5.
:4:command.cpp              :347 : 14902822591 us: [pid:32899 tid:0x71038c239180] Command (CopyHostToDevice) enqueued: 0x31d0a1f0
:4:rocblit.cpp              :832 : 14902823022 us: [pid:32899 tid:0x71038c239180] HSA Async Copy staged H2D dst=0x71027bc00000, src=0x710280d00000, size=256, completion_signal=0x7103827ff700
:4:rocvirtual.cpp           :571 : 14902823027 us: [pid:32899 tid:0x71038c239180] Host wait on completion_signal=0x7103827ff700
:3:rocvirtual.hpp           :66  : 14902823029 us: [pid:32899 tid:0x71038c239180] Host active wait for Signal = (0x7103827ff700) for -1 ns
:4:command.cpp              :287 : 14902823090 us: [pid:32899 tid:0x71038c239180] Queue marker to command queue: 0x31b85aa0
:4:command.cpp              :347 : 14902823093 us: [pid:32899 tid:0x71038c239180] Command (InternalMarker) enqueued: 0x31fd9c70
:4:command.cpp              :177 : 14902823095 us: [pid:32899 tid:0x71038c239180] Command 0x31d0a1f0 complete
:4:command.cpp              :171 : 14902823098 us: [pid:32899 tid:0x71038c239180] Command 0x31fd9c70 complete (Wall: 14902823098, CPU: 0, GPU: 0 us)
:4:command.cpp              :251 : 14902823101 us: [pid:32899 tid:0x71038c239180] Waiting for event 0x31d0a1f0 to complete, current status 0
:4:command.cpp              :266 : 14902823104 us: [pid:32899 tid:0x71038c239180] Event 0x31d0a1f0 wait completed
:3:hip_memory.cpp           :691 : 14902823108 us: [pid:32899 tid:0x71038c239180] hipMemcpy: Returned hipSuccess : : duration: 236228 us
:3:hip_device_runtime.cpp   :620 : 14902823114 us: [pid:32899 tid:0x71038c239180]  hipDeviceSynchronize (  ) 
:4:commandqueue.cpp         :147 : 14902823117 us: [pid:32899 tid:0x71038c239180] HW Event not ready, awaiting completion instead
:4:commandqueue.cpp         :163 : 14902823119 us: [pid:32899 tid:0x71038c239180] All commands finished
:3:hip_device_runtime.cpp   :624 : 14902823122 us: [pid:32899 tid:0x71038c239180] hipDeviceSynchronize: Returned hipSuccess : 
:3:hip_platform.cpp         :225 : 14902823129 us: [pid:32899 tid:0x71038c239180]  __hipPushCallConfiguration ( {1,1,1}, {64,1,1}, 0, stream:<null> ) 
:3:hip_platform.cpp         :229 : 14902823133 us: [pid:32899 tid:0x71038c239180] __hipPushCallConfiguration: Returned hipSuccess : 
:3:hip_platform.cpp         :234 : 14902823144 us: [pid:32899 tid:0x71038c239180]  __hipPopCallConfiguration ( {2578220456,32767,2180160}, {64,1,0}, 0x7fff99ac80f8, 0x7fff99ac80f0 ) 
:3:hip_platform.cpp         :243 : 14902823147 us: [pid:32899 tid:0x71038c239180] __hipPopCallConfiguration: Returned hipSuccess : 
:3:hip_module.cpp           :677 : 14902823796 us: [pid:32899 tid:0x71038c239180]  hipLaunchKernel ( 0x200e10, {1,1,1}, {64,1,1}, 0x7fff99ac80c0, 0, stream:<null> ) 
:3:devprogram.cpp           :2648: 14902823897 us: [pid:32899 tid:0x71038c239180] Using Code Object V5.
:4:command.cpp              :347 : 14902824195 us: [pid:32899 tid:0x71038c239180] Command (KernelExecution) enqueued: 0x31cf8060
:3:rocvirtual.cpp           :724 : 14902824199 us: [pid:32899 tid:0x71038c239180] Arg0:   = ptr:0x71027bc00000 obj:[0x71027bc00000-0x71027bc00100]
:3:rocvirtual.cpp           :724 : 14902824201 us: [pid:32899 tid:0x71038c239180] Arg1:   = ptr:0x71027bc01000 obj:[0x71027bc01000-0x71027bc01100]
:3:rocvirtual.cpp           :3028: 14902824203 us: [pid:32899 tid:0x71038c239180] ShaderName : _Z12hip_bpermutePKiPi
:1:rocvirtual.cpp           :3075: 14902824206 us: [pid:32899 tid:0x71038c239180] Pcie atomics not enabled, hostcall not supported
:1:rocvirtual.cpp           :3432: 14902824208 us: [pid:32899 tid:0x71038c239180] AQL dispatch failed!
:4:command.cpp              :177 : 14902824209 us: [pid:32899 tid:0x71038c239180] Command 0x31cf8060 complete
:3:hip_module.cpp           :678 : 14902824212 us: [pid:32899 tid:0x71038c239180] hipLaunchKernel: Returned hipErrorIllegalState : 
:3:hip_error.cpp            :36  : 14902824216 us: [pid:32899 tid:0x71038c239180]  hipGetLastError (  ) 
HIP error: the operation cannot be performed in the present state at pureHip.cpp:54
:3:hip_device_runtime.cpp   :620 : 14902824270 us: [pid:32899 tid:0x71038c239180]  hipDeviceSynchronize (  ) 
:4:commandqueue.cpp         :147 : 14902824273 us: [pid:32899 tid:0x71038c239180] HW Event not ready, awaiting completion instead
:4:commandqueue.cpp         :163 : 14902824279 us: [pid:32899 tid:0x71038c239180] All commands finished
:3:hip_device_runtime.cpp   :624 : 14902824281 us: [pid:32899 tid:0x71038c239180] hipDeviceSynchronize: Returned hipSuccess : 
:1:hip_fatbin.cpp           :91  : 14902824287 us: [pid:32899 tid:0x71038c239180] All Unique FDs are closed
:4:command.cpp              :347 : 14902824360 us: [pid:32899 tid:0x71038c239180] Command (Marker) enqueued: 0x31fd9c70
:3:rocvirtual.cpp           :475 : 14902824374 us: [pid:32899 tid:0x71038c239180] Set Handler: handle(0x7103827ff680), timestamp(0x31ff7c20)
:4:rocvirtual.cpp           :1076: 14902824379 us: [pid:32899 tid:0x71038c239180] SWq=0x71038bf12000, HWq=0x710280300000, id=1, BarrierAND Header = 0x1503 (type=3, barrier=1, acquire=2, release=2), dep_signal=[0x0, 0x0, 0x0, 0x0, 0x0], completion_signal=0x7103827ff680
:4:command.cpp              :251 : 14902824382 us: [pid:32899 tid:0x71038c239180] Waiting for event 0x31fd9c70 to complete, current status 2
:3:rocvirtual.cpp           :222 : 14902824501 us: [pid:32899 tid:0x71027a400640] Handler: value(0), timestamp(0x32106140), handle(0x7103827ff680)
:1:rocvirtual.cpp           :1586: 14902824515 us: [pid:32899 tid:0x71027a400640] Unexpected command status - -59.
:4:command.cpp              :266 : 14902824644 us: [pid:32899 tid:0x71038c239180] Event 0x31fd9c70 wait completed
:4:command.cpp              :171 : 14902824645 us: [pid:32899 tid:0x71027a400640] Command 0x31fd9c70 complete (Wall: 14902824643, CPU: 0, GPU: 276 us)
:4:rocdevice.cpp            :2395: 14902824734 us: [pid:32899 tid:0x71038c239180] Free hsa memory 0x710278300000
:4:rocdevice.cpp            :2395: 14902824737 us: [pid:32899 tid:0x71038c239180] Free hsa memory (nil)
:3:rocdevice.cpp            :3209: 14902824740 us: [pid:32899 tid:0x71038c239180] releaseQueue refCount:0x710280300000 (0)
:4:runtime.cpp              :93  : 14902824794 us: [pid:32899 tid:0x71038c239180] tearDown
:4:rocdevice.cpp            :2395: 14902824838 us: [pid:32899 tid:0x71038c239180] Free hsa memory 0x710280600000
:4:rocdevice.cpp            :2395: 14902824865 us: [pid:32899 tid:0x71038c239180] Free hsa memory 0x71038dd3c000
:3:rocdevice.cpp            :285 : 14902824868 us: [pid:32899 tid:0x71038c239180] Deleting hardware queue 0x710280300000 with refCount 0
:4:rocdevice.cpp            :2395: 14902825893 us: [pid:32899 tid:0x71038c239180] Free hsa memory 0x710280b00000
:4:rocdevice.cpp            :2395: 14902825954 us: [pid:32899 tid:0x71038c239180] Free hsa memory 0x710280d00000

@cjatin
Copy link
Contributor

cjatin commented Jan 15, 2025

Ohh:

:1:rocvirtual.cpp           :3075: 14902824206 us: [pid:32899 tid:0x71038c239180] Pcie atomics not enabled, hostcall not supported
:1:rocvirtual.cpp           :3432: 14902824208 us: [pid:32899 tid:0x71038c239180] AQL dispatch failed!

I think you might need to enable largebar. It should be in your bios, something like 4G Decode and resize bar support

@pvelesko
Copy link
Author

resize bar is on + doesn't explain why optimization is affecting this?

@cjatin
Copy link
Contributor

cjatin commented Jan 15, 2025

I think, due to inclusion of hip_assert.h header which has hostcall prints in functions like __assert_fail
With optimizations that host call gets removed but lingers around when -O0 is used.

@pvelesko
Copy link
Author

More issues - this time failing tests without any error:

╭─pvelesko@cupcake ~/computeNonbonded-debug ‹master●› 
╰─$ /usr/bin/hipcc ./warp_test.cpp -w -O1 -o warp_test-amd && ./warp_test-amd                                                                                                                                                                                          130 ↵
Launching kernel with grid=4, block=64

Results for first few threads:
GlobalIdx       Value
0               0
1               1
2               6
3               18
4               40
5               75
6               126
7               196

╭─pvelesko@cupcake ~/computeNonbonded-debug ‹master●› 
╰─$ /usr/bin/hipcc ./warp_test.cpp -w -O0 -o warp_test-amd && ./warp_test-amd                                                                                                                                130 ↵
Launching kernel with grid=4, block=64

Results for first few threads:
GlobalIdx       Value
0               0
1               0
2               0
3               0
4               0
5               0
6               0
7               0
#include <hip/hip_runtime.h>
#include <stdio.h>

__global__ void testWarpCalc(int* debug) {
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int globalIdx = bid * blockDim.x + tid;
    
    // Do some computation to prevent optimization
    int result = 0;
    for(int i = 0; i < tid + 1; i++) {
        result += i * globalIdx;
    }
    
    // Store using atomic operation
    atomicExch(&debug[globalIdx], result);
}

int main() {
    const int gridSize = 4;
    const int blockSize = 64;
    const int numThreads = gridSize * blockSize;

    // Allocate pinned memory
    int* h_debug;
    hipHostMalloc(&h_debug, numThreads * sizeof(int));
    memset(h_debug, 0, numThreads * sizeof(int));

    // Allocate device memory
    int* d_debug;
    hipMalloc(&d_debug, numThreads * sizeof(int));
    hipMemset(d_debug, 0, numThreads * sizeof(int));

    dim3 grid(gridSize);
    dim3 block(blockSize);
    
    printf("Launching kernel with grid=%d, block=%d\n\n", gridSize, blockSize);
    
    // Use triple angle bracket syntax
    testWarpCalc<<<grid, block>>>(d_debug);
    hipDeviceSynchronize();
    
    // Copy results back
    hipMemcpy(h_debug, d_debug, numThreads * sizeof(int), hipMemcpyDeviceToHost);
    
    printf("Results for first few threads:\n");
    printf("GlobalIdx\tValue\n");
    
    // Print first few entries
    for (int i = 0; i < 8; i++) {
        printf("%d\t\t%d\n", i, h_debug[i]);
    }

    // Cleanup
    hipHostFree(h_debug);
    hipFree(d_debug);
    
    return 0;
} %   

@cjatin
Copy link
Contributor

cjatin commented Jan 17, 2025

I think the issue is the same, with -O0, the lingering __assert_fail which requires host call prints, which rely on PCI-e atomics (which are disabled on the system). So if you check the results with AMD_LOG_LEVEL=7 you should see similar failure to launch kernel in the example above.

@pvelesko
Copy link
Author

Ok yeah I can confirm that atomics are disabled - this is a BIOS setting, correct?

@cjatin
Copy link
Contributor

cjatin commented Jan 23, 2025

I think so, its mostly system configuration setting.
There is one mentioned for a server class system here: https://rocm.docs.amd.com/en/docs-6.0.0/conceptual/More-about-how-ROCm-uses-PCIe-Atomics.html#bar-memory-overview

But I guess that does not apply to you. You might need to look up your motherboard bios settings.

@pvelesko
Copy link
Author

pvelesko commented Jan 23, 2025 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants