Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue]: using hipMemcpy leading to a SIGSEGV crash #116

Closed
DarkMatter-999 opened this issue Dec 23, 2024 · 1 comment
Closed

[Issue]: using hipMemcpy leading to a SIGSEGV crash #116

DarkMatter-999 opened this issue Dec 23, 2024 · 1 comment

Comments

@DarkMatter-999
Copy link

Problem Description

When executing a program compiled with hipcc it crashes after attempting to call hipMemcpy

$ hipcc test.cu
$ AMD_LOG_LEVEL=7  ./a.out
Max size: 4096
:3:rocdevice.cpp            :468 : 0865319405 us: [pid:3017  tid:0x7782cc1ff180] Initializing HSA stack.
:3:rocdevice.cpp            :554 : 0865326945 us: [pid:3017  tid:0x7782cc1ff180] Enumerated GPU agents = 1
:3:rocdevice.cpp            :232 : 0865326984 us: [pid:3017  tid:0x7782cc1ff180] Numa selects cpu agent[0]=0x55e59ce8df10(fine=0x55e59ce89ac0,coarse=0x55e59ce89fe0) for gpu agent=0x55e59ce8a880 CPU<->GPU XGMI=0
:3:rocsettings.cpp          :290 : 0865326992 us: [pid:3017  tid:0x7782cc1ff180] Using dev kernel arg wa = 0
:3:comgrctx.cpp             :33  : 0865326997 us: [pid:3017  tid:0x7782cc1ff180] Loading COMGR library.
:3:comgrctx.cpp             :126 : 0865327028 us: [pid:3017  tid:0x7782cc1ff180] Loaded COMGR library version 2.8.
:3:rocdevice.cpp            :1809: 0865327202 us: [pid:3017  tid:0x7782cc1ff180] Gfx Major/Minor/Stepping: 8/0/3
:3:rocdevice.cpp            :1811: 0865327209 us: [pid:3017  tid:0x7782cc1ff180] HMM support: 0, XNACK: 0, Direct host access: 0
:3:rocdevice.cpp            :1813: 0865327214 us: [pid:3017  tid:0x7782cc1ff180] Max SDMA Read Mask: 0x3, Max SDMA Write Mask: 0x3
:4:rocdevice.cpp            :2221: 0865327682 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa host memory 0x5003ff000, size 0x101000, numa_node = 0
:4:rocdevice.cpp            :2221: 0865328294 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa host memory 0x5006ff000, size 0x101000, numa_node = 0
:4:rocdevice.cpp            :2221: 0865329215 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa host memory 0x500c00000, size 0x400000, numa_node = 0
:4:rocdevice.cpp            :2221: 0865329377 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa host memory 0x204000, size 0x38, numa_node = 0
:4:runtime.cpp              :85  : 0865329459 us: [pid:3017  tid:0x7782cc1ff180] init
:3:hip_context.cpp          :49  : 0865329465 us: [pid:3017  tid:0x7782cc1ff180] Direct Dispatch: 1
:3:hip_event.cpp            :333 : 0865329614 us: [pid:3017  tid:0x7782cc1ff180]  hipEventCreate ( 0x7fff7f99f7e0 )
:3:hip_event.cpp            :339 : 0865329630 us: [pid:3017  tid:0x7782cc1ff180] hipEventCreate: Returned hipSuccess : event:0x55e59cebf370
:3:hip_event.cpp            :333 : 0865329636 us: [pid:3017  tid:0x7782cc1ff180]  hipEventCreate ( 0x7fff7f99f7d8 )
:3:hip_event.cpp            :339 : 0865329642 us: [pid:3017  tid:0x7782cc1ff180] hipEventCreate: Returned hipSuccess : event:0x55e59cebf640
:3:hip_memory.cpp           :615 : 0867968592 us: [pid:3017  tid:0x7782cc1ff180]  hipMalloc ( 0x7fff7f99f768, 67108864 )
:4:rocdevice.cpp            :2379: 0867969348 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa device memory 0x501400000, size 0x4000000
:3:rocdevice.cpp            :2418: 0867969356 us: [pid:3017  tid:0x7782cc1ff180] Device=0x55e59ceb7330, freeMem_ = 0xfc000000
:3:hip_memory.cpp           :617 : 0867969368 us: [pid:3017  tid:0x7782cc1ff180] hipMalloc: Returned hipSuccess : 0x501400000: duration: 776 us
:3:hip_memory.cpp           :615 : 0867969374 us: [pid:3017  tid:0x7782cc1ff180]  hipMalloc ( 0x7fff7f99f760, 67108864 )
:4:rocdevice.cpp            :2379: 0867969806 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa device memory 0x505800000, size 0x4000000
:3:rocdevice.cpp            :2418: 0867969814 us: [pid:3017  tid:0x7782cc1ff180] Device=0x55e59ceb7330, freeMem_ = 0xf8000000
:3:hip_memory.cpp           :617 : 0867969818 us: [pid:3017  tid:0x7782cc1ff180] hipMalloc: Returned hipSuccess : 0x505800000: duration: 444 us
:3:hip_memory.cpp           :615 : 0867969826 us: [pid:3017  tid:0x7782cc1ff180]  hipMalloc ( 0x7fff7f99f710, 67108864 )
:4:rocdevice.cpp            :2379: 0867970266 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa device memory 0x509c00000, size 0x4000000
:3:rocdevice.cpp            :2418: 0867970274 us: [pid:3017  tid:0x7782cc1ff180] Device=0x55e59ceb7330, freeMem_ = 0xf4000000
:3:hip_memory.cpp           :617 : 0867970279 us: [pid:3017  tid:0x7782cc1ff180] hipMalloc: Returned hipSuccess : 0x509c00000: duration: 453 us
:3:hip_memory.cpp           :615 : 0867970285 us: [pid:3017  tid:0x7782cc1ff180]  hipMalloc ( 0x7fff7f99f7d0, 67108864 )
:4:rocdevice.cpp            :2379: 0867970647 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa device memory 0x50e000000, size 0x4000000
:3:rocdevice.cpp            :2418: 0867970654 us: [pid:3017  tid:0x7782cc1ff180] Device=0x55e59ceb7330, freeMem_ = 0xf0000000
:3:hip_memory.cpp           :617 : 0867970658 us: [pid:3017  tid:0x7782cc1ff180] hipMalloc: Returned hipSuccess : 0x50e000000: duration: 373 us
:3:hip_memory.cpp           :690 : 0867970676 us: [pid:3017  tid:0x7782cc1ff180]  hipMemcpy ( 0x501400000, 0x7782affff010, 67108864, hipMemcpyHostToDevice )
:3:rocdevice.cpp            :3026: 0867970693 us: [pid:3017  tid:0x7782cc1ff180] Number of allocated hardware queues with low priority: 0, with normal priority: 0, with high priority: 0, maximum per priority is: 4
:4:command.cpp              :347 : 0868002516 us: [pid:3017  tid:0x7782cc1ff180] Command (CopyHostToDevice) enqueued: 0x55e59cec0ef0
fish: Job 1, 'AMD_LOG_LEVEL=7  ./a.out' terminated by signal SIGSEGV (Address boundary error)

$ sudo dmesg | tail
[  867.972499] a.out[3017]: segfault at 18 ip 00007782ccbca89d sp 00007fff7f99ef40 error 4 in libamdhip64.so.6.2.41134-65d174c3e[3ca89d,7782cc824000+430000] likely on CPU 10 (core 5, socket 0)
[  867.972514] Code: 00 31 f6 48 83 c5 08 e8 31 04 00 00 49 39 ec 75 ec 48 8b 83 48 01 00 00 48 8b a8 a0 01 00 00 48 8b 05 f7 64 1c 00 64 48 8b 10 <48> 8b 45 18 4c 8d 65 18 a8 01 0f 85 9b 03 00 00 48 89 c1 48 83 c9
[ 1561.179983] a.out[3682]: segfault at 18 ip 00007a57651b0f3d sp 00007ffce2aba720 error 4 in libamdhip64.so.6.1.40093-bd86f1708[3b0f3d,7a5764e23000+416000] likely on CPU 11 (core 6, socket 0)
[ 1561.179998] Code: 00 31 f6 48 83 c5 08 e8 61 04 00 00 49 39 ec 75 ec 48 8b 83 48 01 00 00 48 8b a8 a0 01 00 00 48 8b 05 67 de 1b 00 64 48 8b 10 <48> 8b 45 18 4c 8d 65 18 a8 01 0f 85 9b 03 00 00 48 89 c1 48 83 c9

The same program when run on gfx90c runs successfully.

I attempted to try to use the patch at #97 but it still doesnt work.

I know gfx803 is long deprecated but thats what I have at the moment, also I have successfully used the same device earlier(3-4 months) to run similar code which compiled and worked right away then.

This is the program I was running, it fails at line:78 at hipMemcpy

#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))

#define hipCheck(ans) { hipAssert((ans), __FILE__, __LINE__); }
inline void hipAssert(hipError_t code, const char *file, int line, bool abort = true) {
    if (code != hipSuccess) {
        printf("HIP error: %s in file %s at line %d\n", hipGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

__global__ void sgemm_naive(int M, int N, int K, float alpha, const float *A,
                            const float *B, float beta, float *C) {
  const uint x = blockIdx.x * blockDim.x + threadIdx.x;
  const uint y = blockIdx.y * blockDim.y + threadIdx.y;

  // if statement is necessary to make things work under tile quantization
  if (x < M && y < N) {
    float tmp = 0.0;
    for (int i = 0; i < K; ++i) {
      tmp += A[x * K + i] * B[i * N + y];
    }
    // C = α*(A@B)+β*C
    C[x * N + y] = alpha * tmp + beta * C[x * N + y];
  }
}
void run_sgemm_naive(int M, int N, int K, float alpha, float *A, float *B,
                     float beta, float *C) {
  dim3 gridDim(CEIL_DIV(M, 32), CEIL_DIV(N, 32));
  dim3 blockDim(32, 32);
  sgemm_naive<<<gridDim, blockDim>>>(M, N, K, alpha, A, B, beta, C);
}

void randomize_matrix(float *mat, int N) {
	for (int i = 0; i < N; i++) {
		float tmp = (float)(rand() % 5) + 0.01 * (rand() % 5);
		tmp = (rand() % 2 == 0) ? tmp : tmp * (-1.);
		mat[i] = tmp;
	}
}

int main() {
	int N = 6;
    int SIZE[] = {128, 256, 512, 1024, 2048, 4096};

	long m, n, k, max_size;
	max_size = SIZE[N - 1];
	printf("Max size: %ld\n", max_size);

	float alpha = 0.5, beta = 3.0; // GEMM input parameters, C=α*AB+β*C

	float *A = nullptr, *B = nullptr, *C = nullptr, *C_ref = nullptr; // host matrices
	float *dA = nullptr, *dB = nullptr, *dC = nullptr, *dC_ref = nullptr; // device matrices

	float elapsed_time;
	hipEvent_t beg, end;
	hipCheck(hipEventCreate(&beg));
	hipCheck(hipEventCreate(&end));

	A = (float *)malloc(sizeof(float) * max_size * max_size);
	B = (float *)malloc(sizeof(float) * max_size * max_size);
	C = (float *)malloc(sizeof(float) * max_size * max_size);
	C_ref = (float *)malloc(sizeof(float) * max_size * max_size);


	randomize_matrix(A, max_size * max_size);
	randomize_matrix(B, max_size * max_size);
	randomize_matrix(C, max_size * max_size);

	hipCheck(hipMalloc(&dA, sizeof(float) * max_size * max_size));
	hipCheck(hipMalloc(&dB, sizeof(float) * max_size * max_size));
	hipCheck(hipMalloc(&dC, sizeof(float) * max_size * max_size));
	hipCheck(hipMalloc(&dC_ref, sizeof(float) * max_size * max_size));

	hipCheck(hipMemcpy(dA, A, sizeof(float) * max_size * max_size, hipMemcpyHostToDevice));
	hipCheck(hipMemcpy(dB, B, sizeof(float) * max_size * max_size, hipMemcpyHostToDevice));
	hipCheck(hipMemcpy(dC, C, sizeof(float) * max_size * max_size, hipMemcpyHostToDevice));
	hipCheck(hipMemcpy(dC_ref, C_ref, sizeof(float) * max_size * max_size, hipMemcpyHostToDevice));

	int repeat_times = 5;
	for (int s=0; s<N; s++) {
		int size = SIZE[s];
		m = n = k = size;
		
		printf("dimensions(m=n=k) %ld, alpha: %f, beta: %f\n", m, alpha, beta);
		
		hipCheck(hipEventRecord(beg));

		for (int j = 0; j < repeat_times; j++) {
			// We don't reset dC between runs to save time
			run_sgemm_naive(m, n, k, alpha, dA, dB, beta, dC);
		}

		hipCheck(hipEventRecord(end));
		hipCheck(hipEventSynchronize(beg));
		hipCheck(hipEventSynchronize(end));
		hipCheck(hipEventElapsedTime(&elapsed_time, beg, end));
		elapsed_time /= 1000.; // Convert to seconds

		long flops = 2 * m * n * k;
		printf(
        "Average elapsed time: (%7.6f) s, performance: (%7.1f) GFLOPS. size: "
        "(%ld).\n",
        elapsed_time / repeat_times,
        (repeat_times * flops * 1e-9) / elapsed_time, m);
	    fflush(stdout);

		hipCheck(hipMemcpy(dC, dC_ref, sizeof(float) * m * n, hipMemcpyDeviceToDevice));

	}

	free(A);
	free(B);
	free(C);
	free(C_ref);
	hipCheck(hipFree(dA));
	hipCheck(hipFree(dB));
	hipCheck(hipFree(dC));
	hipCheck(hipFree(dC_ref));

    return 0;
}

Operating System

Arch Linux

CPU

AMD Ryzen 5 1600 Six-Core Processor

GPU

AMD Radeon RX 570 Series

ROCm Version

ROCm 6.1.0

ROCm Component

clr

Steps to Reproduce

  1. Compile a program with hipMemcpy using hipcc
	hipCheck(hipMemcpy(dA, A, sizeof(float) * max_size * max_size, hipMemcpyHostToDevice));
  1. Execute the program

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded

HSA System Attributes

Runtime Version: 1.1
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents


Agent 1


Name: AMD Ryzen 5 1600 Six-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 5 1600 Six-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3200
BDFID: 0
Internal Node ID: 0
Compute Unit: 12
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 16322532(0xf90fe4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 16322532(0xf90fe4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16322532(0xf90fe4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx803
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 570 Series
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26591(0x67df)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1244
BDFID: 10496
Internal Node ID: 1
Compute Unit: 32
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 730
SDMA engine uCode:: 58
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 4194304(0x400000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 4194304(0x400000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx803
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

Additional Information

No response

@jamesxu2
Copy link

Hi @DarkMatter-999, if you don't mind, let's consolidate the investigation on ROCm/ROCm: ROCm/ROCm#4189

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants