@Vipul-Cariappa (Collaborator)
Required for benchmarks.
Enables calling CUDA kernels directly from Python, without any C/C++ wrapper around the CUDA kernel.

TODO
Add a test in cppyy. The following may be sufficient.

import ctypes

import cppjit
from cppjit import cpp

cppjit.cppdef(
    r"""
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    c[i] = a[i] + b[i];
  }
}
"""
)

N = 10
size = N * cppjit.sizeof(float)

# Allocate CUDA unified (managed) memory, accessible from both host and device.
a = ctypes.POINTER(ctypes.c_float)()
b = ctypes.POINTER(ctypes.c_float)()
c = ctypes.POINTER(ctypes.c_float)()

cpp.cudaMallocManaged(a, size)
cpp.cudaMallocManaged(b, size)
cpp.cudaMallocManaged(c, size)

print("BEFORE INIT")
for i in range(N):
    print(f"{a[i] = }, {b[i] = }, {c[i] = }")
    a[i] = i * 1.0
    b[i] = i * 2.0

print("AFTER INIT")
for i in range(N):
    print(f"{a[i] = }, {b[i] = }, {c[i] = }")

threadsPerBlock = 256
blocksPerGrid = (N + threadsPerBlock - 1) // threadsPerBlock

# Launch the kernel; the [blocks, threads] subscript supplies the launch
# configuration, mirroring CUDA's <<<blocks, threads>>> syntax.
cpp.vectorAdd[blocksPerGrid, threadsPerBlock](a, b, c, N)
cpp.cudaDeviceSynchronize()

err = cpp.cudaGetLastError()
if err != cpp.cudaSuccess:
    print(f"{cpp.cudaGetErrorName(err)}: {cpp.cudaGetErrorString(err)}")

print("AFTER COMPUTE")
for i in range(N):
    print(f"{a[i] = }, {b[i] = }, {c[i] = }")

Depends On
