new feature: gpu allocate and deallocate #41
Comments
If I understand what you're looking for correctly (essentially, GPU allocations that are addressable from Fortran), then I think FLCL already provides this functionality. I will explain what I mean, but by all means let me know if I've misunderstood your requirements.

If we look at the 'allocate_view' collection of methods: on a system with both GPUs and some sort of coherent memory architecture between the CPUs and GPUs (e.g. NVIDIA's UVM on, say, a Power9-type system), when you call an FLCL allocate_view method, the resulting allocation is addressable both from Fortran on the host and from kernels running on the GPU.

But for any system with GPUs, regardless of whether there is coherent/automatic memory transfer as outlined above, there are also the 'allocate_dualview' methods. (I am currently finishing up the implementation and unit tests as my current work item; see #40 for progress.) For dualviews it is largely as above, but allocations are made in both the memory space of the default CPU execution space and that of the default GPU execution space. Then (in C++) requests are made to the dualview to mark a memory space as dirty and to sync to a memory space, so you can control the transfer of memory from one backing store to the other manually.

Perhaps one other concern is interoperable scalar types, limited by ISO_C_BINDING. Currently FLCL supports logicals/bools, 32- and 64-bit integers, and 32- and 64-bit reals with allocate_view and allocate_dualview. Let me know if I've misunderstood what you're looking for.
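To make that concrete, the allocate_view usage from the Fortran side looks roughly like this (a minimal sketch from memory of the FLCL interfaces; treat the module, type, and generic names as approximate):

```fortran
program flcl_view_sketch
  use, intrinsic :: iso_c_binding
  use :: flcl_mod   ! FLCL Fortran module (name approximate)
  implicit none
  ! (Kokkos initialization/finalization omitted for brevity.)

  real(c_double), pointer, dimension(:,:) :: x => null()
  type(view_r64_2d_t) :: v_x

  ! One allocation backs both handles: x is addressable from Fortran,
  ! v_x wraps the same storage as a Kokkos::View usable from C++.
  ! On a coherent system (e.g. UVM), GPU kernels see the same memory.
  call kokkos_allocate_view(x, v_x, 'x', 100_c_size_t, 200_c_size_t)

  x = 0.0_c_double   ! host access through the Fortran pointer
  ! ... pass v_x to C++ and launch kernels on it ...

  call kokkos_deallocate_view(x, v_x)
end program flcl_view_sketch
```

The allocate_dualview variants should follow the same calling pattern, but with paired host and device allocations and with the dirty/sync control living on the C++ side.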
That is not a perfect match, but it does provide some of what we would need. We also need the ability to make allocations on the device only, i.e. allocations that aren't addressable from Fortran. I guess this is an odd requirement; the reg_storage Fortran layer is designed to support different models, including allocating on the device only and having all the actual access happen from C++ code. In any case, it's good to know that Fortran-initiated allocation is supported!
Ah, I see the distinction. A potential workaround would be to use a dummy Fortran array (type/rank/size conforming, of course) when invoking the allocate_view methods, and then simply never touch that array from the Fortran side.
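Concretely, a hypothetical sketch of that workaround, reusing the assumed FLCL names from the snippet above:

```fortran
program device_only_sketch
  use, intrinsic :: iso_c_binding
  use :: flcl_mod   ! same assumed FLCL interfaces as the sketch above
  implicit none

  ! The Fortran pointer exists only to satisfy the interface; it is
  ! never dereferenced on the host, so the allocation is effectively
  ! device-only from the application's point of view.
  real(c_double), pointer, dimension(:,:) :: dummy => null()
  type(view_r64_2d_t) :: v_device

  call kokkos_allocate_view(dummy, v_device, 'device_data', &
       100_c_size_t, 200_c_size_t)

  ! ... all access happens from C++ through v_device ...

  call kokkos_deallocate_view(dummy, v_device)
end program device_only_sketch
```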
Hey @bd4, is an API like this kind of what you had in mind?

```fortran
real(REAL64), intent(inout) :: data(:,:)
type(kokkos_r64_2d_view_t) :: device_view

call device_view%allocate(size(data,1), size(data,2))
call kokkos_deep_copy(device_view, data)
! do something with the view
call kokkos_deep_copy(data, device_view)
```

Or do you want completely unmanaged memory?
Separately from @AndrewGaspar's proposal above, we talked about this and came up with a GPU-only idea: if you compile FLCL against a GPU backend, then the views it allocates would be backed by GPU memory only.
@AndrewGaspar that is what I had in mind. Would the device_view be device-only memory, i.e. cudaMalloc / hipMalloc?

The issue we are running into is that for HIP/ROCm on AMD GPUs, managed memory is not performant (currently it's just pinned host memory). For the CUDA port we actually only use UVM, even though the reg_storage class hierarchy supports device-only allocations. This is going to need to change for AMD, and there is discussion of spinning off the reg_storage Fortran classes for use in other projects.

For AMD, we are still discussing options. A mirror view with explicit synchronization points (like your dualview?) is one of the things we are considering. Using device memory and restructuring the code so that all host accesses are removed for the duration of the timestepping (a clearer delineation between device and host regions, with copies between them as needed) is another option. Managed memory was used initially as a rapid prototyping tool, but now that the whole timestep calculation is running on the GPU, it's less necessary.
That's right - we wouldn't even map the array in an addressable way from Fortran - though maybe we could if CUDA Fortran were enabled.
Is there any interest in adding GPU-independent allocate and deallocate wrappers (for kokkos_malloc) to FLCL? What about things like Memcpy and Memset?
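To sketch what I mean (hypothetical names throughout; the idea would be a thin C-callable shim over Kokkos::kokkos_malloc and Kokkos::kokkos_free, exposed via ISO_C_BINDING):

```fortran
! Hypothetical Fortran bindings for backend-independent GPU allocation.
! flcl_malloc / flcl_free are illustrative names for a C++ shim that
! would forward to Kokkos::kokkos_malloc / Kokkos::kokkos_free.
module gpu_alloc_sketch
  use, intrinsic :: iso_c_binding
  implicit none

  interface
     type(c_ptr) function flcl_malloc(nbytes) bind(c, name='flcl_malloc')
       import :: c_ptr, c_size_t
       integer(c_size_t), value :: nbytes
     end function flcl_malloc

     subroutine flcl_free(ptr) bind(c, name='flcl_free')
       import :: c_ptr
       type(c_ptr), value :: ptr
     end subroutine flcl_free
  end interface
end module gpu_alloc_sketch
```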
I am working on a port of the GENE fusion code (http://genecode.org/) to run on AMD and Intel GPUs (it already works on CUDA). It currently uses a mix of Fortran (the original CPU-only code) and CUDA C/C++, together with gtensor (https://github.com/wdmapp/gtensor), a multi-dimensional C++ array library with lazy array evaluation and automatic generation of kernels.
Currently the memory management lives in Fortran, via a family of "reg_storage" types that call regular allocate for CPU, or a Fortran-wrapped cudaMalloc, underneath. gtensor has an experimental gpuMalloc wrapper that calls hipMalloc or sycl::malloc under the hood, but we are interested in leveraging existing solutions and collaborating as much as possible. We are exploring how gtensor's lazy evaluation model could be implemented on top of RAJA or Kokkos, and/or interoperate with mdspan-compatible view types. And because we are working on a Fortran application, ease of interoperability is also a major factor.
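As a rough picture of that pattern (an illustrative sketch only, not the actual GENE code; it reuses the hypothetical flcl_malloc binding sketched earlier):

```fortran
! Illustrative sketch of the reg_storage idea: one container type whose
! allocation either uses plain Fortran allocate (CPU path) or a device
! allocator (GPU path). Not the actual GENE implementation.
module reg_storage_sketch
  use, intrinsic :: iso_c_binding
  use :: gpu_alloc_sketch   ! hypothetical flcl_malloc binding from above
  implicit none

  type :: reg_storage_r64
     real(c_double), allocatable :: host(:)  ! CPU path
     type(c_ptr) :: device = c_null_ptr      ! device-only path
  contains
     procedure :: alloc => reg_storage_alloc
  end type reg_storage_r64

contains

  subroutine reg_storage_alloc(self, n, on_device)
    class(reg_storage_r64), intent(inout) :: self
    integer, intent(in) :: n
    logical, intent(in) :: on_device
    if (on_device) then
       ! In the CUDA port this is a Fortran-wrapped cudaMalloc; a
       ! portable version could go through a kokkos_malloc-backed shim.
       self%device = flcl_malloc(int(n, c_size_t) * c_sizeof(0.0_c_double))
    else
       allocate(self%host(n))
    end if
  end subroutine reg_storage_alloc

end module reg_storage_sketch
```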