new feature: gpu allocate and deallocate #41
Comments
If I understand what you're looking for correctly (essentially, GPU allocations that are addressable from Fortran), then I think FLCL already provides this functionality. I will explain what I mean, but by all means let me know if I've misunderstood your requirements.

If we look at the 'allocate_view' collection of methods: on a system with both GPUs and some sort of coherent memory architecture between the CPUs and GPUs (e.g. NVIDIA's UVM on, say, a Power9-type system), when you call an FLCL allocate_view method, the resulting allocation is addressable both from Fortran on the host and from kernels running on the GPU.

But for any system with GPUs, regardless of whether there is coherent/automatic memory transfer as outlined above, there are also the 'allocate_dualview' methods. (I am currently finishing up the implementation and unit tests as my current work item; see #40 for progress.) For dualviews it is largely as above, but allocations are made in both the memory space of the default CPU execution space and that of the default GPU execution space. Then (in C++) requests are made to the dualview to mark a memory space as dirty and to sync to a memory space, so you can control the transfer of memory from one backing store to the other manually.

Perhaps one other concern is interoperable scalar types, limited by ISO_C_BINDING. Currently FLCL supports logicals/bools, 32- and 64-bit integers, and 32- and 64-bit reals with allocate_view and allocate_dualview. Let me know if I've misunderstood what you're looking for.
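To make that concrete, the allocate_view usage from the Fortran side looks roughly like this (a minimal sketch from memory of the FLCL interfaces; treat the module, type, and generic names as approximate):

```fortran
program flcl_view_sketch
  use, intrinsic :: iso_c_binding
  use :: flcl_mod   ! FLCL Fortran module (name approximate)
  implicit none
  ! (Kokkos initialization/finalization omitted for brevity.)

  real(c_double), pointer, dimension(:,:) :: x => null()
  type(view_r64_2d_t) :: v_x

  ! One allocation backs both handles: x is addressable from Fortran,
  ! v_x wraps the same storage as a Kokkos::View usable from C++.
  ! On a coherent system (e.g. UVM), GPU kernels see the same memory.
  call kokkos_allocate_view(x, v_x, 'x', 100_c_size_t, 200_c_size_t)

  x = 0.0_c_double   ! host access through the Fortran pointer
  ! ... pass v_x to C++ and launch kernels on it ...

  call kokkos_deallocate_view(x, v_x)
end program flcl_view_sketch
```

The allocate_dualview variants should follow the same calling pattern, but with paired host and device allocations and with the dirty/sync control living on the C++ side.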
That is not a perfect match, but it does provide some of what we would need. We also need the ability to make allocations on the device only, i.e. allocations that aren't addressable from Fortran. I guess this is an odd requirement; the reg_storage Fortran layer is designed to support different models, including allocating on the device only and having all the actual access happen from C++ code. In any case, it's good to know that Fortran-initiated allocation is supported!
Ah, I see the distinction. A potential workaround would be to use a dummy Fortran array (type/rank/size conforming, of course) when invoking the allocate_view methods, and then simply never touch that array from the Fortran side.
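Concretely, a hypothetical sketch of that workaround, reusing the assumed FLCL names from the snippet above:

```fortran
program device_only_sketch
  use, intrinsic :: iso_c_binding
  use :: flcl_mod   ! same assumed FLCL interfaces as the sketch above
  implicit none

  ! The Fortran pointer exists only to satisfy the interface; it is
  ! never dereferenced on the host, so the allocation is effectively
  ! device-only from the application's point of view.
  real(c_double), pointer, dimension(:,:) :: dummy => null()
  type(view_r64_2d_t) :: v_device

  call kokkos_allocate_view(dummy, v_device, 'device_data', &
       100_c_size_t, 200_c_size_t)

  ! ... all access happens from C++ through v_device ...

  call kokkos_deallocate_view(dummy, v_device)
end program device_only_sketch
```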
Hey @bd4, is an API like this kind of what you had in mind?

```fortran
real(REAL64), intent(inout) :: data(:,:)
type(kokkos_r64_2d_view_t) :: device_view

call device_view%allocate(size(data,1), size(data,2))
call kokkos_deep_copy(device_view, data)
! do something with the view
call kokkos_deep_copy(data, device_view)
```

Or do you want completely unmanaged memory?
Separately from @AndrewGaspar's proposal above, we talked about this and came up with a GPU-only idea: if you compile FLCL against a GPU backend, then the views it allocates would be backed by GPU memory only.
@AndrewGaspar that is what I had in mind. Would the device_view be device-only memory, i.e. cudaMalloc / hipMalloc?

The issue we are running into is that for HIP/ROCm on AMD GPUs, managed memory is not performant (currently it's just pinned host memory). For the CUDA port we actually only use UVM, even though the reg_storage class hierarchy supports device-only allocations. This is going to need to change for AMD, and there is discussion of spinning off the reg_storage Fortran classes for use in other projects.

For AMD, we are still discussing options. A mirror view with explicit synchronization points (like your dualview?) is one of the things we are considering. Using device memory and restructuring the code so that all host accesses are removed for the duration of the timestepping (a clearer delineation between device and host regions, with copies between them as needed) is another option. Managed memory was used initially as a rapid prototyping tool, but now that the whole timestep calculation is running on the GPU, it's less necessary.
That's right - we wouldn't even map the array in an addressable way from Fortran - though maybe we could if CUDA Fortran were enabled.
Is there any interest in adding GPU-independent allocate and deallocate wrappers (for kokkos_malloc) to FLCL? What about things like Memcpy and Memset?
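To sketch what I mean (hypothetical names throughout; the idea would be a thin C-callable shim over Kokkos::kokkos_malloc and Kokkos::kokkos_free, exposed via ISO_C_BINDING):

```fortran
! Hypothetical Fortran bindings for backend-independent GPU allocation.
! flcl_malloc / flcl_free are illustrative names for a C++ shim that
! would forward to Kokkos::kokkos_malloc / Kokkos::kokkos_free.
module gpu_alloc_sketch
  use, intrinsic :: iso_c_binding
  implicit none

  interface
     type(c_ptr) function flcl_malloc(nbytes) bind(c, name='flcl_malloc')
       import :: c_ptr, c_size_t
       integer(c_size_t), value :: nbytes
     end function flcl_malloc

     subroutine flcl_free(ptr) bind(c, name='flcl_free')
       import :: c_ptr
       type(c_ptr), value :: ptr
     end subroutine flcl_free
  end interface
end module gpu_alloc_sketch
```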
I am working on a port of the GENE fusion code (http://genecode.org/) to run on AMD and Intel GPUs (it already works on CUDA). It currently uses a mix of Fortran (the original CPU-only code) and CUDA C/C++, together with gtensor (https://github.com/wdmapp/gtensor), a multi-dimensional C++ array library with lazy array evaluation and automatic generation of kernels.
Currently the memory management lives in Fortran, via a family of "reg_storage" types that call regular allocate for CPU, or a Fortran-wrapped cudaMalloc, underneath. gtensor has an experimental gpuMalloc wrapper that calls hipMalloc or sycl::malloc under the hood, but we are interested in leveraging existing solutions and collaborating as much as possible. We are exploring how gtensor's lazy evaluation model could be implemented on top of RAJA or Kokkos, and/or interoperate with mdspan-compatible view types. And because we are working on a Fortran application, ease of interoperability is also a major factor.
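As a rough picture of that pattern (an illustrative sketch only, not the actual GENE code; it reuses the hypothetical flcl_malloc binding sketched earlier):

```fortran
! Illustrative sketch of the reg_storage idea: one container type whose
! allocation either uses plain Fortran allocate (CPU path) or a device
! allocator (GPU path). Not the actual GENE implementation.
module reg_storage_sketch
  use, intrinsic :: iso_c_binding
  use :: gpu_alloc_sketch   ! hypothetical flcl_malloc binding from above
  implicit none

  type :: reg_storage_r64
     real(c_double), allocatable :: host(:)  ! CPU path
     type(c_ptr) :: device = c_null_ptr      ! device-only path
  contains
     procedure :: alloc => reg_storage_alloc
  end type reg_storage_r64

contains

  subroutine reg_storage_alloc(self, n, on_device)
    class(reg_storage_r64), intent(inout) :: self
    integer, intent(in) :: n
    logical, intent(in) :: on_device
    if (on_device) then
       ! In the CUDA port this is a Fortran-wrapped cudaMalloc; a
       ! portable version could go through a kokkos_malloc-backed shim.
       self%device = flcl_malloc(int(n, c_size_t) * c_sizeof(0.0_c_double))
    else
       allocate(self%host(n))
    end if
  end subroutine reg_storage_alloc

end module reg_storage_sketch
```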