
Give user the ability to specify buffer offsets in enqueue_transform #29

Merged 11 commits into geggo:master on Jan 26, 2017

Conversation

@SyamGadde (Contributor)

Sorry for the raft of pull requests -- these are all modifications I've been using successfully for a while and have found very useful; I'd welcome your input.

This set of commits adds in_offsets and out_offsets arguments to enqueue_transform(); when specified, they trigger the creation of sub-buffers under the hood. For arrays with multiple non-transformed axes that can't be "collapsed" into a single axis, the user can now slice the input array into appropriate chunks (using the standard slice operator) and simply pass x.base_data and x.offset as the inputs/outputs, as long as each offset is a multiple of the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN. It also allows transforming arrays that have non-zero offsets. For example, the following partially addresses issue #10 without creating copies of the original array or creating our own OpenCL sub-buffers:

import numpy
import pyopencl
import pyopencl.array
import gpyfft

context = pyopencl.create_some_context()
queue = pyopencl.CommandQueue(context)

a = numpy.zeros([1024, 256, 13], order='F', dtype=numpy.complex64)
a[:, :, :] = numpy.arange(256)[numpy.newaxis, :, numpy.newaxis]
a_gpu = pyopencl.array.to_device(queue, a)
b_gpu = pyopencl.array.zeros_like(a_gpu)

# notice here we use a sliced array which has a non-zero offset
transform = gpyfft.fft.FFT(context, queue, a_gpu[:,:,1], axes = (1,), out_array=b_gpu[:,:,1])

# here we transform using the high-level API (the non-zero offset would have failed here before these commits)
(event,) = transform.enqueue()
event.wait()
result1 = b_gpu.get()[:,:,1]

# here we transform all 13 chunks using the low-level API (so we don't have to re-create the plan)
# (all but the first chunk have non-zero offsets)
events = [
    transform.plan.enqueue_transform(
        (queue,), (a_gpu.base_data,), (b_gpu.base_data,),
        in_offsets=(a_gpu[:, :, i].offset,),
        out_offsets=(b_gpu[:, :, i].offset,),
    )[0]
    for i in range(13)
]
for event in events:
    event.wait()
result2 = b_gpu.get()
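As a side note on the alignment caveat above: CL_DEVICE_MEM_BASE_ADDR_ALIGN is reported by OpenCL in bits (pyopencl exposes it as device.mem_base_addr_align), so a quick host-side check could look like the sketch below. The helper name is my own, not part of gpyfft:

```python
def offset_is_subbuffer_safe(offset_bytes, mem_base_addr_align_bits):
    """Return True if a byte offset is a legal sub-buffer origin.

    CL_DEVICE_MEM_BASE_ADDR_ALIGN is reported in *bits*, so convert
    to bytes before testing divisibility.
    """
    align_bytes = mem_base_addr_align_bits // 8
    return offset_bytes % align_bytes == 0

# Each (1024, 256) complex64 chunk starts at a multiple of
# 1024 * 256 * 8 bytes, which satisfies a typical 1024-bit alignment:
print(offset_is_subbuffer_safe(1024 * 256 * 8, 1024))  # True
print(offset_is_subbuffer_safe(100, 1024))             # False
```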

@geggo (Owner) commented Jan 25, 2017

Hi,
thanks for providing a way to allow more general memory layouts and resolve issue #10. I need more time to look into this.

After having a quick glance at the code I have some comments/questions:

  • I would like to keep the low-level wrapper a simple, thin layer around the clFFT library, without adding much functionality not present in the library itself, just for the sake of maintainability. Therefore I would prefer that all the sub-buffer handling take place in the high-level wrapper, ideally in pure Python, relying on pyopencl. Do you think that is possible and makes sense?

@SyamGadde (Contributor, Author)

Sure, I don't see why it can't be done all in the high-level API -- in that case the same thing could be done in enqueue() instead. I'll amend this in a bit...

@SyamGadde (Contributor, Author)

OK, one reason I changed the low-level enqueue_transform was that it was the only way I could re-use the baked plan for multiple arrays (or multiple slices of the same array). Would it be acceptable to extend the high-level enqueue() to take optional data/result keyword arguments? If specified, they would override the stored data and result attributes and allow users to call enqueue() repeatedly; it could double-check that the shape/strides/dtype match the original data used to create the plan. If those arguments are not specified, it would use the stored values as normal.
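The double-check proposed here could look roughly like the following sketch (check_layout_matches is a hypothetical helper name, not existing gpyfft API):

```python
def check_layout_matches(planned, candidate):
    """Raise ValueError unless `candidate` can reuse a plan baked for
    `planned`: a baked clFFT plan fixes shape, strides, and dtype."""
    for attr in ('shape', 'strides', 'dtype'):
        if getattr(planned, attr) != getattr(candidate, attr):
            raise ValueError(
                '%s mismatch: plan expects %r, got %r'
                % (attr, getattr(planned, attr), getattr(candidate, attr)))
```

enqueue() could call this with the stored array and the override before submitting the transform, so a mismatched array fails fast instead of producing garbage.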

@geggo (Owner) commented Jan 25, 2017 via email

@SyamGadde (Contributor, Author)

I agree that automatically batching across non-transformed axes would be great to integrate into gpyfft. I've created a somewhat generic mechanism to do so outside of gpyfft, but I'm afraid it's not very elegant -- perhaps it's because I haven't yet stumbled on the "trivial" solution. Anyway, when I clean it up, I'll be happy to share it.

There may still be other reasons to re-use a plan created through the high-level API, so I'll create a separate enqueue function (that takes arrays), as you suggest, and test it.

@SyamGadde (Contributor, Author)

The latest commits limit the entirety of the changes to fft.py. I created a new function, enqueue_arrays, which takes data and (optional) result arrays that override the stored data/result arrays. enqueue() now simply calls enqueue_arrays() with the stored arrays, since they otherwise do the same thing.

I also found a pyopencl function that creates the sub-buffer for me, so there is no need to call into the OpenCL library directly.

The responsibility for automatic non-transformed-axis batching would need to be shared between __init__ and enqueue_arrays, because __init__ creates the clFFT plan, and you need to know how the array will be batched before creating the plan.
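To illustrate the bookkeeping involved (this is my own sketch, not code from this pull request): a deliberately naive chunk enumerator that loops over every non-transformed axis might look like this. A real implementation would first collapse whatever axes clFFT can batch directly, and __init__ would size the plan accordingly:

```python
import itertools

def chunk_slices(shape, fft_axes):
    """Yield one index tuple per chunk, fixing every non-transformed
    axis and keeping the transformed axes whole (slice(None))."""
    leftover = [ax for ax in range(len(shape)) if ax not in fft_axes]
    for combo in itertools.product(*(range(shape[ax]) for ax in leftover)):
        idx = [slice(None)] * len(shape)
        for ax, i in zip(leftover, combo):
            idx[ax] = i
        yield tuple(idx)

# For the (1024, 256, 13) example with axes=(1,), this naive version
# yields 1024 * 13 chunks; collapsing axis 0 into the clFFT batch
# would reduce that to 13 enqueues.
```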

Here is an example of usage:

import gpyfft
import numpy
import numpy.fft
import pyopencl
import pyopencl.array

context = pyopencl.create_some_context()
queue = pyopencl.CommandQueue(context)

a = numpy.zeros([1024, 256, 13], order='F', dtype=numpy.complex64)
a[:, :, :] = numpy.arange(256)[numpy.newaxis, :, numpy.newaxis]
a_gpu = pyopencl.array.to_device(queue, a)
b_gpu = pyopencl.array.zeros_like(a_gpu)
b2_gpu = pyopencl.array.zeros_like(a_gpu)

# notice here we use a sliced array which has a non-zero offset
transform = gpyfft.fft.FFT(context, queue, a_gpu[:,:,1], axes = (1,), out_array=b_gpu[:,:,1])

# here we transform using the high-level API (the non-zero offset would have failed here before these commits)
(event,) = transform.enqueue()
event.wait()
result1 = b_gpu.get()[:,:,1]
assert numpy.allclose(result1, numpy.fft.fftn(a[:,:,1], axes=(1,)))

# here we transform all 13 chunks
# (all but the first chunk have non-zero offsets)
events = [
    transform.enqueue_arrays(data=a_gpu[:,:,i], result=b2_gpu[:,:,i])[0]
    for i in range(13)
]
for event in events:
    event.wait()
result2 = b2_gpu.get()
assert numpy.allclose(result2, numpy.fft.fftn(a, axes=(1,)))
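The per-chunk offsets involved here can be verified on the host with plain NumPy; the byte offset that pyopencl's Array.offset reports for a_gpu[:, :, i] corresponds to the same arithmetic:

```python
import numpy

a = numpy.zeros([1024, 256, 13], order='F', dtype=numpy.complex64)
base_addr = a.__array_interface__['data'][0]

for i in range(13):
    chunk = a[:, :, i]
    # In an 'F'-ordered array the last axis varies slowest, so each
    # chunk is a contiguous block i * 1024 * 256 * itemsize bytes in.
    byte_offset = chunk.__array_interface__['data'][0] - base_addr
    assert byte_offset == i * 1024 * 256 * a.itemsize
```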

@geggo (Owner) commented Jan 26, 2017

Excellent, thanks for your contribution!

@geggo geggo merged commit 7a8bc41 into geggo:master Jan 26, 2017
geggo added a commit that referenced this pull request Jan 26, 2017
* 'master' of https://github.com/geggo/gpyfft:
  Merge PR #29: Accept pyopencl arrays with nonzero offsets (PR #29), add enqueue_arrays method
  Allow buffers to be of PyOpenCL-type PooledBuffer (in addition to the standard Buffer)
  Don't send an empty event array (clFFT thinks the sky is falling)