Hardware runtime decision and utility #29
base: master
Conversation
Hardware selection is now a runtime decision, so that precompiled binaries can run on a system without a GPU instead of crashing. In addition, the target can be selected at runtime to define custom hardware selection logic. Several utility features have been added, including device runtime functions (getting the number of devices / setting the active device), basic shared memory support, and 3D execution grids. All of these changes have been made with careful consideration to avoid compatibility issues with pre-existing code.
NOTE: My advisors at JPL are interested in using HEMI / CUDA for image processing on upcoming missions, but would like to maintain full CPU compatibility. Preferably, these changes could be pushed to the master branch for support in the future. Thank you for your time!
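For context, a minimal sketch of how the runtime selection described above might look in user code. Only ExecutionPolicy, hemi::device, getLocation(), queryForDevice(), and the policy-taking launch appear in this PR's diff; setLocation(), hemi::host, and the exact header names are assumptions made for illustration, not confirmed API.

```cpp
#include "hemi/hemi.h"
#include "hemi/launch.h"
#include "hemi/device_api.h"

// Simple functor kernel in the style hemi already supports.
struct saxpy {
    HEMI_DEV_CALLABLE_MEMBER
    void operator()(int n, float a, const float *x, float *y) const {
        for (int i = hemi::globalThreadIndex(); i < n; i += hemi::globalThreadCount())
            y[i] = a * x[i] + y[i];
    }
};

void run_saxpy(int n, float a, const float *x, float *y) {
    hemi::ExecutionPolicy p;
    // Hypothetical setter and host target: fall back to the CPU path when no
    // GPU is present, instead of crashing at launch time.
    p.setLocation(hemi::queryForDevice() ? hemi::device : hemi::host);
    hemi::launch(p, saxpy(), n, a, x, y);
}
```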
Overall I'm not sure some of the added complexity is in the spirit of hemi's original design. Please address the comments. I also think there's too much in this for one pull request. Please separate the CPU/GPU runtime execution selection from all of the utility pieces. That would make it easier for me to test and integrate.
Also, I won't accept any new functionality without tests. Please add tests (see the google tests already included in hemi for reference) for your new functionality.
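To illustrate the kind of test being requested, a minimal Google Test sketch follows; hemi::getDeviceCount() is a placeholder name inferred from the PR description ("getting the number of devices"), not a confirmed API.

```cpp
#include "gtest/gtest.h"
#include "hemi/hemi.h"

// Sketch only: a device-count query such as the one described in the PR
// (name assumed) should succeed and report zero or more devices, even on a
// host with no GPU.
TEST(RuntimeSelection, DeviceCountIsNonNegative) {
    int count = hemi::getDeviceCount();  // hypothetical utility from this PR
    EXPECT_GE(count, 0);
}
```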
@@ -69,6 +69,47 @@ size_t availableSharedBytesPerBlock(size_t sharedMemPerMultiprocessor,
    return bytes - sharedSizeBytesStatic;
}

template <typename Function, typename... Arguments>
cudaError_t getMaxBlocksForDevice(unsigned int& NumBlocks, ExecutionPolicy &p)
A comment to explain what this getMaxBlocksForDevice... function is for vs. configureGrid would help me review this.
@@ -21,70 +21,272 @@

namespace hemi
{
///////Global Grid Gets//////////////////////////////////////////////////////
This comment doesn't match the style of other comments in the project. And what do you mean by "Gets"?
    return 0;
#endif
}
int globalThreadIndex() {
Why the deep indent?
}
int globalThreadIndex() {
#ifdef HEMI_DEV_CODE
    return (blockIdx.x + blockIdx.y * gridDim.x + blockIdx.z * gridDim.x * gridDim.y) // get block idx
OK, so you went full-3D for the global index.... I think this needs more design thought. The common case is 1D, and this is going to make the common case slow. I would leave globalThreadIndex() alone, and then make new accessors such as globalThreadIndex1D, globalThreadIndex2D, globalThreadIndex3D. Or something like that. Do you use a lot of 3D blocks/grids?
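For concreteness, a sketch of what such dimension-specific accessors could look like as additions alongside the existing accessors in this header, reusing the HEMI_DEV_CALLABLE_INLINE / HEMI_DEV_CODE conventions visible in this diff; the names and exact formulas are illustrative, not part of the PR. A 2D variant would follow the same pattern.

```cpp
// 1D form keeps the existing fast path (illustrative sketch).
HEMI_DEV_CALLABLE_INLINE
int globalThreadIndex1D() {
#ifdef HEMI_DEV_CODE
    return blockIdx.x * blockDim.x + threadIdx.x;
#else
    return 0;
#endif
}

// Full 3D linearization: block-linear index times threads per block, plus the
// thread's linear index within its block (illustrative sketch).
HEMI_DEV_CALLABLE_INLINE
int globalThreadIndex3D() {
#ifdef HEMI_DEV_CODE
    int block  = blockIdx.x + blockIdx.y * gridDim.x
               + blockIdx.z * gridDim.x * gridDim.y;
    int thread = threadIdx.x + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
    return block * (blockDim.x * blockDim.y * blockDim.z) + thread;
#else
    return 0;
#endif
}
```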
HEMI_DEV_CALLABLE_INLINE
int globalThreadCount() {
#ifdef HEMI_DEV_CODE
    return blockDim.x * gridDim.x * blockDim.y * gridDim.y * blockDim.z * gridDim.z;
See my comment about globalThreadIndex above.
@@ -0,0 +1,26 @@
///////////////////////////////////////////////////////////////////////////////
I would consider renaming this hemi.h and renaming hemi.h hemi_core.h... (So the convenient default is to include everything, but if you want to pare it down you can).
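A sketch of the umbrella arrangement being suggested; which headers it would pull in is a guess based on the files touched in this PR.

```cpp
// hemi/hemi.h -- convenient default that includes everything (sketch only)
#pragma once

#include "hemi_core.h"         // today's hemi.h, renamed
#include "execution_policy.h"
#include "launch.h"
#include "device_api.h"
```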
hemi/launch.h (Outdated)
@@ -24,10 +24,18 @@ namespace hemi {
template <typename Function, typename... Arguments>
void launch(Function f, Arguments... args);

// Automatic parallel launch
template <typename Function, typename... Arguments>
void launch(Arguments... args);
What is a launch without a kernel?
hemi/launch.h (Outdated)
// Launch function object with an explicit execution policy / configuration
template <typename Function, typename... Arguments>
void launch(const ExecutionPolicy &p, Function f, Arguments... args);

// Launch function with an explicit execution policy / configuration
template <typename Function, typename... Arguments>
void launch(const ExecutionPolicy &p, Arguments... args);
Ditto.
// Automatic parallel launch
//
template <typename Function, typename... Arguments>
void launch(Arguments... args)
How does one pass a concrete function reference to this launch()? Can you show a working example?
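One guess at the intended call pattern, if the idea is to pass the functor type as an explicit template argument and have launch default-construct it; whether that matches the author's intent is exactly the question above.

```cpp
#include "hemi/hemi.h"
#include "hemi/launch.h"
#include "hemi/device_api.h"

// Guess at the usage of the kernel-less overload: Function supplied
// explicitly and (presumably) default-constructed inside launch.
struct fill {
    HEMI_DEV_CALLABLE_MEMBER
    void operator()(float *data, int n, float value) const {
        for (int i = hemi::globalThreadIndex(); i < n; i += hemi::globalThreadCount())
            data[i] = value;
    }
};

void example(float *d_data, int n) {
    hemi::launch<fill>(d_data, n, 1.0f);  // Function = fill given explicitly
}
```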
if (p.getLocation() == hemi::device && queryForDevice())
{
    checkCuda(configureGrid(p, Kernel<Function, Arguments...>));
    Kernel << <p.getExecutionGrid(),
Please fix spacing ("Kernel << <" --> "Kernel<<<").