Hardware runtime decision and utility #29

Open
wants to merge 4 commits into master

Conversation

Brandon-Wilson

Hardware selection is now a runtime decision, so that precompiled
binaries can run without crashing on systems that have no GPU. In
addition, the target can be selected at runtime, allowing custom
hardware-selection logic to be defined.

Several utility features have been added, including device runtime
functions (getting the number of devices / setting the active device),
basic shared memory support, and 3D execution grids.

All of these changes were made with careful consideration to avoid
compatibility issues with pre-existing code.

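Below is a minimal usage sketch of the runtime selection described above. The getDeviceCount / setDevice / setLocation names and the hemi::host constant are illustrative assumptions and may not match the exact API added in this pull request; ExecutionPolicy, hemi::device, and the launch overload taking a policy and a functor appear in the diff.

#include "hemi/hemi.h"
#include "hemi/launch.h"

// Simple element-wise kernel functor, in the style hemi already supports.
struct Saxpy {
    HEMI_DEV_CALLABLE_INLINE_MEMBER
    void operator()(int n, float a, const float* x, float* y) const {
        for (int i = hemi::globalThreadIndex(); i < n; i += hemi::globalThreadCount())
            y[i] = a * x[i] + y[i];
    }
};

void saxpy(int n, float a, const float* x, float* y) {
    hemi::ExecutionPolicy p;
    // Assumed utility names: pick the GPU when one is present, otherwise
    // fall back to the host so the same binary runs on CPU-only systems.
    if (hemi::getDeviceCount() > 0) {
        hemi::setDevice(0);
        p.setLocation(hemi::device);
    } else {
        p.setLocation(hemi::host);
    }
    hemi::launch(p, Saxpy(), n, a, x, y);
}
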
@Brandon-Wilson
Author

NOTE: My advisors at JPL are interested in using HEMI / CUDA for image processing on upcoming missions, but would like to maintain full CPU compatibility. Ideally, these changes could be merged into the master branch so they are supported going forward.

Thank you for your time!

Owner

@harrism left a comment

Overall I'm not sure some of the added complexity is in the spirit of hemi's original design. Please address the comments. I also think there's too much in this for one pull request. Please separate the CPU/GPU runtime execution selection from all of the utility pieces. That would make it easier for me to test and integrate.

Also, I won't accept any new functionality without tests. Please add tests (see the google tests already included in hemi for reference) for your new functionality.

@@ -69,6 +69,47 @@ size_t availableSharedBytesPerBlock(size_t sharedMemPerMultiprocessor,
return bytes - sharedSizeBytesStatic;
}

template <typename Function, typename... Arguments>
cudaError_t getMaxBlocksForDevice(unsigned int& NumBlocks, ExecutionPolicy &p)
Owner

A comment to explain what this getMaxBlocksForDevice... function is for vs. configureGrid would help me review this.

@@ -21,70 +21,272 @@

namespace hemi
{
///////Global Grid Gets//////////////////////////////////////////////////////
Owner

This comment doesn't match the style of other comments in the project. And what do you mean by "Gets"?

return 0;
#endif
}
int globalThreadIndex() {
Owner

Why the deep indent?

}
int globalThreadIndex() {
#ifdef HEMI_DEV_CODE
return (blockIdx.x + blockIdx.y * gridDim.x + blockIdx.z * gridDim.x * gridDim.y) // get block idx
Owner

OK, so you went full-3D for the global index.... I think this needs more design thought. The common case is 1D, and this is going to make the common case slow. I would leave globalThreadIndex() alone, and then make new accessors such as : globalThreadIndex1D, globalThreadIndex2D, globalThreadIndex3D. Or something like that. Do you use a lot of 3D blocks/grids?
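
For reference, a rough sketch of what such split accessors could look like, so the common 1D case stays cheap (illustrative only, not code from this pull request; a 2D variant would follow the same pattern):

HEMI_DEV_CALLABLE_INLINE
int globalThreadIndex1D() {
#ifdef HEMI_DEV_CODE
    // Common case: 1D grid of 1D blocks, kept as cheap as the original.
    return blockIdx.x * blockDim.x + threadIdx.x;
#else
    return 0;
#endif
}

HEMI_DEV_CALLABLE_INLINE
int globalThreadIndex3D() {
#ifdef HEMI_DEV_CODE
    // Linearize the block index, then the thread index within the block.
    int block  = blockIdx.x + blockIdx.y * gridDim.x
               + blockIdx.z * gridDim.x * gridDim.y;
    int thread = threadIdx.x + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
    return block * (blockDim.x * blockDim.y * blockDim.z) + thread;
#else
    return 0;
#endif
}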

HEMI_DEV_CALLABLE_INLINE
int globalThreadCount() {
#ifdef HEMI_DEV_CODE
return blockDim.x * gridDim.x * blockDim.y * gridDim.y * blockDim.z * gridDim.z;
Owner

See my comment about globalThreadIndex above.

@@ -0,0 +1,26 @@
///////////////////////////////////////////////////////////////////////////////
Owner

I would consider renaming this hemi.h and renaming hemi.h hemi_core.h... (So the convenient default is to include everything, but if you want to pare it down you can).

hemi/launch.h Outdated
@@ -24,10 +24,18 @@ namespace hemi {
template <typename Function, typename... Arguments>
void launch(Function f, Arguments... args);

// Automatic parallel launch
template <typename Function, typename... Arguments>
void launch(Arguments... args);
Owner

What is a launch without a kernel?

hemi/launch.h Outdated
// Launch function object with an explicit execution policy / configuration
template <typename Function, typename... Arguments>
void launch(const ExecutionPolicy &p, Function f, Arguments... args);

// Launch function with an explicit execution policy / configuration
template <typename Function, typename... Arguments>
void launch(const ExecutionPolicy &p, Arguments... args);
Owner

Ditto.

// Automatic parallel launch
//
template <typename Function, typename... Arguments>
void launch(Arguments... args)
Owner

How does one pass a concrete function reference to this launch()? Can you show a working example?
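
(Presumably this overload can only be invoked with the functor type supplied as an explicit template argument, along the lines of the hypothetical call below; this is a guess at the intended usage, not code from the pull request.)

// Hypothetical: the functor type is given explicitly, so launch() would
// have to default-construct an instance of it internally.
hemi::launch<Saxpy>(n, a, x, y);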

if (p.getLocation() == hemi::device && queryForDevice())
{
checkCuda(configureGrid(p, Kernel<Function, Arguments...>));
Kernel << <p.getExecutionGrid(),
Owner

Please fix spacing ("Kernel << <" --> "Kernel<<<").
