[CK_TILE] Multiple-D GEMM example #2008

Draft
wants to merge 8 commits into
base: develop

Conversation

mozga-amd
Contributor

Proposed changes

Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

Collaborator

aosewski left a comment

I haven't finished reviewing all the files yet, but I've spotted a few things that look very suspicious to me. Please verify them.

if(ck_tile::EnvIsEnabled(CK_TILE_ENV(CK_TILE_LOGGING)))
{
CK_TILE_ERROR(
"Can't support N that is not a multiple of NPerBlock without padding!");

Please note that this check is with respect to tensor D.

@@ -399,6 +467,29 @@ struct GemmKernel
}
}();

// TODO: enable vector write for D in ColMajor

This comment is misleading. Please remove it.

Comment on lines +486 to +487
make_tuple(kargs.M, kargs.N),
make_tuple(kargs.stride_Ds[i], 1),

Please fix this. These strides currently describe a RowMajor layout, not ColumnMajor.
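To make the layout point concrete, here is a minimal standalone sketch of how a stride pair maps a 2D index to a linear offset. The helper names (`offset`, `row_major_offset`, `col_major_offset`) and the leading dimension `ld` are hypothetical and are not part of ck_tile; the sketch only assumes the usual convention that strides `(ld, 1)` mean row-major and `(1, ld)` mean column-major.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a ck_tile API): linear offset of element (row, col)
// in a 2D view described by a pair of strides.
constexpr std::size_t offset(std::size_t row,
                             std::size_t col,
                             std::size_t row_stride,
                             std::size_t col_stride)
{
    return row * row_stride + col * col_stride;
}

// Row-major: elements of one row are contiguous, so the strides are (ld, 1).
constexpr std::size_t row_major_offset(std::size_t row, std::size_t col, std::size_t ld)
{
    return offset(row, col, ld, 1);
}

// Column-major: elements of one column are contiguous, so the strides are (1, ld).
constexpr std::size_t col_major_offset(std::size_t row, std::size_t col, std::size_t ld)
{
    return offset(row, col, 1, ld);
}

// Same index, same leading dimension, different layouts.
static_assert(row_major_offset(2, 3, 8) == 19, "2 * 8 + 3");
static_assert(col_major_offset(2, 3, 8) == 26, "2 + 3 * 8");
```

Under that convention, a stride tuple of the form `(stride, 1)` addresses memory row by row, which is why it does not match a branch meant for a column-major D.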

return make_tuple(a_tensor_view, b_tensor_view, c_tensor_view);
return make_tuple(a_tensor_view,
b_tensor_view,
generate_tuple(d_tensor_view, number<NumDTensor>{}),

Can we keep the style consistent? Please create a named variable and use it here.

operator()(ODramWindow& out_dram_window, const OAccTile& o_acc_tile, void* p_smem)
CK_TILE_DEVICE auto operator()(ODramWindow& out_dram_window,
const OAccTile& o_acc_tile,
const DDramWindow& ds_dram_window,

Please add a static_assert on its size.
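As a rough illustration of the kind of check being requested, the sketch below uses `std::tuple` as a stand-in for the tuple of D windows. `NumDTensor` here is a hypothetical local constant and `checked_d_count` is a hypothetical helper; ck_tile's own tuple type and its size accessor are not assumed, so the actual assertion in the epilogue would need to use whatever the library provides.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

// Hypothetical stand-in for the kernel's compile-time number of D tensors.
constexpr std::size_t NumDTensor = 2;

// Hypothetical helper: rejects, at compile time, any tuple whose element count
// differs from NumDTensor, then returns that count.
template <typename DWindows>
constexpr std::size_t checked_d_count(const DWindows&)
{
    static_assert(std::tuple_size_v<DWindows> == NumDTensor,
                  "ds_dram_window must hold exactly NumDTensor windows");
    return std::tuple_size_v<DWindows>;
}
```

Calling `checked_d_count(std::make_tuple(1, 2.0f))` compiles and returns 2, while passing a three-element tuple fails to compile with the message above. This needs C++17 for `std::tuple_size_v`.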

@@ -154,6 +181,14 @@ struct CShuffleEpilogue
tile_distribution_pattern::thread_raked>;
constexpr auto dram_tile_distribution = TileEncodingPattern::Make2DStaticTileDistribution();

auto d_dram_small_window = generate_tuple(

Suggested change
auto d_dram_small_window = generate_tuple(
auto d_dram_windows = generate_tuple(

@@ -154,6 +181,14 @@ struct CShuffleEpilogue
tile_distribution_pattern::thread_raked>;
constexpr auto dram_tile_distribution = TileEncodingPattern::Make2DStaticTileDistribution();

auto d_dram_small_window = generate_tuple(
[&](auto idx) { return make_tile_window(ds_dram_window[idx], dram_tile_distribution); },

You have to set the tile window lengths here, as is done for the LDS windows. Otherwise the window here has size MPerBlock x NPerBlock.


using elemenet_wise_output_t =
decltype(load_tile(make_tile_window(out_lds_window, dram_tile_distribution)));
elemenet_wise_output_t elemenet_wise_output;

Why is this actually needed? Why not just overwrite c_out_tensor? That would use far fewer registers.
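The register argument can be sketched outside the kernel with plain arrays: applying the elementwise operation in place reuses the storage that already holds the accumulator values instead of keeping a second tile-sized buffer live. `add_op` and `apply_in_place` are hypothetical names, and the sketch assumes the input and output element types match, which the PR would need to confirm for `c_out_tensor`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical elementwise operation combining one accumulator value with one D value.
constexpr float add_op(float c, float d) { return c + d; }

// Writes each result back into `c`, so no second buffer of N elements is needed.
template <std::size_t N>
void apply_in_place(std::array<float, N>& c, const std::array<float, N>& d)
{
    for(std::size_t i = 0; i < N; ++i)
    {
        c[i] = add_op(c[i], d[i]);
    }
}
```

If the elementwise output type differs from the accumulator type, an in-place update is not directly possible and a separate output tile may be justified.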

Comment on lines +5 to +7
constexpr int M = 3840;
constexpr int N = 4096;
constexpr int K = 4096;

Please don't use such large inputs in unit tests.


2 participants