[CK_TILE] Multiple-D GEMM example #2008
base: develop
Conversation
I haven't finished reviewing all the files yet, but I've spotted a few things that look very suspicious to me. Please verify them.
if(ck_tile::EnvIsEnabled(CK_TILE_ENV(CK_TILE_LOGGING)))
{
    CK_TILE_ERROR(
        "Can't support N that is not a multiple of NPerBlock without padding!");
Please note this is wrt tensor D.
@@ -399,6 +467,29 @@ struct GemmKernel
    }
}();

// TODO: enable vector write for D in ColMajor
This comment is misleading. Please remove.
make_tuple(kargs.M, kargs.N),
make_tuple(kargs.stride_Ds[i], 1),
Please fix this. This is currently a RowMajor layout, not ColumnMajor.
return make_tuple(a_tensor_view, b_tensor_view, c_tensor_view);
return make_tuple(a_tensor_view,
                  b_tensor_view,
                  generate_tuple(d_tensor_view, number<NumDTensor>{}),
Can we keep coherent style? Please create a variable and use it here.
operator()(ODramWindow& out_dram_window, const OAccTile& o_acc_tile, void* p_smem)
CK_TILE_DEVICE auto operator()(ODramWindow& out_dram_window,
                               const OAccTile& o_acc_tile,
                               const DDramWindow& ds_dram_window,
Please static assert its size.
@@ -154,6 +181,14 @@ struct CShuffleEpilogue
    tile_distribution_pattern::thread_raked>;
constexpr auto dram_tile_distribution = TileEncodingPattern::Make2DStaticTileDistribution();

auto d_dram_small_window = generate_tuple(
Suggested change:
- auto d_dram_small_window = generate_tuple(
+ auto d_dram_windows = generate_tuple(
@@ -154,6 +181,14 @@ struct CShuffleEpilogue
    tile_distribution_pattern::thread_raked>;
constexpr auto dram_tile_distribution = TileEncodingPattern::Make2DStaticTileDistribution();

auto d_dram_small_window = generate_tuple(
    [&](auto idx) { return make_tile_window(ds_dram_window[idx], dram_tile_distribution); },
You have to set the tile window lengths here, as is done for the LDS windows. Otherwise you get a window of size MPerBlock x NPerBlock here.
using elemenet_wise_output_t =
    decltype(load_tile(make_tile_window(out_lds_window, dram_tile_distribution)));
elemenet_wise_output_t elemenet_wise_output;
Why is this actually needed? Why not just overwrite c_out_tensor? That would use far fewer registers.
constexpr int M = 3840;
constexpr int N = 4096;
constexpr int K = 4096;
Please don't use such large inputs in unit-tests.
Proposed changes
Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.
Checklist
Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
[ ] clang-format on all changed files

Discussion
If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered