-
Notifications
You must be signed in to change notification settings - Fork 769
[SYCL][Matrix] syntax changes as preparation before moving joint matrix from experimental namespace #11215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
experimental namespace As part of the effort to move joint matrix from experimental namespace to supported. A review of the API is being done as part of intel#7964. This results in the following changes in the syntax: 1- Add Td to joint_matrix_mad as Tc can be different from Td on the GPU, Now, we make D as an input argument to mad. 2- Change “packed” to ext_intel_packed: 3- Move EWOps (get_wi_data, wi_element, get_coord) to detail namespace) 4- add const to joint_matrix in store and mad 5 - add joint_matrix_copy/assignment function 6- add apply with coordination (change existing tests) 7- change get_coord vector type from int32_t to size_t 8- delete explicitly both = and copy ctor.
@@ -138,9 +127,9 @@ template <typename Group, typename T, use Use, size_t Rows, size_t Cols, | |||
__SYCL2020_DEPRECATED("get_wi_data() is deprecated for CUDA backend. Please " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remove this. This is not really deprecated as joint_matrix is experimental so we can just remove APIs. Deprecated means they still exist and implementations maintain them. In the case of get_wi_data. it is replaced by joint_matrix_apply
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this addressed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be addressed by @JackAKirk among other CUDA changes in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I will make this change as soon as this PR is merged.
@@ -99,7 +88,7 @@ class wi_data { | |||
return jm.cuda_impl.wi_marray.size(); | |||
#else | |||
throw runtime_error("get_wi_data is available using: " | |||
"ext::intel::experimental::matrix::get_wi_data.", | |||
"ext::oneapi::detail::get_wi_data.", | |||
PI_ERROR_INVALID_DEVICE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not advise users to use get_wi_data
. When does this runtime error occur?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i see. wi_data class here is in sycl::ext::oneapi::experimental::matrix namespace and it is for NV. the errmsg is for intel users who uses NV's wi_data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about
get_wi_data is available using: ext::oneapi::detail::get_wi_data but intel users are expected to use joint_matrix_copy instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should never advise users to use anything from detail namespace. Detail namespace are implementation details and can change at any time. It is not part of documented API.
@@ -109,7 +98,7 @@ class wi_data { | |||
return (jm.cuda_impl.wi_marray[i]); | |||
#else | |||
throw runtime_error("get_wi_data is available using: " | |||
"ext::intel::experimental::matrix::get_wi_data.", | |||
"ext::oneapi::detail::get_wi_data.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
@@ -262,7 +251,7 @@ inline __SYCL_ALWAYS_INLINE void joint_matrix_load( | |||
Ptr, stride, __spv::MatrixLayout::ColumnMajor, | |||
spv_scope_traits<Group>::value); | |||
break; | |||
case sycl::ext::intel::experimental::matrix::layout::packed: | |||
case sycl::ext::oneapi::experimental::matrix::layout::ext_intel_packed: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: you dont need to specify the whole namespace here
std::size_t M, std::size_t K, std::size_t N, layout LayoutA, | ||
layout LayoutB> | ||
inline __SYCL_ALWAYS_INLINE void joint_matrix_mad( | ||
Group sg, const joint_matrix<Group, Ta, use::a, M, K, LayoutA> &A, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
D (destination) should be first, see #11007)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main comments:
- We should not use
get_wi_data
or things indetail
in tests or errors, these should be replaced with joint_matrix_apply - remove namespace when specifying ext_intel_packed so things look shorter
N * 4, matrix_layout::packed_b); | ||
sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c); | ||
} | ||
joint_matrix_store( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to avoid changes in Legacy folder. Were they caused by clang-format?
ext::intel::experimental::matrix::layout::packed> | ||
joint_matrix< | ||
sub_group, T2, use::b, TK, TN, | ||
ext::oneapi::experimental::matrix::layout::ext_intel_packed> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove namespace
@@ -65,8 +66,7 @@ void matrix_elem_wise_ops(big_matrix<T1, M, N> &C, big_matrix<T2, M, K> &A, | |||
accA.template get_multi_ptr<access::decorated::no>() + | |||
(sg_startx * TM) * K, | |||
K); | |||
auto wi_slice_a = | |||
sycl::ext::intel::experimental::matrix::get_wi_data(sg, sub_a); | |||
auto wi_slice_a = sycl::ext::oneapi::detail::get_wi_data(sg, sub_a); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not use get_wi_data or detail in tests, these should be replaced with joint_matrix_apply
@@ -76,8 +76,7 @@ void matrix_elem_wise_ops(big_matrix<T1, M, N> &C, big_matrix<T2, M, K> &A, | |||
accB.template get_multi_ptr<access::decorated::no>() + | |||
sg_starty / SG_SZ * TN * vnniFactor, | |||
N * vnniFactor); | |||
auto wi_slice_b = | |||
sycl::ext::intel::experimental::matrix::get_wi_data(sg, sub_b); | |||
auto wi_slice_b = sycl::ext::oneapi::detail::get_wi_data(sg, sub_b); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not use get_wi_data or detail in tests, these should be replaced with joint_matrix_apply
sycl/test/matrix/matrix-bfloat16-test-coord-basicB.cpp previously failed
|
I am WIP on rebasing |
will handle cuda testcase later |
"ext::intel::experimental::matrix::get_wi_data.", | ||
PI_ERROR_INVALID_DEVICE); | ||
throw runtime_error( | ||
"get_wi_data is available using: ext::oneapi::detail::get_wi_data, but " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just say: "get_wi_data is unavailable, use joint_matrix_copy instead."
sycl/test/matrix/query-use.cpp
Outdated
@@ -0,0 +1,158 @@ | |||
// RUN: %clangxx -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -fsycl -o query-use %s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an old version of the test
We should probably remove this test because it assumes some distribution. Also, it does the same thing as sycl/test-e2e/Matrix/get_coord_int8_matB.cpp |
I looked more carefully at https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/Matrix/element_wise_irreg_sum_rows_impl.hpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@intel/llvm-gatekeepers ping? |
@YuriPlyakhin has requested changes approval is needed. |
@YuriPlyakhin could you approve the pr? Dounia has answered your comments above and if there should be small changes, we can create a new pr. |
@intel/llvm-gatekeepers, please help merge |
There is an input from Steffen above. |
Correct, I missed that. Yury is OOO today but this can wait. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Important comments were addressed. Test fine tuning can be done later.
@intel/llvm-gatekeepers , I approved, please, merge. |
As discussed in #11215 this patch: - removed mutable from `joint_matrix_cuda` (This change requires an upstream llvm patch (https://reviews.llvm.org/rGb781c7ab574f)) - removed `get_wi_data()` I also added back the cases that the change in the `joint_matrix_mad` interface allows: namely when the type of C/D matrices differ. I correspondingly updated the tests, to test the new cases that are supported. I also updated the support matrix for cuda in the spec doc for the newly supported combinations. --------- Signed-off-by: JackAKirk <[email protected]>
* Support one block AMD matrix core instructions for `__gfx90a__` architecture. * Supports `__builtin_amdgcn_mfma_i32_32x32x8i8`, `__builtin_amdgcn_mfma_i32_16x16x16i8`, `__builtin_amdgcn_mfma_f64_16x16x4f64`, `__builtin_amdgcn_mfma_f32_32x32x8bf16_1k`, `__builtin_amdgcn_mfma_f32_16x16x16bf16_1k`, `__builtin_amdgcn_mfma_f32_32x32x8f16` and `__builtin_amdgcn_mfma_f32_16x16x16f16` instructions. * Add HIP matrix core support into joint_matrix documentation. Should be merged after - #11215 --------- Co-authored-by: Bing1 Yu <[email protected]> Co-authored-by: mmoadeli <[email protected]>
* Support one block AMD matrix core instructions for `__gfx90a__` architecture. * Supports `__builtin_amdgcn_mfma_i32_32x32x8i8`, `__builtin_amdgcn_mfma_i32_16x16x16i8`, `__builtin_amdgcn_mfma_f64_16x16x4f64`, `__builtin_amdgcn_mfma_f32_32x32x8bf16_1k`, `__builtin_amdgcn_mfma_f32_16x16x16bf16_1k`, `__builtin_amdgcn_mfma_f32_32x32x8f16` and `__builtin_amdgcn_mfma_f32_16x16x16f16` instructions. * Add HIP matrix core support into joint_matrix documentation. Should be merged after - #11215 --------- Co-authored-by: Bing1 Yu <[email protected]> Co-authored-by: mmoadeli <[email protected]>
As part of the effort to move joint matrix from experimental namespace to supported. A review of the API is being done as part of #7964. This results in the following changes in the syntax:
1- Add Td to joint_matrix_mad as Tc can be different from Td on the GPU, Now, we make D as an input argument to mad.
2- Change “packed” to ext_intel_packed:
3- Move EWOps (get_wi_data, wi_element, get_coord) to detail namespace) 4- add const to joint_matrix in store and mad
5 - add joint_matrix_copy/assignment function
6- add apply with coordination (change existing tests)
7- change get_coord vector type from int32_t to size_t
8- delete explicitly both = and copy ctor.