Skip to content

[SYCL][Matrix] syntax changes as preparation before moving joint matrix from experimental namespace #11215

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Oct 12, 2023

Conversation

yubingex007-a11y
Copy link
Contributor

@yubingex007-a11y yubingex007-a11y commented Sep 19, 2023

As part of the effort to move joint matrix from experimental namespace to supported. A review of the API is being done as part of #7964. This results in the following changes in the syntax:
1- Add Td to joint_matrix_mad as Tc can be different from Td on the GPU, Now, we make D as an input argument to mad.
2- Change “packed” to ext_intel_packed:
3- Move EWOps (get_wi_data, wi_element, get_coord) to detail namespace) 4- add const to joint_matrix in store and mad
5 - add joint_matrix_copy/assignment function
6- add apply with coordination (change existing tests)
7- change get_coord vector type from int32_t to size_t
8- delete explicitly both = and copy ctor.

experimental namespace
As part of the effort to move joint matrix from experimental namespace to supported. A review of the API is being done as part of intel#7964. This results in the following changes in the syntax:
1- Add Td to joint_matrix_mad as Tc can be different from Td on the GPU,
Now, we make D as an input argument to mad.
2-  Change “packed” to ext_intel_packed:
3-  Move EWOps (get_wi_data, wi_element, get_coord) to detail namespace)
4- add const to joint_matrix in store and mad
5 - add joint_matrix_copy/assignment function
6- add apply with coordination (change existing tests)
7- change get_coord vector type from int32_t to size_t
8- delete explicitly both = and copy ctor.
@@ -138,9 +127,9 @@ template <typename Group, typename T, use Use, size_t Rows, size_t Cols,
__SYCL2020_DEPRECATED("get_wi_data() is deprecated for CUDA backend. Please "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove this. This is not really deprecated as joint_matrix is experimental so we can just remove APIs. Deprecated means they still exist and implementations maintain them. In the case of get_wi_data. it is replaced by joint_matrix_apply

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this addressed?

Copy link
Contributor

@dkhaldi dkhaldi Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be addressed by @JackAKirk among other CUDA changes in a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I will make this change as soon as this PR is merged.

@@ -99,7 +88,7 @@ class wi_data {
return jm.cuda_impl.wi_marray.size();
#else
throw runtime_error("get_wi_data is available using: "
"ext::intel::experimental::matrix::get_wi_data.",
"ext::oneapi::detail::get_wi_data.",
PI_ERROR_INVALID_DEVICE);
Copy link
Contributor

@dkhaldi dkhaldi Sep 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not advise users to use get_wi_data. When does this runtime error occur?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i see. wi_data class here is in sycl::ext::oneapi::experimental::matrix namespace and it is for NV. the errmsg is for intel users who uses NV's wi_data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about
get_wi_data is available using: ext::oneapi::detail::get_wi_data but intel users are expected to use joint_matrix_copy instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should never advise users to use anything from detail namespace. Detail namespace are implementation details and can change at any time. It is not part of documented API.

@@ -109,7 +98,7 @@ class wi_data {
return (jm.cuda_impl.wi_marray[i]);
#else
throw runtime_error("get_wi_data is available using: "
"ext::intel::experimental::matrix::get_wi_data.",
"ext::oneapi::detail::get_wi_data.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@@ -262,7 +251,7 @@ inline __SYCL_ALWAYS_INLINE void joint_matrix_load(
Ptr, stride, __spv::MatrixLayout::ColumnMajor,
spv_scope_traits<Group>::value);
break;
case sycl::ext::intel::experimental::matrix::layout::packed:
case sycl::ext::oneapi::experimental::matrix::layout::ext_intel_packed:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: you dont need to specify the whole namespace here

std::size_t M, std::size_t K, std::size_t N, layout LayoutA,
layout LayoutB>
inline __SYCL_ALWAYS_INLINE void joint_matrix_mad(
Group sg, const joint_matrix<Group, Ta, use::a, M, K, LayoutA> &A,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D (destination) should be first, see #11007)

Copy link
Contributor

@dkhaldi dkhaldi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main comments:

  • We should not use get_wi_data or things in detail in tests or errors, these should be replaced with joint_matrix_apply
  • remove namespace when specifying ext_intel_packed so things look shorter

N * 4, matrix_layout::packed_b);
sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
}
joint_matrix_store(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to avoid changes in Legacy folder. Were they caused by clang-format?

ext::intel::experimental::matrix::layout::packed>
joint_matrix<
sub_group, T2, use::b, TK, TN,
ext::oneapi::experimental::matrix::layout::ext_intel_packed>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove namespace

@@ -65,8 +66,7 @@ void matrix_elem_wise_ops(big_matrix<T1, M, N> &C, big_matrix<T2, M, K> &A,
accA.template get_multi_ptr<access::decorated::no>() +
(sg_startx * TM) * K,
K);
auto wi_slice_a =
sycl::ext::intel::experimental::matrix::get_wi_data(sg, sub_a);
auto wi_slice_a = sycl::ext::oneapi::detail::get_wi_data(sg, sub_a);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not use get_wi_data or detail in tests, these should be replaced with joint_matrix_apply

@@ -76,8 +76,7 @@ void matrix_elem_wise_ops(big_matrix<T1, M, N> &C, big_matrix<T2, M, K> &A,
accB.template get_multi_ptr<access::decorated::no>() +
sg_starty / SG_SZ * TN * vnniFactor,
N * vnniFactor);
auto wi_slice_b =
sycl::ext::intel::experimental::matrix::get_wi_data(sg, sub_b);
auto wi_slice_b = sycl::ext::oneapi::detail::get_wi_data(sg, sub_b);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not use get_wi_data or detail in tests, these should be replaced with joint_matrix_apply

@yubingex007-a11y
Copy link
Contributor Author

sycl/test/matrix/matrix-bfloat16-test-coord-basicB.cpp previously failed
sycl/test-e2e/Matrix/get_coord_int8_matB.cpp failed previously
sycl/test-e2e/Matrix/element_wise_irreg_sum_rows.cpp can't be modified easily since:

             for (int i = 0; i < data.length() / (TK / 4); i++) { // 4 per row
               // i*SG_SIZE index is found based on the round robin
               // distribution we are using in the implementation
               sum_local_rows[row + global_idx * (TK / 4)] += data[i + row * 4];
             }

@yubingex007-a11y
Copy link
Contributor Author

I am WIP on rebasing

@yubingex007-a11y
Copy link
Contributor Author

will handle cuda testcase later

@yubingex007-a11y yubingex007-a11y temporarily deployed to WindowsCILock September 21, 2023 14:30 — with GitHub Actions Inactive
"ext::intel::experimental::matrix::get_wi_data.",
PI_ERROR_INVALID_DEVICE);
throw runtime_error(
"get_wi_data is available using: ext::oneapi::detail::get_wi_data, but "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just say: "get_wi_data is unavailable, use joint_matrix_copy instead."

@@ -0,0 +1,158 @@
// RUN: %clangxx -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -fsycl -o query-use %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an old version of the test

@dkhaldi
Copy link
Contributor

dkhaldi commented Sep 21, 2023

sycl/test/matrix/matrix-bfloat16-test-coord-basicB.cpp previously failed sycl/test-e2e/Matrix/get_coord_int8_matB.cpp failed previously sycl/test-e2e/Matrix/element_wise_irreg_sum_rows.cpp can't be modified easily since:

             for (int i = 0; i < data.length() / (TK / 4); i++) { // 4 per row
               // i*SG_SIZE index is found based on the round robin
               // distribution we are using in the implementation
               sum_local_rows[row + global_idx * (TK / 4)] += data[i + row * 4];
             }

We should probably remove this test because it assumes some distribution. Also, it does the same thing as sycl/test-e2e/Matrix/get_coord_int8_matB.cpp

@dkhaldi
Copy link
Contributor

dkhaldi commented Sep 21, 2023

I looked more carefully at https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/Matrix/element_wise_irreg_sum_rows_impl.hpp
this test should be removed as it is duplicate of get_coord_matB and assumes some coordinates.

@yubingex007-a11y yubingex007-a11y temporarily deployed to WindowsCILock October 11, 2023 18:24 — with GitHub Actions Inactive
@yubingex007-a11y yubingex007-a11y temporarily deployed to WindowsCILock October 11, 2023 18:56 — with GitHub Actions Inactive
@yubingex007-a11y yubingex007-a11y temporarily deployed to WindowsCILock October 11, 2023 19:08 — with GitHub Actions Inactive
Copy link
Contributor

@dkhaldi dkhaldi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yubingex007-a11y yubingex007-a11y temporarily deployed to WindowsCILock October 11, 2023 19:33 — with GitHub Actions Inactive
@yubingex007-a11y yubingex007-a11y requested a review from a team October 11, 2023 23:54
@yubingex007-a11y
Copy link
Contributor Author

@intel/llvm-gatekeepers ping?

@steffenlarsen
Copy link
Contributor

@YuriPlyakhin has requested changes approval is needed.

@yubingex007-a11y
Copy link
Contributor Author

@YuriPlyakhin could you approve the pr? Dounia has answered your comments above and if there should be small changes, we can create a new pr.

@JackAKirk JackAKirk self-requested a review October 12, 2023 11:12
@dkhaldi
Copy link
Contributor

dkhaldi commented Oct 12, 2023

@intel/llvm-gatekeepers, please help merge

@dm-vodopyanov dm-vodopyanov changed the title [Matrix] syntax changes as prepraration before moving joint matrix from experimental namespace [SYCL][Matrix] syntax changes as preparation before moving joint matrix from experimental namespace Oct 12, 2023
@dm-vodopyanov
Copy link
Contributor

@intel/llvm-gatekeepers, please help merge

There is an input from Steffen above.

@dkhaldi
Copy link
Contributor

dkhaldi commented Oct 12, 2023

@intel/llvm-gatekeepers, please help merge

There is an input from Steffen above.

Correct, I missed that. Yury is OOO today but this can wait.
We should just make sure this gets merged before #11485 is merged

@dm-vodopyanov
Copy link
Contributor

@intel/llvm-gatekeepers, please help merge

There is an input from Steffen above.

Correct, I missed that. Yury is OOO today but this can wait. We should just make sure this gets merged before #11485 is merged

Updated the description of #11485

Copy link
Contributor

@YuriPlyakhin YuriPlyakhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Important comments were addressed. Test fine tuning can be done later.

@YuriPlyakhin
Copy link
Contributor

@intel/llvm-gatekeepers , I approved, please, merge.

@dm-vodopyanov dm-vodopyanov merged commit 687f579 into intel:sycl Oct 12, 2023
aelovikov-intel pushed a commit that referenced this pull request Oct 20, 2023
As discussed in #11215 this patch:

- removed mutable from `joint_matrix_cuda` (This change requires an
upstream llvm patch (https://reviews.llvm.org/rGb781c7ab574f))
- removed `get_wi_data()`

I also added back the cases that the change in the `joint_matrix_mad`
interface allows: namely when the type of C/D matrices differ. I
correspondingly updated the tests, to test the new cases that are
supported.

I also updated the support matrix for cuda in the spec doc for the newly
supported combinations.

---------

Signed-off-by: JackAKirk <[email protected]>
againull pushed a commit that referenced this pull request Oct 30, 2023
* Support one block AMD matrix core instructions for `__gfx90a__`
architecture.
* Supports `__builtin_amdgcn_mfma_i32_32x32x8i8`,
`__builtin_amdgcn_mfma_i32_16x16x16i8`,
`__builtin_amdgcn_mfma_f64_16x16x4f64`,
`__builtin_amdgcn_mfma_f32_32x32x8bf16_1k`,
`__builtin_amdgcn_mfma_f32_16x16x16bf16_1k`,
`__builtin_amdgcn_mfma_f32_32x32x8f16` and
`__builtin_amdgcn_mfma_f32_16x16x16f16` instructions.
* Add HIP matrix core support into joint_matrix documentation.

Should be merged after
- #11215

---------

Co-authored-by: Bing1 Yu <[email protected]>
Co-authored-by: mmoadeli <[email protected]>
maarquitos14 pushed a commit that referenced this pull request Oct 31, 2023
* Support one block AMD matrix core instructions for `__gfx90a__`
architecture.
* Supports `__builtin_amdgcn_mfma_i32_32x32x8i8`,
`__builtin_amdgcn_mfma_i32_16x16x16i8`,
`__builtin_amdgcn_mfma_f64_16x16x4f64`,
`__builtin_amdgcn_mfma_f32_32x32x8bf16_1k`,
`__builtin_amdgcn_mfma_f32_16x16x16bf16_1k`,
`__builtin_amdgcn_mfma_f32_32x32x8f16` and
`__builtin_amdgcn_mfma_f32_16x16x16f16` instructions.
* Add HIP matrix core support into joint_matrix documentation.

Should be merged after
- #11215

---------

Co-authored-by: Bing1 Yu <[email protected]>
Co-authored-by: mmoadeli <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants