
Extend TestingGuide (adds notes on formatting) #216

Open · wants to merge 4 commits into base: main

Conversation

banach-space (Contributor)
No description provided.

@joker-eph (Contributor) left a comment

Thanks!

* Follow existing conventions.

By applying these best practices, we leverage available tools (e.g., test
function names) to make tests easier to discover and maintain.
@joker-eph (Contributor)

This is a good addition. I would still add some documentation in comments to a test: it's not always clear just from a test name what's going on, and the same logic as in code applies (document the why, the context, the "what's not obvious from the code").

@banach-space (Contributor, Author)

Yes, comments are also super important, thanks for raising this!

I have some ideas on "comments", so let me treat this as encouragement to extend this PR with a dedicated section on documentation :)

@jpienaar (Member) left a comment

Thanks! Yes, discoverability is important here.

@rengolin (Member)

Awesome addition. All good practices that we've been telling each other for years but were always too lazy to encode in a document. :)


When adding new tests, strive to follow these two key rules:

1. **Follow the existing naming and whitespace style.**
@kuhar (Member) commented on Feb 19, 2025

When you say 'existing style', what scope do you have in mind? It could be that the style of a specific directory is inconsistent with the style used in the majority of other dialects, etc. I.e., do we want to make things globally or locally consistent, or something else?

@banach-space (Contributor, Author)

I’m deliberately avoiding specifying this :)

Making things globally consistent would be great for... consistency. However, that assumes one size fits all, which isn't always realistic. Given the current state of things, I also feel this would be an impractical goal.
The same applies to per-file vs per-directory consistency. For example, "mlir/test/Dialect/Vector" has around 50 files - enforcing 100% consistency across them would require significant effort and churn.

My view is that contributors and reviewers should make the right call on a case-by-case basis while being mindful of:

  • Reinventing the wheel in terms of formatting.
  • New tests ignoring existing conventions (whether at the file or directory level).

Ultimately, I see this as a gradual process. Let’s start with these guidelines to help "correct the course." Over time, we can refine them - possibly transitioning from guideline to requirement if it makes sense.

WDYT?

@kuhar (Member)

I've worked on a few different codebases that approached this differently: some preferred local consistency, some global (e.g., google3). I agree that this doesn't have to be one or the other and there's some room for interpretation, but at the same time I'd like to have some codified incentive that will push us towards higher-quality tests/code.

For example, we could say we prefer global consistency unless the local standards are either higher or more specialized. If a PR adheres to the global standards but sticks out like a sore thumb compared to other local code, I'd think that in most cases sending a separate clean-up PR would be a reasonable request.

To reduce cognitive load, use consistent names across MLIR and FileCheck. Also,
instead of using generic names like `%arg0`, encode some additional context by
using names from existing documentation (e.g. from Op definition, see e.g.
[vector.maskedload](https://mlir.llvm.org/docs/Dialects/Vector/#vectormaskedload-vectormaskedloadop)
Member

Suggested change
[vector.maskedload](https://mlir.llvm.org/docs/Dialects/Vector/#vectormaskedload-vectormaskedloadop)
[`vector.maskedload`](https://mlir.llvm.org/docs/Dialects/Vector/#vectormaskedload-vectormaskedloadop)

@banach-space (Contributor, Author)

I am slightly worried that this won't render well, but let's try!
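
To illustrate the naming advice quoted above, here is a minimal sketch: the argument names `base`, `mask`, and `pass_thru` follow the `vector.maskedload` op definition, while the test itself is hypothetical.

// Argument names mirror the op documentation, and the FileCheck captures
// reuse the same names, so MLIR and FileCheck stay consistent.
// CHECK-LABEL: func @maskedload_to_load(
// CHECK-SAME:    %[[BASE:.*]]: memref<16xf32>,
// CHECK-SAME:    %[[MASK:.*]]: vector<16xi1>,
// CHECK-SAME:    %[[PASS_THRU:.*]]: vector<16xf32>)
func.func @maskedload_to_load(%base: memref<16xf32>, %mask: vector<16xi1>, %pass_thru: vector<16xf32>) -> vector<16xf32> {
  %c0 = arith.constant 0 : index
  %ld = vector.maskedload %base[%c0], %mask, %pass_thru
    : memref<16xf32>, vector<16xi1>, vector<16xf32> into vector<16xf32>
  return %ld : vector<16xf32>
}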

Comment on lines +437 to +449
// CHECK-LABEL: func @maskedload_regression_3(
// CHECK-SAME: %[[A0:.*]]: memref<16xf32>,
// CHECK-SAME: %[[A1:.*]]: vector<16xf32>) -> vector<16xf32> {
// CHECK: return %[[A1]] : vector<16xf32>
func.func @maskedload_regression_3(%arg0: memref<16xf32>, %arg1: vector<16xf32>) -> vector<16xf32> {
%c0 = arith.constant 0 : index
%vec_i1 = vector.constant_mask [0] : vector<16xi1>

%ld = vector.maskedload %arg0[%c0], %vec_i1, %arg1
: memref<16xf32>, vector<16xi1>, vector<16xf32> into vector<16xf32>

return %ld : vector<16xf32>
}
@kuhar (Member) commented on Feb 19, 2025

Do we need all 3 examples to drive your point or could we limit this to just two?

@banach-space (Contributor, Author)

"Generalization Takes Three Examples" :)

It allows me to go into a bit of nuance in this paragraph (i.e. to identify the commonalities):

  ##### Step 2: Improve Test Naming

  Instead of using "regression" (which doesn't add unique information), rename
  tests based on key attributes:

  * All examples test the `vector.maskedload` to `vector.load` lowering.
  * The first test uses a _dynamically_ shaped `memref`, while the others use
    _static_ shapes.
  * The mask in the first two examples is "all true" (`vector.constant_mask
    [16]`), while it is "all false" (`vector.constant_mask [0]`) in the third
    example.
  * The first and the third tests use `i32` elements, whereas the second uses
    `i8`.

That's not something I feel super strongly about. Usually it's tricky to see the "optimal" solution on the first iteration, and I guess that's the case here.
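
For illustration, the attribute-based scheme above could yield renames along these lines (the new names are hypothetical, not taken from the PR):

// Hypothetical renames derived from the shared attributes:
//   @maskedload_regression_1 -> @maskedload_to_load_dynamic    (dynamic memref, all-true mask, i32)
//   @maskedload_regression_2 -> @maskedload_to_load_i8         (static shape, all-true mask, i8)
//   @maskedload_regression_3 -> @maskedload_to_load_all_false  (static shape, all-false mask, i32)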

* The mask in the first two examples is "all true" (`vector.constant_mask
[16]`), while it is "all false" (`vector.constant_mask [0]`) in the third
example.
* The first and the third tests use `i32` elements, whereas the second uses
`i8`.


The key attributes should reflect what's important for the tested transform. In this case, having different data types is good for completeness, but they don't trigger any special/separate logic. It would be sufficient to highlight the one case that uses a different data type with, for example, a suffix, and to omit the data type for the others.

I know that's not the intention behind this example, but it can be misused to push for "enterprisification" (classic Java gem: https://projects.haykranen.nl/java/) of naming, with a combinatorial explosion of all the little details.

@banach-space (Contributor, Author)

These are good points and I have come across this myself (without a good solution).

One specific example is Vector ops that can represent "scalars" as:

  • vector<1xf32> vs vector<f32> vs f32.

To differentiate tests, we started adding things like @name_from_vec0d_to_vec1d, @name_from_vec1d_to_vec0d ... It spirals very quickly.

I don't have a solution for this, but I will definitely incorporate your suggestion to... not forget about common sense 😅
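
To make the spiral concrete, a sketch (the types are real Vector-dialect types; the test names are hypothetical):

// Three encodings of a "scalar" in the Vector dialect:
//   f32            - plain scalar
//   vector<f32>    - 0-D vector
//   vector<1xf32>  - 1-D vector with a single element
// Distinguishing every source/target combination produces names like
// @name_from_vec0d_to_vec1d, @name_from_vec1d_to_vec0d, and so on.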

@banach-space (Contributor, Author)

Added a short paragraph in this commit. Let me know what you think - do we need more?


Great clause 👍
It conveys the spirit of these guidelines and defends against overly pedantic rule lawyers.


When adding new tests, strive to follow these two key rules:

1. **Follow the existing naming and whitespace style.**


nit: Should the existing bad style (test_1, test_2, ...) prevent naming new test cases more reasonably?

@banach-space (Contributor, Author)

Great point, thanks! Let me add a dedicated paragraph on that.

Fix formatting, add paragraph on what to do when there's no (good) pre-existing style to follow.
@banach-space force-pushed the andrzej/add_gudelines_testing branch from 8d25ae8 to f362ae2 on February 20, 2025, 11:21
@banach-space (Contributor, Author)

Thank you for the reviews 🙏🏻 I've addressed most of your comments + added a section on "commenting"/"documenting".

Let me know what you think and I will happily iterate based on the feedback.

@dcaballe

Thanks a lot! We definitely needed some specific guidelines here. Thanks for putting in the time!

I think it might be helpful to clarify the level of enforcement we expect for these guidelines and the degree of flexibility we should allow. I fully agree with the motivation points highlighted in the PR but I'm also worried that we make things painfully rigid. If we set the bar too high, there’s a risk that testing becomes tedious enough to potentially impact actual coverage. I’m also worried that some of the renaming/formatting aspects of the guidelines take the focus away in the code reviews from other aspects that should have more attention.

To alleviate some of this, esp. if we are going after a high level of enforcement, I think we should put some effort into automating some of the renaming/formatting aspects. For example, we could extend generate-test-checks.py (or implement a new tool) for MLIR so that it generates more meaningful CHECK variable names by default, derived from the dialect or op name, for example. That could greatly enhance consistency by default while minimizing manual effort.

I also feel consistency is open to interpretation. For example, if an existing test uses V_INPUT_0 as a CHECK variable name and a new test is added with a INPUT0 variable name, should we enforce a change? That name seems valid to me and we don’t enforce such a level of naming consistency on the C++ code. I think we may want to allow the same level of flexibility on the testing side.

Anyways, some food for thoughts and probably unpopular opinions I wanted to share 😊. Great writeup and ideas overall. Thanks!

@banach-space (Contributor, Author)

Thank you for the generous feedback, Diego! These are some great points.

Before addressing specific comments, I see the introduction of new testing standards as a three-step process:

  1. Codify expectations (this PR).
  2. Endorse & embrace—let the community adapt and refine the guidelines.
  3. Enforce—once broadly adopted, introduce tooling to assist with enforcement.

This is merely Step 1. Right now, we have neither guidelines nor a clear place to document best practices.

> I think it might be helpful to clarify the level of enforcement we expect for these guidelines and the degree of flexibility we should allow.

This might be tricky, and I'd rather leave it under-specified. Many LLVM guidelines are also loosely defined, and I think that’s fine.

> I’m also worried that some of the renaming/formatting aspects of the guidelines take the focus away in the code reviews from other aspects that should have more attention.

Agreed - there needs to be a healthy balance. However, IMHO, not enough effort is put into future-proofing and maintaining our tests.

My hope is that these guidelines will streamline discussions and reduce time spent debating formats in reviews. Ultimately, this should free up time to focus on the actual code.

> If we are going after a high level of enforcement, I think we should put some effort into automating some of the renaming/formatting aspects.

I don’t think we should automate this just yet.

> For example, we could extend generate-test-checks.py (or implement a new tool) for MLIR so that it generates more meaningful CHECK variable names by default, derived from the dialect or op name.

I fully agree that better tooling would help, and this is something we should aim for in the future. However, there’s little point in improving tooling if we haven’t yet agreed on what we want to standardize.

  • First, we need to (1) codify best practices.
  • Once the community has (2) embraced these guidelines, we can explore ways to automate (3) enforcement.

I’ve considered extending generate-test-checks.py, but I believe we will always need human input to provide meaningful naming based on semantics.

Below is an automatically generated test check using today's version:

// Automatically generated - today’s version
// CHECK-LABEL:   func.func @bitcast_2d(
// CHECK-SAME:                          %[[VAL_0:[0-9]+|[a-zA-Z$._-][a-zA-Z0-9$._-]*]]: vector<2x4xi32>) -> vector<2x16xi8> {
// CHECK:           %[[VAL_1:.*]] = vector.bitcast %[[VAL_0]] : vector<2x4xi32> to vector<2x2xi64>
// CHECK:           %[[VAL_2:.*]] = vector.bitcast %[[VAL_1]] : vector<2x2xi64> to vector<2x16xi8>
// CHECK:           return %[[VAL_2]] : vector<2x16xi8>
func.func @bitcast_2d(%arg0: vector<2x4xi32>) -> vector<2x16xi8> {
  %0 = vector.bitcast %arg0 : vector<2x4xi32> to vector<2x2xi64>
  %1 = vector.bitcast %0 : vector<2x2xi64> to vector<2x16xi8>
  return %1 : vector<2x16xi8>
}

For a mild improvement, we could extend generate-test-checks.py to rename VAL_0 and VAL_1 to BITCAST_1 and BITCAST_2:

// Automatically generated - possible near-future version (mild improvement)
// CHECK-LABEL:   func.func @bitcast_2d(
// CHECK-SAME:                          %[[ARG_0:.*]]: vector<2x4xi32>) -> vector<2x16xi8> {
// CHECK:           %[[BITCAST_1:.*]] = vector.bitcast %[[ARG_0]] : vector<2x4xi32> to vector<2x2xi64>
// CHECK:           %[[BITCAST_2:.*]] = vector.bitcast %[[BITCAST_1]] : vector<2x2xi64> to vector<2x16xi8>
// CHECK:           return %[[BITCAST_2]] : vector<2x16xi8>
func.func @bitcast_2d(%arg0: vector<2x4xi32>) -> vector<2x16xi8> {
  %0 = vector.bitcast %arg0 : vector<2x4xi32> to vector<2x2xi64>
  %1 = vector.bitcast %0 : vector<2x2xi64> to vector<2x16xi8>
  return %1 : vector<2x16xi8>
}

However, ideally, we’d want something context-aware, such as:

// Human-curated, context-aware naming
// CHECK-LABEL:   func.func @bitcast_2d(
// CHECK-SAME:                          %[[ARG_0:.*]]: vector<2x4xi32>) -> vector<2x16xi8> {
// CHECK:           %[[UPCAST:.*]] = vector.bitcast %[[ARG_0]] : vector<2x4xi32> to vector<2x2xi64>
// CHECK:           %[[DOWNCAST:.*]] = vector.bitcast %[[UPCAST]] : vector<2x2xi64> to vector<2x16xi8>
// CHECK:           return %[[DOWNCAST]] : vector<2x16xi8>
func.func @bitcast_2d(%arg0: vector<2x4xi32>) -> vector<2x16xi8> {
  %0 = vector.bitcast %arg0 : vector<2x4xi32> to vector<2x2xi64>
  %1 = vector.bitcast %0 : vector<2x2xi64> to vector<2x16xi8>
  return %1 : vector<2x16xi8>
}

To achieve this level of naming automation, we’d likely need LSP-based tooling, which is beyond my current capacity to explore.

> I also feel consistency is open to interpretation.

Yes, I deliberately left "consistency" under-specified. I feel it's something where we should trust our judgment. See also this comment from Jakub (my reply).

Ultimately, we could reduce it to a simple question:

  • "Are you re-using an existing style?"
    • If yes, then "You are maintaining a level of consistency, thank you!"
    • If not, then "Please provide rationale." (e.g. "There's no pre-existing style to follow.")
  • Most importantly, "Please be mindful of consistency when contributing tests 🙏🏻 "

> If an existing test uses V_INPUT_0 as a CHECK variable name and a new test is added with INPUT0, should we enforce a change?

In this case, assuming V_INPUT_0 appears in every test in a given file, I would ask the contributor:


  • Why not keep things simple and consistent? Following the existing style makes it easier to compare test cases and identify edge cases. If you choose to diverge, please provide a rationale.

Ultimately, we should also trust our judgment - both as reviewers and contributors.


Final point

The actual guidelines/requirements boil down to:

  • Be consistent and, where possible, follow the existing style.
  • Document your tests using comments, meaningful function names, and descriptive MLIR/LIT variable names.

The examples I included in this PR are just that - examples. I don’t think the general principles are sufficient without concrete illustrations, but again, these are just suggestions to inspire contributors rather than rigid rules.

Thanks for considering this!

Fix typos/formatting, add paragraph on common sense
better, consider refactoring the file to adopt a single, consistent style —
this helps improve our overall testing quality.

This is also encouraged when the existing style leaves room for improvement.


nit: Specify this more explicitly. Does it refer to following the existing style or refactoring?

Comment on lines +661 to +662
/// The permutation map was replaced with vector.transpose
// CHECK-NOT: permutation_map


This point is also made by the second example. Could they be collapsed?

Also, while the previous 3 examples at least are driven by a more detailed narrative, I feel these verbose examples are overkill for a message that is summarized with two bullet points (just to be clear - I mean a smaller example is more readable, not adding more description ;) ).

Or perhaps the key parts could use some emphasis (bold etc.), or the actual test details could be omitted for documentation clarity. A blob of IR triggers tl;dr before I filter out the message.


Same for the convolution example.

While we generally document why, documenting what is valid and encouraged in
cases where:

* The he test generates long or complex output


Suggested change
* The he test generates long or complex output
* The test generates long or complex output

Note that while the comments document what is happening (e.g., "Write the
result back in one shot"), some variables — like `w` and `kw` — remain
"enigmatic" and are not explicitly explained. This might leave us
second-guessing their meaning at first. However, that’s intentional—their


nit: white spaces around dashes

When adding new tests, strive to follow these two key rules:

1. **Follow the existing naming and whitespace style.**
- This applies when modifying existing test files.


s/when modifying existing test files/when modifying existing test files that already contain a number of tests following a particular convention that likely fits in that context/

2. **Consistently document the edge case being tested.**
- Clearly state what makes this test unique and how it complements other
similar tests.


  1. Orthogonal tests
    Do not test the same things again and again. When writing tests, decide what categories C1, C2, ... you want to test (e.g. number of loops, data types). That gives natural naming like `loop_depth2_int`. Struggling with good naming is sometimes reflective of the absence of a regular pattern in the set of tests being written (see the sketch below).
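
For illustration, such categories give a regular naming grid; the names below are hypothetical, building on the `loop_depth2_int` example:

// Hypothetical: with C1 = loop depth {2, 3} and C2 = element type {int, float},
// the test names fall out of the 2x2 grid instead of being invented ad hoc:
//   @loop_depth2_int
//   @loop_depth2_float
//   @loop_depth3_int
//   @loop_depth3_float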


  1. Names and Comments
    Test names should reflect what is being tested. The why shouldn't be encoded in the name; unless it is obvious from the code/context, it should be added as a comment, while avoiding over-explaining.

* Use high-level block comments to describe patterns being tested.
* Think about maintainability - comments should help future developers
understand tests at a glance.
* Comments should assist, not replace reading the code—avoid over-explaining.


s/code—avoid over-explaining./code. Avoid over-explaining.

@javedabsar1

Thanks @banach-space for doing this work. Much appreciated.

@dcaballe

Thanks, Andrzej. Yes, I mostly agree with all that you said! I think the example you provided above illustrates my specific concern:

> For a mild improvement, we could extend generate-test-checks.py to rename VAL_0 and VAL_1 to BITCAST_1 and BITCAST_2:

// Automatically generated - possible near-future version (mild improvement)
// CHECK-LABEL:   func.func @bitcast_2d(
// CHECK-SAME:                          %[[ARG_0:.*]]: vector<2x4xi32>) -> vector<2x16xi8> {
// CHECK:           %[[BITCAST_1:.*]] = vector.bitcast %[[ARG_0]] : vector<2x4xi32> to vector<2x2xi64>
// CHECK:           %[[BITCAST_2:.*]] = vector.bitcast %[[BITCAST_1]] : vector<2x2xi64> to vector<2x16xi8>
// CHECK:           return %[[BITCAST_2]] : vector<2x16xi8>
func.func @bitcast_2d(%arg0: vector<2x4xi32>) -> vector<2x16xi8> {
  %0 = vector.bitcast %arg0 : vector<2x4xi32> to vector<2x2xi64>
  %1 = vector.bitcast %0 : vector<2x2xi64> to vector<2x16xi8>
  return %1 : vector<2x16xi8>
}
> However, ideally, we’d want something context-aware, such as:

// Human-curated, context-aware naming
// CHECK-LABEL:   func.func @bitcast_2d(
// CHECK-SAME:                          %[[ARG_0:.*]]: vector<2x4xi32>) -> vector<2x16xi8> {
// CHECK:           %[[UPCAST:.*]] = vector.bitcast %[[ARG_0]] : vector<2x4xi32> to vector<2x2xi64>
// CHECK:           %[[DOWNCAST:.*]] = vector.bitcast %[[UPCAST]] : vector<2x2xi64> to vector<2x16xi8>
// CHECK:           return %[[DOWNCAST]] : vector<2x16xi8>
func.func @bitcast_2d(%arg0: vector<2x4xi32>) -> vector<2x16xi8> {
  %0 = vector.bitcast %arg0 : vector<2x4xi32> to vector<2x2xi64>
  %1 = vector.bitcast %0 : vector<2x2xi64> to vector<2x16xi8>
  return %1 : vector<2x16xi8>
}

Someone might find that BITCAST_1 is perfect, while someone else might find that it's not capturing enough information. Or the other way around! I'd personally argue that for a test with three CHECK rules, even VAL_1 or A would be just fine. This is the kind of argument that concerns me. As you well said, these guidelines are subject to interpretation and opinion, so I think it's important that we somewhat define where the "acceptable band" is.

Regarding the automation, honestly, I think having a tool that generates variables based on the dialect/op names would be a fantastic starting point and help define where the acceptable bar is. Of course, I'm not saying you should implement such a tool :)
