-
-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Initial RFC for sharding #112
Conversation
Hi @balavishnuvj!, thanks for the Pull Request The first commit message isn't properly formatted. We ask that you update the message to match this format, as we use it to generate changelogs and automate releases.
To Fix: You can fix this problem by running Read more about contributing to ESLint here |
|
Hi @balavishnuvj!, thanks for the Pull Request The first commit message isn't properly formatted. We ask that you update the message to match this format, as we use it to generate changelogs and automate releases.
To Fix: You can fix this problem by running Read more about contributing to ESLint here |
559fd24
to
fcb2ab0
Compare
One important thing to note is that if we rely solely on ESLint internals for sharding then we will very easily create situations where type-aware linting will be a lot more expensive than it should have been. For example if a shard contains 1 file from a project, then type-aware linting would mor eor less need to redo all of the type work performed in another shard. This behaviour would not be obvious to end users (as sharding would likely be an opaque action) which would lead to a lot of wasted time and resources. Ideally some sharding solution should be able to ask the language provider how to best shard the codebase so that correct optimisations can be made. For example in TS projects it's logical to shard a lint run by project so that the files contained in a tsconfig is entirely contained within one shard. This is something we've discussed before in the context of language plugins as a whole in the ecosystem. Note that this situation doesn't just apply to typescript-eslint, but also to eslint-plugin-import - which will do parsing out-of-band and cache that information for the lint run. So sharding the middle of a dependency graph could lead to wasted time and work. |
@bradzacher Adding an option for a plug-in and/or plug-in rule to mark itself as undesirable / unsupported in sharding could help mitigate the effect of that in the short term, until a solution that enable them to also be shared gets added? |
typescript-eslint and eslint-plugin-import are used by something like 70% of the eslint community - which really, really limits the impact of a feature that lands without design consideration for those parts of the ecosystem. I personally don't think this is something that should ship into ESLint unless there is a solution for those usecases given how big of a part of the ecosystem that they are. The reason being that failure to plan for this now could mean we design ourselves into a corner and creat a feature that's too hard to retrofit to support them without major changes. I'm not saying that v1 of this feature needs to explicitly support these cases - but IMO this RFC should at least show serious consideration into how they might be supported in order to ensure that a v2+ can be built without major breaking changes. I'm also not a TSC member, I'm just a maintainer of typescript-eslint - so what I say is just what I'd love to see, nothing more. |
@bradzacher Though not all of eg
And its On the main topic – I'm not sure whether sharding by file or by rules is the proper way to go to speed things up. As @bradzacher points out, some rules are easier to run in the same process and have it share the same caches. For them it would make more sense to have them have a separate process to other more light weight rules and thus enable them to be processed in parallel? I believe that was brought up as an alternative in #42, which essentially suggested same thing thing as this RFC and which resulted in eg. eslint/eslint#13525 and eslint/eslint#14139 before being closed. |
You're correct - we don't recommend type-aware rules by default because there is performance impact to using type information and for many users their codebase is large enough that this impact is too great for them. But we do recommend people use type-aware rules where possible because they are so powerful and cover so much more. Which is why we also provide a separate config, Not all of the typescript-eslint community uses type-aware linting, for sure, just like not all of the eslint-plugin-import community uses the rules that parse dependencies - it's sadly impossible to get exact numbers on those subsets. Sharding by rule is dangerous, in a way, because there are rules that depend on other rules to function. For example in react codebases the core Sharding by rule would need to take these implicit dependencies into account. Additionally sharding by rule doesn't solve the duplicate work problem. For example typescript-eslint type information is calculated at parse-time, not at rule time and is thus shared between rules. Similarly eslint-plugin-import dependency data is calculated at lint time, but is cached and shared between rules. So in both cases sharding by rules is explicitly worse than sharding by files! It is possible that you could have a plugin declare "expensive" rules or "related" rules to have them sharded together, though in future we are considering building "optionally type-aware rules" for which the user can opt-in to stricter behaviour for specific rules at config time. Which is why I say that sharding by file, but with information from the "language plugin" is the best way to think about this, IMO. ESLint can supply a default implementation for the sharding algorithm (eg for the "JS language") which boils down to "split into |
@balavishnuvj thanks for taking the time to put this together, it's a good explainer of why sharding could be helpful but it doesn't explain how the sharding would be implemented. The purpose of the RFC is to put together a tech spec -- you should be investigating how ESLint works and how to implement sharding, not just presenting arguments as to why sharding is a good idea. Without the implementation details, we just don't have enough information to provide any real feedback. This is something we'd like to investigate, but unless someone can suggest a possible implementation, we really can't go any further. |
|
||
Q: Will enabling sharding affect the linting results or introduce any inconsistencies? | ||
|
||
A: Enabling sharding should not affect the linting results or introduce any inconsistencies. Sharding is designed to distribute the workload and execute the linting process independently on each shard. The aggregated results should provide a unified view of the linting issues across all shards, ensuring consistent analysis. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right that this won't affect the current analysis. That said, it can limit future kinds of analysis like multi-file linting (which I've written about here, specifically this section), where sharding could make this impossible or at best become a hindrance that would require new solutions (like opting out some rules out of the sharding strategy).
You could end up with different results when turning on/off sharding, which would be pretty bad.
That said, I don't remember whether ESLint aims to implement this one day, but it's like eslint-plugin-import does in a round-about way.
@nzakas thanks. I have shared this to get an initial idea about the feature and to verify if I have missed any part. Will add implementation details over the next week. |
have added some implementation detail. Can create a draft PR to help with the implementation details. |
c7468f7
to
87a4879
Compare
|
||
- **Resource Consumption**: Sharding involves launching multiple processes/threads to handle each shard concurrently. This can result in increased resource consumption, such as CPU and memory usage, particularly when dealing with a large number of shards or codebases. | ||
|
||
- **Potential Overhead**: The overhead of shard coordination and result aggregation may impact the overall performance gain achieved through parallel execution. Careful optimization and benchmarking should be performed to ensure that the benefits of sharding outweigh the associated overhead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overhead is definitely a concern. As @bradzacher calls out, there could be situations where sharding unexpectedly reduces performance. And given overhead costs, sharding might only be worthwhile in codebases of a large-enough size. How would the user reasonably know when it makes sense to enable sharding? Can/should users realistically be expected to perform "careful optimization and benchmarking" on their own to decide when to use the feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The significant overhead would be to figure out the sharding buckets. But for large enough projects, it would be negligible.
How would the user reasonably know when it makes sense to enable sharding? Can/should users realistically be expected to perform "careful optimization and benchmarking" on their own to decide when to use the feature?
Sharding is an optimization feature, and it is essential for users to exercise diligence when deciding to enable it. While we can provide guidance and recommendations, it is ultimately up to the users to assess their specific use cases and evaluate whether sharding is necessary and beneficial for their codebase.
Users should consider a few factors like Codebase Size, Performance Issues and Resource Availability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The guidelines and recommendations we provide should include example scenarios of when it makes sense to use sharding, when it doesn't make sense, and what kind of performance improvements can be expected. For example, we could include results from real-world codebases like "a codebase of 100k lines across 1k JS files saw a 20% speedup in lint running time from 5 min to 4 min when using sharding" (completely made-up example).
In fact, I think the RFC should test and report the real-world performance impacts (likely on some popular, large, open source repos), in order to justify implementing the sharding feature in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to be too prescriptive about this. In reality, people will likely try it out and determine if there's any benefit to their current process rather than going by any guidelines we publish. As with most of ESLint, I think we can provide the capability and let people try it, use it, or discard it depending on their individual results.
In fact, I think the RFC should test and report the real-world performance impacts (likely on some popular, large, open source repos), in order to justify implementing the sharding feature in the first place.
I disagree. Let's not move the goalposts here. We should be able to come up with scenarios where we reasonably expect sharding to improve performance and also scenarios where we believe it will decrease performance. Building out a full prototype and running it on random projects doesn't give us useful information as most tools would need to be tuned over time to make this work in individual projects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please see my earlier comment about sharding vs. parallel linting. It seems like we are combining those in this RFC, but that's not what the original issue was about.
eslint --shard=1/3 | ||
``` | ||
|
||
- **Workload Division**: Analyze the input codebase and divide it into smaller units of work called shards. The division can be based on files, directories, or any other logical grouping that allows for parallel processing. Each shard should be self-contained and independent of the others. For example, larger files can be distributed across shards to balance the workload. We could also expose api to get custom division statergy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Earlier, @bradzacher mentioned:
Sharding by rule is dangerous, in a way, because there are rules that depend on other rules to function.
For example in react codebases the core no-unused-vars lint rule will not work correctly without react/jsx-uses-vars and react/jsx-uses-react from eslint-plugin-react. Those rules call context.markVariableAsUsed which in turn impacts what no-unused-vars reports!
Sharding by rule would need to take these implicit dependencies into account.
To touch on this a bit more, there are various ways rules could be using shared / global state, as discussed in this thread. This behavior might not be encouraged or officially supported, but just be aware that sharding even a single rule could break things if the rule gathers up state while running across files. It would be good to call this possibility out, whether rules are allowed to do this generally or only through a supported API like markVariableAsUsed
, and how sharding could impact such rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be helpful to provide some details about what other tools do that have sharding. How does Jest decide which files to use in a shard, for example?
Co-authored-by: Bryan Mishkin <[email protected]>
Co-authored-by: Bryan Mishkin <[email protected]>
One thing I want to highlight for everyone: I don't think we should be dissuaded from working through this RFC because there are complexities involved. As with many ESLint features, people ultimately use the ones that work best for them and discard the ones that don't. Even the note about plugins potentially wanting to indicate how to shard is a bit of a distraction at this point. Once that logic is codified, it can always move from the core out into a plugin for modification. What's more interesting at this moment is how sharding as a concept applies to ESLint (or doesn't). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good first step regarding implementation. I think it would be helpful to separate out sharding from parallel linting and also to reference how other tools are already using sharding. That would help ground this RFC and help us better evaluate it.
|
||
The implementation of sharding in the ESLint package involves the following steps: | ||
|
||
- **Configuration**: Introduce a new configuration option or flag in the ESLint configuration file to enable sharding. This option can specify the number of shards or the desired granularity of the shards. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will have to be done at the CLI level. It's not possible to have it in the config file.
eslint --shard=1/3 | ||
``` | ||
|
||
- **Workload Division**: Analyze the input codebase and divide it into smaller units of work called shards. The division can be based on files, directories, or any other logical grouping that allows for parallel processing. Each shard should be self-contained and independent of the others. For example, larger files can be distributed across shards to balance the workload. We could also expose api to get custom division statergy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be helpful to provide some details about what other tools do that have sharding. How does Jest decide which files to use in a shard, for example?
|
||
- **Workload Division**: Analyze the input codebase and divide it into smaller units of work called shards. The division can be based on files, directories, or any other logical grouping that allows for parallel processing. Each shard should be self-contained and independent of the others. For example, larger files can be distributed across shards to balance the workload. We could also expose api to get custom division statergy. | ||
|
||
- **Parallel Execution**: Launch multiple processes or threads to handle each shard concurrently. Each process/thread should execute the linting process independently on its assigned shard. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original issue made it a point to separate sharding from parallel linting.
For this RFC, we should focus on sharding, which involves running ESLint on a subset of the found files rather than the entire set.
|
||
- **Resource Consumption**: Sharding involves launching multiple processes/threads to handle each shard concurrently. This can result in increased resource consumption, such as CPU and memory usage, particularly when dealing with a large number of shards or codebases. | ||
|
||
- **Potential Overhead**: The overhead of shard coordination and result aggregation may impact the overall performance gain achieved through parallel execution. Careful optimization and benchmarking should be performed to ensure that the benefits of sharding outweigh the associated overhead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please see my earlier comment about sharding vs. parallel linting. It seems like we are combining those in this RFC, but that's not what the original issue was about.
## Detailed Design | ||
|
||
1. Get the files in a deterministic order. | ||
in `lib/cli.js`, we can get all the files. Update the `executeOnFiles` function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mean cli-engine.js
? If so, this file is deprecated and will be removed, so changes need to be made in flat-eslint.js
instead.
} | ||
``` | ||
|
||
2. Divide the files into shards. The number of shards is determined by the `--shard` option. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, it would be helpful to point to what other tools do to determine which files are in a shard.
@balavishnuvj are you going to follow up on this? |
No response, so closing. |
Summary
This RFC proposes adding sharding support to the ESLint package. Sharding is a technique used to divide a large workload into smaller, more manageable parts that can be processed concurrently. By introducing sharding in ESLint, we aim to improve linting performance for large codebases and reduce the overall execution time of linting tasks.
Related Issues
eslint/eslint#16726