Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Initial RFC for sharding #112

Closed
wants to merge 4 commits into from

Conversation

balavishnuvj
Copy link

Summary

This RFC proposes adding sharding support to the ESLint package. Sharding is a technique used to divide a large workload into smaller, more manageable parts that can be processed concurrently. By introducing sharding in ESLint, we aim to improve linting performance for large codebases and reduce the overall execution time of linting tasks.

Related Issues

eslint/eslint#16726

@eslint-github-bot
Copy link

Hi @balavishnuvj!, thanks for the Pull Request

The first commit message isn't properly formatted. We ask that you update the message to match this format, as we use it to generate changelogs and automate releases.

  • The commit message tag wasn't recognized. Did you mean "docs", "fix", or "feat"?
  • There should be a space following the initial tag and colon, for example 'feat: Message'.
  • The first letter of the tag should be in lowercase

To Fix: You can fix this problem by running git commit --amend, editing your commit message, and then running git push -f to update this pull request.

Read more about contributing to ESLint here

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jun 1, 2023

CLA Missing ID CLA Not Signed

@balavishnuvj balavishnuvj changed the title Initial RFC for sharding feat: Initial RFC for sharding Jun 1, 2023
@eslint-github-bot
Copy link

Hi @balavishnuvj!, thanks for the Pull Request

The first commit message isn't properly formatted. We ask that you update the message to match this format, as we use it to generate changelogs and automate releases.

  • The commit message tag wasn't recognized. Did you mean "docs", "fix", or "feat"?
  • There should be a space following the initial tag and colon, for example 'feat: Message'.
  • The first letter of the tag should be in lowercase

To Fix: You can fix this problem by running git commit --amend, editing your commit message, and then running git push -f to update this pull request.

Read more about contributing to ESLint here

@bradzacher
Copy link

bradzacher commented Jun 1, 2023

One important thing to note is that if we rely solely on ESLint internals for sharding then we will very easily create situations where type-aware linting will be a lot more expensive than it should have been.

For example if a shard contains 1 file from a project, then type-aware linting would mor eor less need to redo all of the type work performed in another shard.

This behaviour would not be obvious to end users (as sharding would likely be an opaque action) which would lead to a lot of wasted time and resources.

Ideally some sharding solution should be able to ask the language provider how to best shard the codebase so that correct optimisations can be made.

For example in TS projects it's logical to shard a lint run by project so that the files contained in a tsconfig is entirely contained within one shard.

This is something we've discussed before in the context of language plugins as a whole in the ecosystem.

Note that this situation doesn't just apply to typescript-eslint, but also to eslint-plugin-import - which will do parsing out-of-band and cache that information for the lint run. So sharding the middle of a dependency graph could lead to wasted time and work.

@voxpelli
Copy link

voxpelli commented Jun 1, 2023

@bradzacher Adding an option for a plug-in and/or plug-in rule to mark itself as undesirable / unsupported in sharding could help mitigate the effect of that in the short term, until a solution that enable them to also be shared gets added?

@bradzacher
Copy link

bradzacher commented Jun 1, 2023

typescript-eslint and eslint-plugin-import are used by something like 70% of the eslint community - which really, really limits the impact of a feature that lands without design consideration for those parts of the ecosystem.

I personally don't think this is something that should ship into ESLint unless there is a solution for those usecases given how big of a part of the ecosystem that they are.

The reason being that failure to plan for this now could mean we design ourselves into a corner and creat a feature that's too hard to retrofit to support them without major changes.

I'm not saying that v1 of this feature needs to explicitly support these cases - but IMO this RFC should at least show serious consideration into how they might be supported in order to ensure that a v2+ can be built without major breaking changes.

I'm also not a TSC member, I'm just a maintainer of typescript-eslint - so what I say is just what I'd love to see, nothing more.

@voxpelli
Copy link

voxpelli commented Jun 1, 2023

@bradzacher Though not all of eg typescript-eslint requires it to read a full project and it already warns of the performance impact of enabling such rules?

By including parserOptions.project in your config, you incur the performance penalty of asking TypeScript to do a build of your project before ESLint can do its linting.

And its recommended config does not include any such rules?


On the main topic – I'm not sure whether sharding by file or by rules is the proper way to go to speed things up.

As @bradzacher points out, some rules are easier to run in the same process and have it share the same caches. For them it would make more sense to have them have a separate process to other more light weight rules and thus enable them to be processed in parallel?

I believe that was brought up as an alternative in #42, which essentially suggested same thing thing as this RFC and which resulted in eg. eslint/eslint#13525 and eslint/eslint#14139 before being closed.

@bradzacher
Copy link

You're correct - we don't recommend type-aware rules by default because there is performance impact to using type information and for many users their codebase is large enough that this impact is too great for them.

But we do recommend people use type-aware rules where possible because they are so powerful and cover so much more. Which is why we also provide a separate config, recommended-requiring-type-checking, which is our recommended set for those that want type-aware linting.

Not all of the typescript-eslint community uses type-aware linting, for sure, just like not all of the eslint-plugin-import community uses the rules that parse dependencies - it's sadly impossible to get exact numbers on those subsets.


Sharding by rule is dangerous, in a way, because there are rules that depend on other rules to function.

For example in react codebases the core no-unused-vars lint rule will not work correctly without react/jsx-uses-vars and react/jsx-uses-react from eslint-plugin-react. Those rules call context.markVariableAsUsed which in turn impacts what no-unused-vars reports!

Sharding by rule would need to take these implicit dependencies into account.

Additionally sharding by rule doesn't solve the duplicate work problem. For example typescript-eslint type information is calculated at parse-time, not at rule time and is thus shared between rules. Similarly eslint-plugin-import dependency data is calculated at lint time, but is cached and shared between rules. So in both cases sharding by rules is explicitly worse than sharding by files!

It is possible that you could have a plugin declare "expensive" rules or "related" rules to have them sharded together, though in future we are considering building "optionally type-aware rules" for which the user can opt-in to stricter behaviour for specific rules at config time.

Which is why I say that sharding by file, but with information from the "language plugin" is the best way to think about this, IMO. ESLint can supply a default implementation for the sharding algorithm (eg for the "JS language") which boils down to "split into N shards naively using a deterministic algorithm", and typescript-eslint could specify a new implementation which shards around projects instead. If they wanted, eslint-plugin-import could provide an override implementation which is dependency-tree aware.

@nzakas nzakas added the Initial Commenting This RFC is in the initial feedback stage label Jun 1, 2023
@nzakas
Copy link
Member

nzakas commented Jun 1, 2023

@balavishnuvj thanks for taking the time to put this together, it's a good explainer of why sharding could be helpful but it doesn't explain how the sharding would be implemented. The purpose of the RFC is to put together a tech spec -- you should be investigating how ESLint works and how to implement sharding, not just presenting arguments as to why sharding is a good idea. Without the implementation details, we just don't have enough information to provide any real feedback.

This is something we'd like to investigate, but unless someone can suggest a possible implementation, we really can't go any further.


Q: Will enabling sharding affect the linting results or introduce any inconsistencies?

A: Enabling sharding should not affect the linting results or introduce any inconsistencies. Sharding is designed to distribute the workload and execute the linting process independently on each shard. The aggregated results should provide a unified view of the linting issues across all shards, ensuring consistent analysis.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right that this won't affect the current analysis. That said, it can limit future kinds of analysis like multi-file linting (which I've written about here, specifically this section), where sharding could make this impossible or at best become a hindrance that would require new solutions (like opting out some rules out of the sharding strategy).
You could end up with different results when turning on/off sharding, which would be pretty bad.
That said, I don't remember whether ESLint aims to implement this one day, but it's like eslint-plugin-import does in a round-about way.

@balavishnuvj
Copy link
Author

@nzakas thanks. I have shared this to get an initial idea about the feature and to verify if I have missed any part. Will add implementation details over the next week.

@balavishnuvj
Copy link
Author

have added some implementation detail. Can create a draft PR to help with the implementation details.

designs/2023-sharding/README.md Outdated Show resolved Hide resolved
designs/2023-sharding/README.md Outdated Show resolved Hide resolved

- **Resource Consumption**: Sharding involves launching multiple processes/threads to handle each shard concurrently. This can result in increased resource consumption, such as CPU and memory usage, particularly when dealing with a large number of shards or codebases.

- **Potential Overhead**: The overhead of shard coordination and result aggregation may impact the overall performance gain achieved through parallel execution. Careful optimization and benchmarking should be performed to ensure that the benefits of sharding outweigh the associated overhead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overhead is definitely a concern. As @bradzacher calls out, there could be situations where sharding unexpectedly reduces performance. And given overhead costs, sharding might only be worthwhile in codebases of a large-enough size. How would the user reasonably know when it makes sense to enable sharding? Can/should users realistically be expected to perform "careful optimization and benchmarking" on their own to decide when to use the feature?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The significant overhead would be to figure out the sharding buckets. But for large enough projects, it would be negligible.

How would the user reasonably know when it makes sense to enable sharding? Can/should users realistically be expected to perform "careful optimization and benchmarking" on their own to decide when to use the feature?

Sharding is an optimization feature, and it is essential for users to exercise diligence when deciding to enable it. While we can provide guidance and recommendations, it is ultimately up to the users to assess their specific use cases and evaluate whether sharding is necessary and beneficial for their codebase.

Users should consider a few factors like Codebase Size, Performance Issues and Resource Availability.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guidelines and recommendations we provide should include example scenarios of when it makes sense to use sharding, when it doesn't make sense, and what kind of performance improvements can be expected. For example, we could include results from real-world codebases like "a codebase of 100k lines across 1k JS files saw a 20% speedup in lint running time from 5 min to 4 min when using sharding" (completely made-up example).

In fact, I think the RFC should test and report the real-world performance impacts (likely on some popular, large, open source repos), in order to justify implementing the sharding feature in the first place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to be too prescriptive about this. In reality, people will likely try it out and determine if there's any benefit to their current process rather than going by any guidelines we publish. As with most of ESLint, I think we can provide the capability and let people try it, use it, or discard it depending on their individual results.

In fact, I think the RFC should test and report the real-world performance impacts (likely on some popular, large, open source repos), in order to justify implementing the sharding feature in the first place.

I disagree. Let's not move the goalposts here. We should be able to come up with scenarios where we reasonably expect sharding to improve performance and also scenarios where we believe it will decrease performance. Building out a full prototype and running it on random projects doesn't give us useful information as most tools would need to be tuned over time to make this work in individual projects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please see my earlier comment about sharding vs. parallel linting. It seems like we are combining those in this RFC, but that's not what the original issue was about.

eslint --shard=1/3
```

- **Workload Division**: Analyze the input codebase and divide it into smaller units of work called shards. The division can be based on files, directories, or any other logical grouping that allows for parallel processing. Each shard should be self-contained and independent of the others. For example, larger files can be distributed across shards to balance the workload. We could also expose api to get custom division statergy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier, @bradzacher mentioned:

Sharding by rule is dangerous, in a way, because there are rules that depend on other rules to function.

For example in react codebases the core no-unused-vars lint rule will not work correctly without react/jsx-uses-vars and react/jsx-uses-react from eslint-plugin-react. Those rules call context.markVariableAsUsed which in turn impacts what no-unused-vars reports!

Sharding by rule would need to take these implicit dependencies into account.

To touch on this a bit more, there are various ways rules could be using shared / global state, as discussed in this thread. This behavior might not be encouraged or officially supported, but just be aware that sharding even a single rule could break things if the rule gathers up state while running across files. It would be good to call this possibility out, whether rules are allowed to do this generally or only through a supported API like markVariableAsUsed, and how sharding could impact such rules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to provide some details about what other tools do that have sharding. How does Jest decide which files to use in a shard, for example?

@nzakas
Copy link
Member

nzakas commented Jun 14, 2023

One thing I want to highlight for everyone: I don't think we should be dissuaded from working through this RFC because there are complexities involved. As with many ESLint features, people ultimately use the ones that work best for them and discard the ones that don't. Even the note about plugins potentially wanting to indicate how to shard is a bit of a distraction at this point. Once that logic is codified, it can always move from the core out into a plugin for modification. What's more interesting at this moment is how sharding as a concept applies to ESLint (or doesn't).

Copy link
Member

@nzakas nzakas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good first step regarding implementation. I think it would be helpful to separate out sharding from parallel linting and also to reference how other tools are already using sharding. That would help ground this RFC and help us better evaluate it.


The implementation of sharding in the ESLint package involves the following steps:

- **Configuration**: Introduce a new configuration option or flag in the ESLint configuration file to enable sharding. This option can specify the number of shards or the desired granularity of the shards.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will have to be done at the CLI level. It's not possible to have it in the config file.

eslint --shard=1/3
```

- **Workload Division**: Analyze the input codebase and divide it into smaller units of work called shards. The division can be based on files, directories, or any other logical grouping that allows for parallel processing. Each shard should be self-contained and independent of the others. For example, larger files can be distributed across shards to balance the workload. We could also expose api to get custom division statergy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to provide some details about what other tools do that have sharding. How does Jest decide which files to use in a shard, for example?


- **Workload Division**: Analyze the input codebase and divide it into smaller units of work called shards. The division can be based on files, directories, or any other logical grouping that allows for parallel processing. Each shard should be self-contained and independent of the others. For example, larger files can be distributed across shards to balance the workload. We could also expose api to get custom division statergy.

- **Parallel Execution**: Launch multiple processes or threads to handle each shard concurrently. Each process/thread should execute the linting process independently on its assigned shard.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original issue made it a point to separate sharding from parallel linting.

For this RFC, we should focus on sharding, which involves running ESLint on a subset of the found files rather than the entire set.


- **Resource Consumption**: Sharding involves launching multiple processes/threads to handle each shard concurrently. This can result in increased resource consumption, such as CPU and memory usage, particularly when dealing with a large number of shards or codebases.

- **Potential Overhead**: The overhead of shard coordination and result aggregation may impact the overall performance gain achieved through parallel execution. Careful optimization and benchmarking should be performed to ensure that the benefits of sharding outweigh the associated overhead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please see my earlier comment about sharding vs. parallel linting. It seems like we are combining those in this RFC, but that's not what the original issue was about.

## Detailed Design

1. Get the files in a deterministic order.
in `lib/cli.js`, we can get all the files. Update the `executeOnFiles` function
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean cli-engine.js? If so, this file is deprecated and will be removed, so changes need to be made in flat-eslint.js instead.

}
```

2. Divide the files into shards. The number of shards is determined by the `--shard` option.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, it would be helpful to point to what other tools do to determine which files are in a shard.

@nzakas
Copy link
Member

nzakas commented Oct 9, 2023

@balavishnuvj are you going to follow up on this?

@nzakas
Copy link
Member

nzakas commented Nov 13, 2023

No response, so closing.

@nzakas nzakas closed this Nov 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Initial Commenting This RFC is in the initial feedback stage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants