improve Spelling rule by using word boundaries #900

Open
wants to merge 1 commit into base: main

emteelb
Contributor

@emteelb emteelb commented Nov 12, 2024

By using regex word boundary (\b) delimiters, each filter in the spelling rule applies only to whole words rather than to any word that happens to contain the filter pattern. For example, "\b[cC]he\b" matches only "che" and "Che", whereas the same filter without word boundary delimiters, "[cC]he", also matches misspelled words that contain the pattern, such as "aache" or "chemitsry".

Closes #894
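
The difference described above can be checked with a minimal Python sketch. It assumes a filter is applied as an unanchored regex search against each candidate word, which is consistent with the behavior reported in this thread but is not taken from Vale's source. The helper name `is_filtered` is hypothetical.

```python
import re


def is_filtered(word: str, pattern: str) -> bool:
    """Return True if the filter pattern matches anywhere in the word.

    Assumption: a matching filter causes the word to be skipped by the
    spell check, as reported in this thread.
    """
    return re.search(pattern, word) is not None


words = ["che", "Che", "aache", "chemitsry"]

# Without word boundaries, every word containing "che"/"Che" is skipped.
unbounded = [w for w in words if is_filtered(w, r"[cC]he")]

# With word boundaries, only the standalone words are skipped.
bounded = [w for w in words if is_filtered(w, r"\b[cC]he\b")]

print(unbounded)  # ['che', 'Che', 'aache', 'chemitsry']
print(bounded)    # ['che', 'Che']
```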


github-actions bot commented Nov 12, 2024

Deploy PR Preview failed.

@emteelb
Contributor Author

emteelb commented Nov 12, 2024

This is a "heavy-handed" approach: for consistency, word boundary delimiters are added to every filter regular expression, even to filters that are unambiguous and unlikely to be contained within other words.

@emteelb changed the title from "improve Spelling rules by using word boundaries" to "improve Spelling rule by using word boundaries" on Nov 12, 2024
@ccoVeille

ccoVeille commented Nov 12, 2024

I would like to share my thoughts on this.

Some rules come with a nonword flag.

https://vale.sh/docs/topics/styles/

I don't remember whether the spelling rule supports it.

nonword:true removes the \b word boundaries that are automatically added around each word under the default value, nonword:false.

So if you use nonword:true, you have to add the \b yourself around each word, unless you want to catch words with a prefix or suffix, in which case you use \b on one side only.

It is also useful when you want to catch multiple words, such as \bun\-[a-z]+\b, or punctuation, such as \bvs[\., ]\b.

So adding \b by hand only makes sense if you are using nonword:true AND some of your patterns need \b on one side only. Otherwise, you can keep the default nonword:false and avoid repeating the \b everywhere.
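
As a rough illustration of the nonword behavior described above, the following Python sketch wraps tokens in \b by default and leaves them untouched when nonword is true. It models only what this comment describes and is not a copy of Vale's implementation. The function name `build_pattern` and the sample tokens are hypothetical.

```python
import re


def build_pattern(tokens: list[str], nonword: bool = False) -> re.Pattern:
    """Join tokens into one alternation.

    Assumption (from the comment above): with nonword=False the whole
    alternation is wrapped in \\b...\\b automatically; with nonword=True
    the tokens are used exactly as written.
    """
    joined = "|".join(tokens)
    if nonword:
        return re.compile(f"(?:{joined})")
    return re.compile(rf"\b(?:{joined})\b")


# Default: boundaries are added, so "foo" does not match inside "foobar".
default_rule = build_pattern(["foo"])

# nonword=True with no explicit \b: "foo" now matches inside "foobar".
raw_rule = build_pattern(["foo"], nonword=True)

# nonword=True with hand-written boundaries, as in the example pattern above.
prefix_rule = build_pattern([r"\bun\-[a-z]+\b"], nonword=True)

print(default_rule.search("foobar"))                     # None
print(raw_rule.search("foobar").group(0))                # foo
print(prefix_rule.search("an un-named thing").group(0))  # un-named
```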

Let's go back to this PR: I don't see a need to add \b on both sides. Also, I don't think it works as expected, especially because I don't see a nonword:true flag.

Finally, the spelling rule doesn't invent things: if you add foo, it won't allow foobar. I mean, if you add smally (just an example that is not in the dictionary), it won't allow eyesmally or smallyfish.

So I don't see what you are trying to fix, and things are getting very complicated...

@emteelb
Contributor Author

emteelb commented Nov 13, 2024

I did not know about the nonword key for style rules. Thanks for pointing that out. I notice it's shown in the link you provided, but when I search the Vale documentation for more information, there are no results. If you know where this key is documented, let me know.

Try running the following with styles/RedHat/Spelling.yml as it is now in the main branch:

vale "This is a mistache."

With both RedHat.Spelling = YES and Vale.Spelling = YES in my .vale.ini configuration file, my results are:

 stdin.txt
 1:11  error  Did you really mean             Vale.Spelling
              'mistache'?                                  

✖ 1 error, 0 warnings and 0 suggestions in stdin.

Neither adding nonword: true nor nonword: false to the RH spelling style rule changes this result so I gather that the spelling style rule either does not support this key or else the key does something different that does not change the result.

If I change the "[cC]he" filter in the style rule to match what I have in this PR ("\b[cC]he\b"), and enter the previous vale command, my result is:

 stdin.txt
 1:11  warning  Verify the word 'mistache'. It  RedHat.Spelling
                is not in the American English                 
                spelling dictionary used by                    
                Vale.                                          
 1:11  error    Did you really mean             Vale.Spelling  
                'mistache'?                                    

✖ 1 error, 1 warning and 0 suggestions in stdin.

as would be expected.

Granted, "mistache" is an unlikely typo, but I originally stumbled upon this issue while reviewing some writing that contained a simple transposition mistake, "subcommnad". Because of the RH spelling style rule filter "[Ss]u", it wasn't caught. (I normally enable only one spelling style rule, RH in this case, and disable the Vale spelling style rule.)
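
The effect on both typos can be reproduced with a small, simplified spell-check sketch in Python. The tiny `DICTIONARY` set and the `misspelled` function are hypothetical stand-ins, and the sketch assumes that any word matched by a filter is skipped before the dictionary lookup. It is not Vale's actual pipeline.

```python
import re

# Hypothetical stand-in for a real spelling dictionary.
DICTIONARY = {"run", "the", "this", "is", "a", "subcommand", "mustache"}


def misspelled(text: str, filters: list[str]) -> list[str]:
    """Return words that are neither filtered out nor in the dictionary."""
    compiled = [re.compile(f) for f in filters]
    flagged = []
    for word in re.findall(r"[A-Za-z]+", text):
        # Assumption: a filter that matches anywhere in the word skips it.
        if any(p.search(word) for p in compiled):
            continue
        if word.lower() not in DICTIONARY:
            flagged.append(word)
    return flagged


text = "Run the subcommnad. This is a mistache."

# Filters as partial patterns: both typos are silently skipped.
loose = misspelled(text, ["[Ss]u", "[cC]he"])

# Filters delimited with word boundaries: both typos are flagged.
strict = misspelled(text, [r"\b[Ss]u\b", r"\b[cC]he\b"])

print(loose)   # []
print(strict)  # ['subcommnad', 'mistache']
```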

As I said in a comment above, I took a heavy-handed approach in this PR by delimiting all the filters with word boundaries. I am open to a more selective approach and that might be something the RH team considers. This was just the easiest approach to be certain not to miss any potential word partials.

@ccoVeille

Thanks for your reply and tests.

Let me ask for help.

@jdkato what would you recommend here?

What do you think about what I wrote in my previous post?

@jdkato

jdkato commented Jan 3, 2025

I think this PR solves the problem outlined in #894 sufficiently.

However, this isn't how I personally would go about handling spelling exceptions.

My general practices are:

  1. Filters should be used for ignoring large classes of words that you expect to be false positives. See the built-in ones for examples.

  2. Case-insensitive exceptions should be in a dictionary.

  3. Case-sensitive exceptions should be in a vocabulary, providing the added benefit of ensuring they're ignored during spell check but always capitalized correctly.
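
The three practices above can be sketched as a layered lookup. This is only an illustrative Python model under stated assumptions: the filter pattern, the word lists, and the `check` function are all hypothetical, and the capitalization check reflects the benefit described in item 3 rather than Vale's actual code.

```python
import re

# 1. Filters: ignore a large class of expected false positives.
#    Hypothetical example class: all-caps acronyms.
CLASS_FILTERS = [re.compile(r"\b[A-Z]{2,}\b")]

# 2. Dictionary: case-insensitive exceptions (hypothetical entries).
DICTIONARY = {"kubernetes", "podman"}

# 3. Vocabulary: case-sensitive exceptions (hypothetical entries).
VOCABULARY = {"GitHub", "OpenShift"}


def check(word: str) -> str:
    """Classify a word using filters, then vocabulary, then dictionary."""
    if any(f.search(word) for f in CLASS_FILTERS):
        return "ignored"
    if word in VOCABULARY:
        return "ok"
    # A vocabulary term written with the wrong capitalization is reported.
    by_lower = {v.lower(): v for v in VOCABULARY}
    if word.lower() in by_lower:
        return f"use '{by_lower[word.lower()]}'"
    if word.lower() in DICTIONARY:
        return "ok"
    return "misspelled"


for w in ["API", "GitHub", "github", "Podman", "mistache"]:
    print(w, "->", check(w))
```

In this model, a single-word exception never needs a regex at all, which avoids the partial-match problem discussed in this PR.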

Successfully merging this pull request may close these issues.

Spelling.yml filters word partials leading to misspellings not being flagged