-
-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Add linkcheck_case_sensitive configuration option #14046
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add linkcheck_case_sensitive configuration option #14046
Conversation
|
Hi @FazeelUsmani - thank you for developing and describing this pull request. I have a concern that enabling the option reduces the precision of other hyperlinks that are checked. Could you explain the use case where it would be easier for a documentation project to enable this option by editing the |
|
That’s a good point, @jayaddison. |
|
@FazeelUsmani got it, understood. As often happens, I had a misunderstanding to begin with - you are saying that this only affects whether case-adjusted response URLs are considered to be Let me think about this a little more; I do understand the value in this now, but am wary of (and trying to think of) any problem side-effects. |
|
(also, thank you for the explanation) |
|
Separately: I do think that we should probably isolate the |
|
Hmm.. makes sense. I can refactor this into two separate options: This would give users more granular control. Most users would likely want |
|
Two options seems overkill for this use-case. What do browsers do de facto on case mismatches on fragment IDs? A |
|
Fair point — browsers generally treat fragment IDs as case-sensitive, though behavior can vary depending on the HTML generator. My thought was mainly to avoid false negatives in edge cases (like auto-generated anchors that normalize casing differently). |
|
I can't think of drawbacks to the The anchor-checking I'm less certain about; given that we believe browsers seem to navigate to anchors case-sensitively -- something I too checked locally and that is certainly the case in Firefox 140.4 -- I'd be reluctant to offer that without a demonstrable use-case (again that can't be solved easily by fixing the source documentation). |
|
That makes sense — I’ll keep it as a single |
|
Sounds good to me! Thanks @FazeelUsmani. |
56d6a63 to
d115b1e
Compare
| .. confval:: linkcheck_case_insensitive | ||
| :type: :code-py:`bool` | ||
| :default: :code-py:`False` | ||
|
|
||
| When :code-py:`True`, the *linkcheck* builder will compare URL paths | ||
| case-insensitively when checking for redirects. | ||
| This is useful for checking links on case-insensitive servers | ||
| (for example, GitHub, Windows-based servers, or certain hosting platforms) | ||
| that may return URLs with different case than the original link. | ||
|
|
||
| When enabled, URL paths like ``/Path`` and ``/path`` are considered | ||
| equivalent, preventing false-positive redirect warnings on | ||
| case-insensitive servers. | ||
|
|
||
| .. note:: | ||
|
|
||
| This option only affects URL path comparison for redirect detection. | ||
| HTML anchor checking remains case-sensitive to match browser behavior, | ||
| where fragment identifiers (``#anchor``) are case-sensitive per the | ||
| HTML specification. | ||
|
|
||
| Example: | ||
|
|
||
| .. code-block:: python | ||
| linkcheck_case_insensitive = True | ||
| .. versionadded:: 8.2 | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be my perception, but I would suggest that it may be easier to understand the config setting if we remove the inverse prefix (e.g. using sensitive instead of insensitive).
I also have a few ideas for the documentation text; please let me know what you think:
| .. confval:: linkcheck_case_insensitive | |
| :type: :code-py:`bool` | |
| :default: :code-py:`False` | |
| When :code-py:`True`, the *linkcheck* builder will compare URL paths | |
| case-insensitively when checking for redirects. | |
| This is useful for checking links on case-insensitive servers | |
| (for example, GitHub, Windows-based servers, or certain hosting platforms) | |
| that may return URLs with different case than the original link. | |
| When enabled, URL paths like ``/Path`` and ``/path`` are considered | |
| equivalent, preventing false-positive redirect warnings on | |
| case-insensitive servers. | |
| .. note:: | |
| This option only affects URL path comparison for redirect detection. | |
| HTML anchor checking remains case-sensitive to match browser behavior, | |
| where fragment identifiers (``#anchor``) are case-sensitive per the | |
| HTML specification. | |
| Example: | |
| .. code-block:: python | |
| linkcheck_case_insensitive = True | |
| .. versionadded:: 8.2 | |
| .. confval:: linkcheck_case_sensitive | |
| :type: :code-py:`bool` | |
| :default: :code-py:`True` | |
| This setting controls how the *linkcheck* builder decides | |
| whether a hyperlink's destination is the same as the URL | |
| written in the documentation. | |
| By default, *linkcheck* requires the destination URL to match the written URL case-sensitively. This means that a link to ``http://webserver.test/USERNAME`` in | |
| the documentation that the server redirects to ``http://webserver.test/username`` will be reported as ``redirected``. | |
| To allow a more lenient URL comparison, that will report the previous case as | |
| as ``working`` instead, configure this setting to ``False``. | |
| .. note:: | |
| HTML anchor checking is always case-sensitive, and is | |
| not affected by this setting. | |
| .. versionadded:: 8.2 | |
Edit: replace status code of successful with the actual value from the code, working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've applied your documentation changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again!
sphinx/builders/linkcheck.py
Outdated
| if self.case_insensitive: | ||
| normalised_url = normalised_url.casefold() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably a rare scenario, but strictly speaking I think we should avoid casefolding any fragment/anchor identifier part of the URL.
e.g.
- ❌
/USERNAME#Badges->/username#badges - ✅
/USERNAME#Badges->/username#Badges
(I only realised this after attempting to rewrite the config setting documentation in my own words -- in particular the part about anchor case-sensitivity)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I've fixed the _normalise_url() function to only casefold the URL before the fragment. Now /USERNAME#Badges correctly becomes /username#Badges (preserving the fragment case). The function now splits on # and only applies .casefold() to the URL part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK; the updated URL normalisation logic looks good to me; thank you.
sphinx/builders/linkcheck.py
Outdated
| app.add_event('linkcheck-process-uri') | ||
|
|
||
| # priority 900 to happen after ``check_confval_types()`` | ||
| app.connect('config-inited', handle_deprecated_linkcheck_case_config, priority=899) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to add any deprecation/migration-handling code for this config setting, because the linkcheck_case_insensitive config setting has not been included in a published release of Sphinx.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed the deprecation handler function and the old config value registration.
Co-authored-by: James Addison <[email protected]>
…sphinx into linkcheck_case_insensitive
| def do_HEAD(self): | ||
| if self.path == '/path': | ||
| # Redirect lowercase /path to uppercase /Path | ||
| self.send_response(301, 'Moved Permanently') | ||
| self.send_header('Location', '/Path') | ||
| self.send_header('Content-Length', '0') | ||
| self.end_headers() | ||
| elif self.path == '/Path': | ||
| self.send_response(200, 'OK') | ||
| self.send_header('Content-Length', '0') | ||
| self.end_headers() | ||
| else: | ||
| self.send_response(404, 'Not Found') | ||
| self.send_header('Content-Length', '0') | ||
| self.end_headers() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove the do_HEAD code block here, as the base HTTP server will invoke do_GET instead if it's missing.
NB: Let's also apply the Location response header simplification to do_GET when we do that, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I've removed the do_HEAD method entirely - the base HTTP server will invoke do_GET instead. I've also simplified the Location header in do_GET to just use /Path instead of the full URL with host.
| } in content | ||
|
|
||
|
|
||
| class CaseSensitiveHandler(BaseHTTPRequestHandler): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly nitpicky, but: the client and the test case are about case sensitivity -- but this handler is not really. What it does is to normalise/capitalise the path. So I'd suggest renaming it:
| class CaseSensitiveHandler(BaseHTTPRequestHandler): | |
| class CapitalisePathHandler(BaseHTTPRequestHandler): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed it from CaseSensitiveHandler to CapitalisePathHandler and updated the docstring to reflect what it actually does.
| lowercase_uri = f'http://{address}/path' | ||
| assert lowercase_uri in rowsby, f'Expected {lowercase_uri} to be checked' | ||
| assert rowsby[lowercase_uri]['status'] == 'redirected' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the containment check is good defensive coding, we don't perform similar checks in the other tests, so for consistency I would recommend:
| lowercase_uri = f'http://{address}/path' | |
| assert lowercase_uri in rowsby, f'Expected {lowercase_uri} to be checked' | |
| assert rowsby[lowercase_uri]['status'] == 'redirected' | |
| assert rowsby[f'http://{address}/path']['status'] == 'redirected' |
(I briefly considered suggesting renaming the variable to something like origin_url, outbound_url or documented_url -- but on balance I think we should simply omit it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've simplified both tests to directly assert on rowsby[f'http://{address}/path']['status'] without the intermediate variable or defensive containment check, matching the style used in other tests in the file
|
Ok, I think this is looking pretty good @FazeelUsmani - thanks for accommodating my feedback. The only significant remaining question I have is whether this should be controlled as a single boolean value, or whether it should be enabled for patterns of URLs -- similar to the way that Currently: I am leaning towards the idea that, ideally, documentation projects should only enable this for the domains/websites where they need the tolerance for case-normalising redirects. The reason I lean that way is that -- generally -- URLs are in fact case sensitive, and so I have some lingering doubts about making this a flag that is likely to affect cases where it's not worthwhile to casefold. PS: either way, if you could also update the title of this PR that would be great! |
|
Thanks for the feedback @jayaddison! That's a really good point about making this pattern-based rather than a global boolean. I can see how that would be safer and more precise - only applying case-insensitivity to domains that actually need it (like GitHub), similar to how So, the pattern-based approach would be something like: linkcheck_case_sensitive_patterns = {
r'https://github\.com/.*': False,
}Does this sounds good to you? |
|
That mostly sounds good - I would suggest some adjustments, though: either:
OR
I don't have a strong preference yet -- I'm continuing to think about it. The second option might make it slightly more apparent to readers that these Edit: fixup: the matching patterns are not necessarily domains. |
|
Thanks for clarifying, @jayaddison! I'll go with Option 2 - using a list with So the config would look like: linkcheck_case_insensitive = [
r'https://github\.com/.*',
]Does that sound good to you? I'll start refactoring to implement pattern matching once you confirm. |
|
@FazeelUsmani yep, that sounds good to me 👍 |
| freshenv=True, | ||
| confoverrides={'linkcheck_case_insensitive': [r'http://localhost:\d+/.*']}, | ||
| ) | ||
| def test_linkcheck_case_insensitive(app: SphinxTestApp) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: I think it would be possible to merge the two test cases into a single def test_linkcheck_case_sensitivity now that we can pattern-match different URL patterns.
(the code for the two of them is very similar currently, so perhaps the end result would be neater/smaller by combining them)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explored merging the tests, but found that keeping them separate is actually clearer and simpler. Merging them would require either:
- Having two different URLs in the test document with hardcoded ports (which don't match the dynamic test
server port) - More complex test logic to construct different URLs
Since you mentioned this was optional, I've kept the two separate tests as they clearly demonstrate:
- test_linkcheck_case_sensitive: Default behavior (no patterns configured)
- test_linkcheck_case_insensitive: Pattern-based behavior (with a specific pattern)
This makes the tests easier to understand and maintain. Please let me know if you still prefer them to be merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems reasonable to me, yep - thanks! One detail I'd like to ask about: with approach (2), was the more complex in the CapitalisePathHandler? (that's what I'd guess, but would like to double-check)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the complexity wasn't in the CapitalisePathHandler - that part would stay the same (it just redirects /path → /Path for any request).
The complexity would be in the test logic itself. To test both behaviors in one test, we'd need:
- Two different URLs in the test RST document (e.g.,
localhostand127.0.0.1) - The pattern configured to match only one of them (e.g., only
127.0.0.1) - Test assertions to verify different behavior for each URL
The issue is the RST document has hardcoded ports (e.g., http://localhost:7777/path), but the test server uses a dynamic port. So we'd need logic to construct the correct URLs with the dynamic port in the test assertions, which felt more complex than just having two focused tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok; yep, good observation about the dynamic port allocation for the test webserver vs the static port in the documentation.
And also point taken/agreed about the single /path -> /Path transform offered by the handler.
We could add more path transforms to the handler.. could that help? (I think it might do, but I haven't experimented with it in code yet)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, adding more path transforms to the handler could definitely help!
Here's what I'm thinking:
class CapitalisePathHandler(BaseHTTPRequestHandler):
"""Test server that capitalises URL paths via redirects."""
protocol_version = 'HTTP/1.1'
PATH_REDIRECTS = {
'/path1': '/Path1',
'/path2': '/Path2',
}
def do_GET(self):
if self.path in self.PATH_REDIRECTS:
self.send_response(301, 'Moved Permanently')
self.send_header('Location', self.PATH_REDIRECTS[self.path])
self.send_header('Content-Length', '0')
self.end_headers()
elif self.path in self.PATH_REDIRECTS.values():
content = b'ok\n\n'
self.send_response(200, 'OK')
self.send_header('Content-Length', str(len(content)))
self.end_headers()
self.wfile.write(content)
else:
self.send_response(404, 'Not Found')
self.send_header('Content-Length', '0')
self.end_headers()Then the test RST could have:
path1 <http://localhost:7777/path1>_
path2 <http://localhost:7777/path2>_
And we configure the pattern to match only path1:
confoverrides={'linkcheck_case_insensitive': [r'http://localhost:\d+/path1']}
This way we can assert:
- path1 → working (case-insensitive applies)
- path2 → redirected (case-sensitive applies)
The dynamic port issue is actually not a problem for assertions since we use the actual address variable there. The hardcoded port in the RST only matters for the pattern matching, which this approach handles cleanly.
Want me to implement this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, ok! Let's try that, but with a change: I'm not too keen on the hardcoded redirect paths, so perhaps we could instead test for self.path.islower(), and when it is, then redirect the client to self.path.capitalize(). I think that'd help to make the resulting code even more concise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! I've implemented your suggestion with the dynamic path detection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thank you - there may be a couple of cleanups possible:
- We have both
do_GETanddo_HEADimplemented again - I think we should apply the same simplification as previously and remove one of those. - There are still two test cases, where I think we could have a single
test_linkcheck_case_sensitivitythat handles both case sensitive and non-case-sensitive variants.
I've realized that my suggestion to use self.path.capitalize() isn't applicable as-is.. sorry about that - perhaps we could use self.path.upper() instead, though.
|
|
||
|
|
||
| @pytest.mark.sphinx('html', testroot='numfig') | ||
| @pytest.mark.test_params(shared_result='test_build_html_numfig') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make sure to restore this pytest marker (I think you did, but maybe it got removed again somehow).
|
Ah, and one more thing: please make sure to update I think it's looking good; we'll also need a core developer/maintainer to approve it, and I can't promise if/when that will be. |
jayaddison
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feature implementation looks good to me; thank you @FazeelUsmani!
|
Hi @jayaddison, |
You're welcome. I think the best response I can offer to answer that -- and I think you have followed most/all of the guidance there already -- is to refer to the Sphinx official contributing guide: https://www.sphinx-doc.org/en/master/internals/contributing.html#contribute-code |
This PR adds a new configuration option
linkcheck_ignore_caseto enable case-insensitive URL and anchor checking in the linkcheck builder.Problem
Some web servers (e.g., GitHub, certain hosting platforms) are case-insensitive and may return URLs with different casing than the original link. This causes the linkcheck builder to report false-positive redirects when the URLs differ only in case, even though they point to the same resource.
Solution
linkcheck_ignore_caseboolean configuration option (default:False)AnchorCheckParserto support case-insensitive matching when enableddoc/usage/configuration.rst