-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
docs: Add UTF-8 encoding guidelines and Windows development guide #799
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
docs: Add UTF-8 encoding guidelines and Windows development guide #799
Conversation
1. CONTRIBUTING.md: - Added concise file encoding section after Code Style - DO/DON'T examples for common file operations - Covers open(), Path methods, json operations - References PR AndyMik90#782 and windows-development.md 2. guides/windows-development.md (NEW): - Comprehensive Windows development guide - File encoding (cp1252 vs UTF-8 issue) - Line endings, path separators, shell commands - Development environment recommendations - Common pitfalls and solutions - Testing guidelines 3. .github/PULL_REQUEST_TEMPLATE.md: - Added encoding checklist item for Python PRs - Helps catch missing encoding during review 4. guides/README.md: - Added windows-development.md to guide index - Organized with CLI-USAGE and linux guides Purpose: Educate developers about UTF-8 encoding requirements to prevent regressions of the 251 encoding issues fixed in PR AndyMik90#782. Automated checking via pre-commit hooks (PR AndyMik90#795) + developer education ensures long-term Windows compatibility. Related: - PR AndyMik90#782: Fix Windows UTF-8 encoding errors (251 instances) - PR AndyMik90#795: Add pre-commit hooks for encoding enforcement
📝 WalkthroughWalkthroughAdds Windows-specific development guidance and enforces UTF-8 file-encoding practices in project docs and PR template, plus updates guides index to list platform-specific guides. Documentation changes only; no code or public API modifications. Changes
Sequence Diagram(s)(omitted — changes are documentation and templates, not multi-component runtime control flow) Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. 📜 Recent review detailsConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro 📒 Files selected for processing (2)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
🔇 Additional comments (10)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @TamerineSky, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the project's documentation by providing clear guidelines and a dedicated resource to address UTF-8 encoding requirements and Windows-specific development challenges. It aims to educate developers, prevent regressions related to character encoding, and streamline cross-platform compatibility. By integrating encoding best practices into Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @CONTRIBUTING.md:
- Around line 369-407: The markdown blocks under the "DO:", "DON'T:", and
"Binary files - NO encoding:" headings are missing blank lines around their
fenced code blocks (MD031); to fix, add a single blank line immediately before
each opening ```python fence and a single blank line immediately after each
closing ``` fence for those three sections so each code block is separated from
the surrounding text and complies with the linter.
In @guides/windows-development.md:
- Around line 12-14: Several markdownlint violations need fixing: add blank
lines before and after every fenced code block (including the UnicodeDecodeError
block) and insert appropriate fenced-code language identifiers (e.g.,
```plaintext or ```python) where missing; ensure there is a blank line before
headings like "Pitfall 1: JSON files", "Pitfall 2: Path methods", and "Pitfall
3: Subprocess output"; consolidate or rename repeated "The Problem" / "The
Solution" headings into unique headings or a single section with subsections to
resolve MD024; wrap long lines to <=80 characters to fix MD013; and add a blank
line before the list that triggers MD032—apply these edits throughout the
document to resolve the reported 39 style violations.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
.github/PULL_REQUEST_TEMPLATE.mdCONTRIBUTING.mdguides/README.mdguides/windows-development.md
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
CONTRIBUTING.md
370-370: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
391-391: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
404-404: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
guides/windows-development.md
12-12: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
12-12: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
27-27: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
33-33: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
38-38: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
44-44: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
45-45: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
55-55: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
56-56: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
64-64: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
65-65: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
82-82: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
87-87: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
87-87: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
94-94: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
108-108: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
119-119: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
127-127: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
146-146: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
156-156: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
165-165: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
200-200: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
216-216: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
237-237: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
250-250: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
255-255: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
261-261: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: CodeQL (javascript-typescript)
- GitHub Check: CodeQL (python)
🔇 Additional comments (4)
.github/PULL_REQUEST_TEMPLATE.md (1)
42-42: ✓ Clear and actionable checklist item.The addition is well-scoped to Python-only changes and properly formatted. Placing it alongside code principles aligns nicely with the broader encoding initiative.
guides/README.md (1)
10-11: ✓ Consistent table formatting and clear descriptions.The new guide entries follow the same format as the existing entry and provide helpful summaries of what each guide covers.
guides/windows-development.md (1)
1-303: ✓ Comprehensive and technically sound content.The guide provides excellent practical guidance for Windows developers:
- Clear problem/solution structure for each topic
- Practical code examples with "Wrong" vs "Correct" patterns
- Well-organized sections covering the key Windows pain points (encoding, line endings, paths, shells)
- Helpful cross-references to CONTRIBUTING.md and related PRs
- Good Windows-specific test guidance with pytest markers (lines 274-281)
- Appropriate recommendation hierarchy (WSL2 preferred, then Git Bash, then PowerShell)
Once the Markdown style violations are resolved, this will be a valuable resource for the team.
CONTRIBUTING.md (1)
360-410: Clear, actionable encoding guidance for Python developers.The new File Encoding section is well-structured with concrete DO/DON'T examples covering the most common file operations (open, pathlib, JSON). The emphasis on Windows defaults and explicit references to related PRs and the Windows development guide provide good context for contributors. The guidance correctly reflects best practices for cross-platform Python development.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds comprehensive documentation regarding UTF-8 encoding and a development guide for Windows users. The changes are well-structured and the documentation is clear and helpful. I've provided a couple of minor suggestions to further improve the examples and code snippets in the new documentation.
CONTRIBUTING.md
Outdated
| # JSON files | ||
| import json | ||
| with open(path, encoding="utf-8") as f: | ||
| data = json.load(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great example for reading JSON. To make it even more helpful, consider adding an example for writing JSON files as well. This would provide a complete picture, especially since you correctly point out the wrong way to use json.dump in the 'DON'T' section.
A correct example would be:
# Writing JSON
with open(path, "w", encoding="utf-8") as f:
json.dump(data, f, ensure_ascii=False, indent=2)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ Fixed in commit 95f11a8
Great suggestion! I've added a JSON write example to complement the read example:
# Writing JSON
with open(path, "w", encoding="utf-8") as f:
json.dump(data, f, ensure_ascii=False, indent=2)This provides a complete picture of the correct approach for both reading and writing JSON files with UTF-8 encoding. The ensure_ascii=False parameter is particularly important to preserve non-ASCII characters in the output.
guides/windows-development.md
Outdated
| 3. **In code, normalize when processing:** | ||
| ```python | ||
| # Normalize line endings to LF | ||
| content = content.replace("\r\n", "\n").replace("\r", "\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this works, a more robust and idiomatic way to normalize line endings in Python is to use splitlines() and join(). This handles all common line endings (\n, \r, \r\n) correctly and can be slightly cleaner.
| content = content.replace("\r\n", "\n").replace("\r", "\n") | |
| content = "\n".join(content.splitlines()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ Fixed in commit 95f11a8
Excellent suggestion! I've updated the line ending normalization example to use the more idiomatic splitlines() and join() approach:
# Normalize line endings to LF (idiomatic approach)
content = "\n".join(content.splitlines())This is indeed more robust as it correctly handles all common line endings (\n, \r, \r\n) and is cleaner than other approaches. Thank you for the improvement!
1. Fix CONTRIBUTING.md markdown linting issues - Add blank lines around code blocks (MD031) - Add JSON write example with ensure_ascii=False (Gemini suggestion) 2. Fix guides/windows-development.md markdown linting (39 violations) - Rename duplicate headings: "The Problem"/"The Solution" → "Problem"/"Solution" (MD024) - Add blank lines around all code blocks (MD031) - Add language specifiers to code blocks (MD040) - Add blank lines before/after headings (MD022) - Wrap long lines to <=80 characters (MD013) - Add blank line before list (MD032) - Use Gemini's idiomatic line ending normalization pattern 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Problem
PR #782 fixed 251 instances of missing UTF-8 encoding across 87 files. Without clear documentation, developers may not understand:
This creates risk of regression despite automated pre-commit hooks (PR #795).
Solution
Comprehensive but concise documentation educating developers about UTF-8 encoding requirements and Windows compatibility.
Changes
1. CONTRIBUTING.md - File Encoding Section ✨
Added concise encoding section in Code Style:
open(),Path.read_text(),Path.write_text(),jsonoperationsWhy here: Integrated with existing code style guidelines, visible to all contributors
2. guides/windows-development.md - New Guide ✨
Comprehensive Windows development guide covering:
Why needed: Windows has unique development challenges that deserve dedicated documentation
3. PR Template - Encoding Checklist ✨
Added checklist item:
[ ] (Python only) All file operations specify encoding="utf-8" for text filesWhy needed: Reminds reviewers to check for encoding, catches issues during PR review
4. guides/README.md - Updated Index ✨
Added windows-development.md to the guides index alongside CLI-USAGE.md and linux.md
Documentation Structure
Benefits
Relationship to Other PRs
Together these create a comprehensive solution:
Verification
guides/)Notes
guides/directory (notdocs/) to match project structureSummary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.