@TamerineSky TamerineSky commented Jan 7, 2026

Problem

PR #782 fixed 251 instances of missing UTF-8 encoding across 87 files. Without clear documentation, developers may not understand:

  • Why UTF-8 encoding is required
  • How to correctly specify encoding
  • When encoding should and shouldn't be used

This creates a risk of regression even with the automated pre-commit hooks (PR #795) in place.
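
To make the failure mode concrete, here is a minimal sketch of what can happen when UTF-8 bytes are decoded with cp1252, which is often the default on Windows when no encoding is given (the actual default depends on the locale and Python version). The helper name `decode_with` is hypothetical and exists only for this illustration.

```python
def decode_with(data: bytes, encoding: str) -> str:
    """Decode bytes with the given codec, reporting failures as text."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError: {exc.reason}"


utf8_bytes = "café".encode("utf-8")

# Correct: decode with the encoding the bytes were written in.
print(decode_with(utf8_bytes, "utf-8"))   # café

# Wrong codec, silent corruption (mojibake) rather than an error.
print(decode_with(utf8_bytes, "cp1252"))  # cafÃ©

# Wrong codec, hard failure: "Á" encodes to bytes C3 81, and byte 0x81
# is undefined in Python's cp1252 codec.
print(decode_with("Á".encode("utf-8"), "cp1252"))
```

The silent case is arguably the more dangerous one, since nothing fails until the corrupted text is noticed downstream.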

Solution

Add comprehensive but concise documentation that educates developers about UTF-8 encoding requirements and Windows compatibility.

Changes

1. CONTRIBUTING.md - File Encoding Section ✨

Added a concise File Encoding section to the Code Style guidelines.

Why here: Integrated with existing code style guidelines, visible to all contributors
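
The section's own examples are not reproduced on this page. Based on the summaries further down (DO/DON'T patterns for `open()`, `Path` methods, and JSON), a rough sketch of the intended patterns might look like the following. The file names are hypothetical, and a temporary directory is used so the snippet runs anywhere.

```python
import json
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())  # scratch directory for the demo
notes = workdir / "notes.txt"       # hypothetical file names
config = workdir / "config.json"

# DO: pass encoding="utf-8" to open() for text files.
with open(notes, "w", encoding="utf-8") as f:
    f.write("naïve café ✓\n")

# DO: pass it to the Path helpers as well.
text = notes.read_text(encoding="utf-8")

# DO: JSON goes through a text file handle, so the same rule applies.
with open(config, "w", encoding="utf-8") as f:
    json.dump({"title": text.strip()}, f, ensure_ascii=False, indent=2)
with open(config, encoding="utf-8") as f:
    loaded = json.load(f)

# Binary files: use "rb"/"wb" (or read_bytes) and do NOT pass encoding.
raw = notes.read_bytes()

# DON'T: open(notes) or notes.read_text() with no encoding; the result
# then depends on the platform's default encoding.
```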

2. guides/windows-development.md - New Guide ✨

Comprehensive Windows development guide covering:

  • File encoding - cp1252 vs UTF-8 issue, common pitfalls
  • Line endings - CRLF vs LF handling
  • Path separators - Backslash vs forward slash
  • Shell commands - Cross-platform compatibility
  • Development environment - WSL2, Git Bash, PowerShell
  • Common issues - Solutions for Windows-specific problems

Why needed: Windows has unique development challenges that deserve dedicated documentation

3. PR Template - Encoding Checklist ✨

Added checklist item:

  • [ ] (Python only) All file operations specify encoding="utf-8" for text files

Why needed: Reminds reviewers to check for encoding, catches issues during PR review

4. guides/README.md - Updated Index ✨

Added windows-development.md to the guides index alongside CLI-USAGE.md and linux.md

Documentation Structure

guides/
├── README.md                    # Guide index
├── CLI-USAGE.md                 # CLI usage guide
├── linux.md                     # Linux-specific guide
└── windows-development.md       # Windows-specific guide ✨ NEW

CONTRIBUTING.md                  # Includes encoding best practices ✨ UPDATED
.github/PULL_REQUEST_TEMPLATE.md # Includes encoding checklist ✨ UPDATED

Benefits

  1. Developer Education - Clear explanation of why encoding matters
  2. Quick Reference - Easy-to-find examples in CONTRIBUTING.md
  3. Prevent Regressions - Developers understand requirements
  4. Review Support - PR template reminds reviewers to check
  5. Windows Support - Dedicated guide for Windows developers

Relationship to Other PRs

Together these create a comprehensive solution:

  1. Technical fix (PR #782: Fix Windows UTF-8 encoding errors across entire backend (251 instances))
  2. Automation (PR #795: Add pre-commit hook for UTF-8 encoding enforcement)
  3. Education (this PR)
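
The hook added in PR #795 is not shown on this page, so the following is not its implementation. It is only a hypothetical sketch of how such a check could work: parse a Python source string with `ast` and report `open(...)` calls that are in text mode but pass no `encoding=` keyword. The function name `find_open_without_encoding` is made up for this example, and the check deliberately ignores `Path.read_text()` and other APIs.

```python
import ast


def find_open_without_encoding(source: str) -> list[int]:
    """Return line numbers of text-mode open() calls lacking encoding=."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if not (isinstance(func, ast.Name) and func.id == "open"):
            continue
        if any(kw.arg == "encoding" for kw in node.keywords):
            continue
        # The mode is the second positional argument or the mode= keyword.
        mode = node.args[1] if len(node.args) >= 2 else None
        for kw in node.keywords:
            if kw.arg == "mode":
                mode = kw.value
        is_binary = (
            isinstance(mode, ast.Constant)
            and isinstance(mode.value, str)
            and "b" in mode.value
        )
        if not is_binary:
            hits.append(node.lineno)
    return sorted(hits)


sample = (
    'open("a.txt")\n'
    'open("b.txt", "w", encoding="utf-8")\n'
    'open("c.bin", "rb")\n'
)
print(find_open_without_encoding(sample))  # [1]
```

A static check like this catches the common case cheaply, but it cannot see modes computed at runtime, which is one reason documentation and review still matter.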

Summary by CodeRabbit

  • Documentation
    • Added comprehensive Windows development guide with setup, cross-platform best practices, and troubleshooting.
    • Added Linux installation and build guide.
    • Updated contribution guidelines to require specifying UTF‑8 file encoding for Python text file operations.
    • Updated PR template to include a checklist item for Python file-encoding checks.

1. CONTRIBUTING.md:
   - Added concise file encoding section after Code Style
   - DO/DON'T examples for common file operations
   - Covers open(), Path methods, json operations
   - References PR AndyMik90#782 and windows-development.md

2. guides/windows-development.md (NEW):
   - Comprehensive Windows development guide
   - File encoding (cp1252 vs UTF-8 issue)
   - Line endings, path separators, shell commands
   - Development environment recommendations
   - Common pitfalls and solutions
   - Testing guidelines

3. .github/PULL_REQUEST_TEMPLATE.md:
   - Added encoding checklist item for Python PRs
   - Helps catch missing encoding during review

4. guides/README.md:
   - Added windows-development.md to guide index
   - Organized with CLI-USAGE and linux guides

Purpose: Educate developers about UTF-8 encoding requirements to prevent
regressions of the 251 encoding issues fixed in PR AndyMik90#782. Automated checking
via pre-commit hooks (PR AndyMik90#795) + developer education ensures long-term
Windows compatibility.

Related:
- PR AndyMik90#782: Fix Windows UTF-8 encoding errors (251 instances)
- PR AndyMik90#795: Add pre-commit hooks for encoding enforcement

coderabbitai bot commented Jan 7, 2026

Walkthrough

Adds Windows-specific development guidance and enforces UTF-8 file-encoding practices in project docs and PR template, plus updates guides index to list platform-specific guides. Documentation changes only; no code or public API modifications.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| PR template & contribution guidelines: .github/PULL_REQUEST_TEMPLATE.md, CONTRIBUTING.md | Inserted a checklist item and a new "File Encoding (Python)" section requiring explicit encoding="utf-8" for text file operations; includes examples for text and JSON I/O. The CONTRIBUTING.md block was duplicated (appears twice). |
| Platform guides & index: guides/README.md, guides/windows-development.md, guides/linux.md | Added Windows and Linux guide entries to guides index. Added guides/windows-development.md with Windows-focused development guidance covering UTF-8 enforcement, line endings, path handling, subprocess usage, environment recommendations (WSL2, Git Bash, PowerShell), common Windows pitfalls, and testing guidance. |

Sequence Diagram(s)

(omitted — changes are documentation and templates, not multi-component runtime control flow)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

size/M

Suggested reviewers

  • AndyMik90

Poem

🐰 I nibble docs and hop with glee,
UTF‑8 now travels free,
Windows, Linux, side by side,
Cross‑platform carrots for every stride 🥕✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and accurately summarizes the main changes: adding UTF-8 encoding guidelines and a Windows development guide, matching the core content across CONTRIBUTING.md, windows-development.md, and related documentation updates. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c9ce4b and 95f11a8.

📒 Files selected for processing (2)
  • CONTRIBUTING.md
  • guides/windows-development.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: CodeQL (python)
  • GitHub Check: CodeQL (javascript-typescript)
🔇 Additional comments (10)
CONTRIBUTING.md (1)

360-416: ✅ File Encoding section is well-implemented.

The new section effectively addresses past feedback: blank lines are properly placed around code blocks (fixing MD031 violations), the JSON write example now includes the suggested ensure_ascii=False with indent=2, and the DO/DON'T/Binary patterns provide clear guidance. The section integrates well with the guide structure and references.

guides/windows-development.md (9)

6-82: ✅ Encoding section is comprehensive and well-structured.

The section clearly explains the Windows cp1252 vs UTF-8 issue, provides actionable solutions with multiple examples, and includes practical testing guidance. The three pitfalls section effectively highlights common mistakes with correct patterns. The cross-reference to CONTRIBUTING.md with proper anchor linking integrates the guides well.


84-112: ✅ Line endings section implements suggested best practices.

The section correctly uses the idiomatic splitlines() approach (addressing the previous gemini suggestion) and provides a layered solution: git configuration, .gitattributes explanation, and code-level normalization. The explanation of the CRLF/LF difference and git diff impact is clear and practical.
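
Besides normalizing text that is already in memory, another code-level option is to control newline translation when writing. This is an assumption about what could complement the guide, not something the guide is known to contain. The sketch uses the `newline` parameter of the built-in `open()`; the file name is hypothetical.

```python
import tempfile
from pathlib import Path

path = Path(tempfile.mkdtemp()) / "out.txt"  # hypothetical file name

# newline="\n" disables translation of "\n" to os.linesep on write,
# so the file gets LF endings on every platform.
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("first\nsecond\n")

data = path.read_bytes()
print(data)  # b'first\nsecond\n'
```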


114-154: ✅ Path handling guidance is practical and comprehensive.

The section clearly demonstrates both pathlib (recommended) and os.path.join() approaches with concrete examples showing common mistakes. The emphasis on avoiding hardcoded separators and the clear "Wrong" vs "Correct" labeling makes the guidance actionable.
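
As an illustration of the two approaches named above (not the guide's own example), the sketch below builds the same path with `pathlib` and `os.path.join()`. The `PureWindowsPath` and `PurePosixPath` classes are used only to show how each platform would render the result, independent of the machine running the code. The path components are hypothetical.

```python
import os
from pathlib import Path, PurePosixPath, PureWindowsPath

# Build paths from parts; pathlib inserts the right separator.
settings = Path("config") / "app" / "settings.json"

# os.path.join() is the older equivalent.
joined = os.path.join("config", "app", "settings.json")

# The Pure* classes render the same parts for a specific platform.
win = PureWindowsPath("config", "app", "settings.json")
posix = PurePosixPath("config", "app", "settings.json")
print(win)             # config\app\settings.json
print(posix)           # config/app/settings.json
print(win.as_posix())  # config/app/settings.json
```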


156-197: ✅ Shell commands section provides clear cross-platform strategies.

The layered approach (prefer Python libraries, then shlex+subprocess, then platform detection) is practical and well-explained. Examples demonstrate real use cases (file operations, git commands) with proper encoding specification in the subprocess call, maintaining consistency with the encoding emphasis throughout the guide.
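
A minimal sketch of the `shlex` + `subprocess` layer with explicit encoding follows. It is an assumption about the shape of the guide's examples rather than a copy of them. The child process here is the current Python interpreter so the snippet does not depend on `git` being installed, and `PYTHONIOENCODING` is set so the child writes UTF-8 on any platform.

```python
import os
import shlex
import subprocess
import sys

# Split a command string into an argument list instead of using shell=True.
# (shlex follows POSIX quoting rules; for simple commands that is enough.)
args = shlex.split("git log --oneline -n 5")
print(args)  # ['git', 'log', '--oneline', '-n', '5']

# Run a child Python process and decode its output as UTF-8 explicitly.
env = {**os.environ, "PYTHONIOENCODING": "utf-8"}
result = subprocess.run(
    [sys.executable, "-c", "print('h\\u00e9llo')"],
    capture_output=True,
    encoding="utf-8",  # implies text mode for stdout/stderr
    env=env,
    check=True,
)
print(result.stdout.strip())  # héllo
```

Without `encoding=`, `subprocess.run(..., text=True)` decodes output with the locale's default encoding, which is the same pitfall as a bare `open()`.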


199-231: ✅ Development environment guidance is balanced and practical.

The section presents three viable options with clear trade-offs (WSL2 as recommended for production parity, Git Bash for lighter weight, PowerShell for native environment). VS Code settings correctly enforce UTF-8 encoding and LF line endings, aligning with the guide's emphasis. Recommendations are pragmatic.


233-271: ✅ Common issues section addresses real Windows pain points.

The three issues (permissions, long paths, case sensitivity) are genuine Windows friction points with practical solutions. Each solution is actionable: context managers for file handling, Windows 10+ long path settings plus alternatives, and consistent casing examples. The content demonstrates real understanding of Windows development challenges.
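
For the permissions issue, the context-manager advice can be sketched as below. On Windows, deleting or renaming a file that still has an open handle typically fails with `PermissionError`, so closing the handle before the delete matters more there than on Linux or macOS. The file name is hypothetical.

```python
import tempfile
from pathlib import Path

path = Path(tempfile.mkdtemp()) / "scratch.txt"  # hypothetical file name

# The with-block closes the handle as soon as the block exits.
with open(path, "w", encoding="utf-8") as f:
    f.write("temporary data\n")

# No open handle remains, so the delete succeeds on Windows as well.
path.unlink()
print(path.exists())  # False
```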


273-312: ✅ Testing section provides clear guidance and executable examples.

The pre-commit and test commands are clear and actionable. The Windows-specific test example correctly uses pytest.mark.skipif with proper platform detection, includes non-ASCII test data (emoji, international characters), and demonstrates proper encoding usage. The guidance effectively emphasizes the importance of testing character handling on Windows.
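
The guide's test is not reproduced here. The sketch below shows the same idea with the standard library's `unittest.skipUnless` standing in for `pytest.mark.skipif`, so that it runs without third-party packages; the helper `roundtrip` and the class name are hypothetical.

```python
import sys
import tempfile
import unittest
from pathlib import Path


def roundtrip(text: str) -> str:
    """Write text to a temp file as UTF-8 and read it back."""
    path = Path(tempfile.mkdtemp()) / "sample.txt"
    path.write_text(text, encoding="utf-8")
    return path.read_text(encoding="utf-8")


class EncodingRoundTripTest(unittest.TestCase):
    SAMPLE = "emoji 🚀, accents éü, CJK 日本語"

    def test_roundtrip_everywhere(self):
        self.assertEqual(roundtrip(self.SAMPLE), self.SAMPLE)

    @unittest.skipUnless(sys.platform == "win32", "Windows-only check")
    def test_roundtrip_on_windows(self):
        # Same assertion, but only run where the default encoding
        # is likely not UTF-8.
        self.assertEqual(roundtrip(self.SAMPLE), self.SAMPLE)
```

With pytest, the equivalent marker would be `@pytest.mark.skipif(sys.platform != "win32", reason="Windows-only check")`.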


314-328: ✅ Help and resources sections provide good escalation path and references.

The Getting Help section appropriately escalates from documentation through issues/discussions with a tagged issue option. The Resources section includes relevant external documentation. The links appear well-chosen for the guide's scope.

Verify that the WSL2 documentation link (line 328) reflects the current Microsoft URL structure, as documentation URLs sometimes change.


330-336: ✅ Related section effectively integrates complementary PR initiatives.

The section appropriately references the educational (this guide), technical fix (PR #782), and automated enforcement (PR #795) components of the encoding initiative. Cross-references are properly formatted with relative and absolute paths as appropriate.


@gemini-code-assist

Summary of Changes

Hello @TamerineSky, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the project's documentation by providing clear guidelines and a dedicated resource to address UTF-8 encoding requirements and Windows-specific development challenges. It aims to educate developers, prevent regressions related to character encoding, and streamline cross-platform compatibility. By integrating encoding best practices into CONTRIBUTING.md, adding a new comprehensive Windows guide, and incorporating an encoding checklist into the PR template, this PR creates a robust educational and preventative framework, building upon previous technical fixes and automated enforcement efforts.

Highlights

  • New File Encoding Guidelines: A new section has been added to CONTRIBUTING.md providing clear guidelines and examples for specifying encoding="utf-8" in Python file operations, addressing common pitfalls with open(), Path methods, and JSON handling.
  • Dedicated Windows Development Guide: A comprehensive new guide, guides/windows-development.md, has been introduced to cover Windows-specific development challenges, including file encoding, line endings, path separators, shell commands, and recommended development environments like WSL2.
  • PR Template Encoding Checklist: A new checklist item has been added to the pull request template, reminding contributors and reviewers to ensure all Python file operations specify encoding="utf-8" for text files.
  • Updated Documentation Index: The guides/README.md file has been updated to include the newly added windows-development.md in the main documentation index.
coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @CONTRIBUTING.md:
- Around line 369-407: The markdown blocks under the "DO:", "DON'T:", and
"Binary files - NO encoding:" headings are missing blank lines around their
fenced code blocks (MD031); to fix, add a single blank line immediately before
each opening ```python fence and a single blank line immediately after each
closing ``` fence for those three sections so each code block is separated from
the surrounding text and complies with the linter.

In @guides/windows-development.md:
- Around line 12-14: Several markdownlint violations need fixing: add blank
lines before and after every fenced code block (including the UnicodeDecodeError
block) and insert appropriate fenced-code language identifiers (e.g.,
```plaintext or ```python) where missing; ensure there is a blank line before
headings like "Pitfall 1: JSON files", "Pitfall 2: Path methods", and "Pitfall
3: Subprocess output"; consolidate or rename repeated "The Problem" / "The
Solution" headings into unique headings or a single section with subsections to
resolve MD024; wrap long lines to <=80 characters to fix MD013; and add a blank
line before the list that triggers MD032—apply these edits throughout the
document to resolve the reported 39 style violations.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a47354b and 4c9ce4b.

📒 Files selected for processing (4)
  • .github/PULL_REQUEST_TEMPLATE.md
  • CONTRIBUTING.md
  • guides/README.md
  • guides/windows-development.md
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
CONTRIBUTING.md

  • 370-370: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 391-391: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 404-404: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)

guides/windows-development.md

  • 12-12: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 12-12: Fenced code blocks should have a language specified (MD040, fenced-code-language)
  • 27-27: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 33-33: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 38-38: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 44-44: Headings should be surrounded by blank lines. Expected: 1; Actual: 0; Below (MD022, blanks-around-headings)
  • 45-45: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 55-55: Headings should be surrounded by blank lines. Expected: 1; Actual: 0; Below (MD022, blanks-around-headings)
  • 56-56: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 64-64: Headings should be surrounded by blank lines. Expected: 1; Actual: 0; Below (MD022, blanks-around-headings)
  • 65-65: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 82-82: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 87-87: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 87-87: Fenced code blocks should have a language specified (MD040, fenced-code-language)
  • 94-94: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 108-108: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 119-119: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 127-127: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 146-146: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 156-156: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 165-165: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 200-200: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 216-216: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 237-237: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 250-250: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 255-255: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
  • 261-261: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: CodeQL (javascript-typescript)
  • GitHub Check: CodeQL (python)
🔇 Additional comments (4)
.github/PULL_REQUEST_TEMPLATE.md (1)

42-42: ✓ Clear and actionable checklist item.

The addition is well-scoped to Python-only changes and properly formatted. Placing it alongside code principles aligns nicely with the broader encoding initiative.

guides/README.md (1)

10-11: ✓ Consistent table formatting and clear descriptions.

The new guide entries follow the same format as the existing entry and provide helpful summaries of what each guide covers.

guides/windows-development.md (1)

1-303: ✓ Comprehensive and technically sound content.

The guide provides excellent practical guidance for Windows developers:

  • Clear problem/solution structure for each topic
  • Practical code examples with "Wrong" vs "Correct" patterns
  • Well-organized sections covering the key Windows pain points (encoding, line endings, paths, shells)
  • Helpful cross-references to CONTRIBUTING.md and related PRs
  • Good Windows-specific test guidance with pytest markers (lines 274-281)
  • Appropriate recommendation hierarchy (WSL2 preferred, then Git Bash, then PowerShell)

Once the Markdown style violations are resolved, this will be a valuable resource for the team.

CONTRIBUTING.md (1)

360-410: Clear, actionable encoding guidance for Python developers.

The new File Encoding section is well-structured with concrete DO/DON'T examples covering the most common file operations (open, pathlib, JSON). The emphasis on Windows defaults and explicit references to related PRs and the Windows development guide provide good context for contributors. The guidance correctly reflects best practices for cross-platform Python development.

gemini-code-assist bot left a comment

Code Review

This pull request adds comprehensive documentation regarding UTF-8 encoding and a development guide for Windows users. The changes are well-structured and the documentation is clear and helpful. I've provided a couple of minor suggestions to further improve the examples and code snippets in the new documentation.

CONTRIBUTING.md Outdated
Comment on lines 384 to 387
# JSON files
import json
with open(path, encoding="utf-8") as f:
    data = json.load(f)

medium

This is a great example for reading JSON. To make it even more helpful, consider adding an example for writing JSON files as well. This would provide a complete picture, especially since you correctly point out the wrong way to use json.dump in the 'DON'T' section.

A correct example would be:

# Writing JSON
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

Author

Fixed in commit 95f11a8

Great suggestion! I've added a JSON write example to complement the read example:

# Writing JSON
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

This provides a complete picture of the correct approach for both reading and writing JSON files with UTF-8 encoding. The ensure_ascii=False parameter is particularly important to preserve non-ASCII characters in the output.
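
A small sketch of what `ensure_ascii` actually changes may help here. With the default (`ensure_ascii=True`), non-ASCII characters are written as `\uXXXX` escapes, which still load back to the same data; `ensure_ascii=False` writes the characters themselves, which keeps the file human-readable and is the case where opening the file with `encoding="utf-8"` becomes essential. The sample data is hypothetical.

```python
import json

data = {"name": "café"}

escaped = json.dumps(data)                       # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)  # keep characters as-is

print(escaped)   # {"name": "caf\u00e9"}
print(readable)  # {"name": "café"}

# Both forms decode back to the same object.
print(json.loads(escaped) == json.loads(readable) == data)  # True
```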

3. **In code, normalize when processing:**
```python
# Normalize line endings to LF
content = content.replace("\r\n", "\n").replace("\r", "\n")
```

medium

While this works, a more robust and idiomatic way to normalize line endings in Python is to use splitlines() and join(). This handles all common line endings (\n, \r, \r\n) correctly and can be slightly cleaner.

Suggested change
- content = content.replace("\r\n", "\n").replace("\r", "\n")
+ content = "\n".join(content.splitlines())

Author

Fixed in commit 95f11a8

Excellent suggestion! I've updated the line ending normalization example to use the more idiomatic splitlines() and join() approach:

# Normalize line endings to LF (idiomatic approach)
content = "\n".join(content.splitlines())

This is indeed more robust as it correctly handles all common line endings (\n, \r, \r\n) and is cleaner than other approaches. Thank you for the improvement!
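
The two approaches discussed in this thread are not exactly equivalent, which may matter when choosing between them. The sketch below compares them side by side; the function names are hypothetical. `str.splitlines()` drops a trailing newline when the lines are re-joined, and it also splits on additional boundaries such as form feed (`\x0c`) and the Unicode line separator (`\u2028`).

```python
def normalize_replace(text: str) -> str:
    """Convert CRLF and lone CR to LF; leave everything else alone."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def normalize_splitlines(text: str) -> str:
    """Split on every boundary splitlines() recognizes, re-join with LF."""
    return "\n".join(text.splitlines())


mixed = "a\r\nb\rc\n"
print(repr(normalize_replace(mixed)))     # 'a\nb\nc\n'
print(repr(normalize_splitlines(mixed)))  # 'a\nb\nc' (trailing newline dropped)

# splitlines() also treats characters such as form feed as line breaks.
print(repr(normalize_replace("x\x0cy")))     # 'x\x0cy'
print(repr(normalize_splitlines("x\x0cy")))  # 'x\ny'
```

If the trailing newline or embedded control characters need to survive, the `replace()` chain is the more conservative choice.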

1. Fix CONTRIBUTING.md markdown linting issues
   - Add blank lines around code blocks (MD031)
   - Add JSON write example with ensure_ascii=False (Gemini suggestion)

2. Fix guides/windows-development.md markdown linting (39 violations)
   - Rename duplicate headings: "The Problem"/"The Solution" → "Problem"/"Solution" (MD024)
   - Add blank lines around all code blocks (MD031)
   - Add language specifiers to code blocks (MD040)
   - Add blank lines before/after headings (MD022)
   - Wrap long lines to <=80 characters (MD013)
   - Add blank line before list (MD032)
   - Use Gemini's idiomatic line ending normalization pattern

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>