Improve whitespace detection logic to handle special characters (such as \u200B) #126
base: master
Pull Request Overview
This PR enhances the isBlankChar method in the Utils class to detect a broader range of format control characters and adds comprehensive test coverage for the method.
Key Changes:
- Generalizes the blank character detection from checking a single specific character (U+202A) to checking all FORMAT category characters using Character.getType(c) == Character.FORMAT
- Adds new test file UtilsTest.java with comprehensive test cases covering various whitespace and format control characters, including Unicode characters commonly found in Word documents
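As a rough illustration of the kind of coverage described above, here is a minimal table-driven sketch in plain Java. It is not the PR's UtilsTest.java: the class name BlankCharCoverageSketch is hypothetical, the isBlankChar body is copied from the diff in this PR, and the sample characters are an assumed selection rather than the PR's actual test data.

```java
public class BlankCharCoverageSketch {

    // Local copy of the generalized check from this PR's diff (assumption: same logic as Utils.isBlankChar).
    public static boolean isBlankChar(int c) {
        return Character.isWhitespace(c)
                || Character.isSpaceChar(c)
                || c == 0xFEFF
                || Character.getType(c) == Character.FORMAT;
    }

    public static void main(String[] args) {
        // Code points expected to count as blank: space, tab, full-width space,
        // non-breaking space, zero-width space, BOM, LEFT-TO-RIGHT EMBEDDING.
        int[] blank = {' ', '\t', 0x3000, 0x00A0, 0x200B, 0xFEFF, 0x202A};
        // Ordinary visible characters that must not count as blank.
        int[] notBlank = {'a', '0', 0x4E2D};

        for (int c : blank) {
            System.out.printf("U+%04X blank=%b%n", c, isBlankChar(c));
        }
        for (int c : notBlank) {
            System.out.printf("U+%04X blank=%b%n", c, isBlankChar(c));
        }
    }
}
```

A real test class would express the same table as assertions in the project's test framework.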
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| common/src/main/java/io/mybatis/common/util/Utils.java | Replaces hardcoded check for U+202A with a generalized check for all FORMAT characters to improve detection of format control characters |
| common/src/test/java/io/mybatis/common/util/UtilsTest.java | Adds new test class with comprehensive test coverage for isBlankChar method, including common whitespace characters and special Unicode format characters |
    @@ -0,0 +1,36 @@
    +package io.mybatis.common.util;
Copilot
AI
Nov 10, 2025
This file is missing the Apache License header that is present in other test files in this project (e.g., I18nTest.java). For consistency, please add the standard license header at the beginning of this file.
         || Character.isSpaceChar(c)
         || c == '\ufeff'
    -    || c == '\u202a';
    +    || Character.getType(c) == Character.FORMAT;
Copilot
AI
Nov 10, 2025
The change from checking a specific character (c == '\u202a') to checking all FORMAT characters (Character.getType(c) == Character.FORMAT) significantly broadens the scope. The Character.FORMAT category includes directional formatting marks (U+202A-U+202E, U+061C) and other format control characters that control text rendering but don't represent whitespace or blank space.
According to the Javadoc comment above this method (lines 83-84), blank characters should include "空格、制表符、全角空格和不间断空格" (space, tab, full-width space, and non-breaking space). Directional formatting marks like U+202A (LEFT-TO-RIGHT EMBEDDING) don't semantically represent blank space - they control text direction.
Consider whether all FORMAT characters should truly be treated as "blank," or if only specific zero-width space characters (like U+200B) should be added to the existing check. If the intent is to detect all invisible characters, the documentation should be updated to reflect this broader definition.
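To make the trade-off concrete, the sketch below compares the broad FORMAT check with a narrower variant that only adds U+200B to the explicit list. Both method names and the class name are hypothetical and exist only for this comparison; neither is claimed to be the project's code. The sample characters (soft hyphen U+00AD, zero-width joiner U+200D, Arabic letter mark U+061C) are FORMAT characters that the broad check would also treat as blank.

```java
public class NarrowVsBroadSketch {

    // Hypothetical narrow variant: the pre-existing checks plus an explicit zero-width space.
    public static boolean isBlankCharNarrow(int c) {
        return Character.isWhitespace(c)
                || Character.isSpaceChar(c)
                || c == 0xFEFF
                || c == 0x202A
                || c == 0x200B;
    }

    // Broad variant as proposed in this PR: every FORMAT (Cf) character counts as blank.
    public static boolean isBlankCharBroad(int c) {
        return Character.isWhitespace(c)
                || Character.isSpaceChar(c)
                || c == 0xFEFF
                || Character.getType(c) == Character.FORMAT;
    }

    public static void main(String[] args) {
        // Zero-width space, soft hyphen, zero-width joiner, Arabic letter mark.
        int[] samples = {0x200B, 0x00AD, 0x200D, 0x061C};
        for (int c : samples) {
            System.out.printf("U+%04X narrow=%b broad=%b%n",
                    c, isBlankCharNarrow(c), isBlankCharBroad(c));
        }
    }
}
```

The two variants agree on U+200B and differ on the other three samples, which is the behavioral change the Javadoc would need to describe if the broad check is kept.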
    public static boolean isBlankChar(int c) {
      return Character.isWhitespace(c)
          || Character.isSpaceChar(c)
          || c == '\ufeff'
Copilot
AI
Nov 10, 2025
The explicit check for '\ufeff' on line 95 is now redundant since U+FEFF (Zero Width No-Break Space / BOM) is a FORMAT character and will be matched by Character.getType(c) == Character.FORMAT on line 96. Consider removing the explicit check on line 95 to avoid duplication.
          || c == '\ufeff'
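The redundancy claim can be checked directly against the JDK. The following is a small sketch, with a hypothetical class and helper name, that asks Character.getType for the category of U+FEFF.

```java
public class BomCategorySketch {

    // Hypothetical helper: true when the code point is in the Unicode FORMAT (Cf) category.
    public static boolean isFormat(int c) {
        return Character.getType(c) == Character.FORMAT;
    }

    public static void main(String[] args) {
        // U+FEFF (ZERO WIDTH NO-BREAK SPACE / BOM) is covered by the FORMAT check on its own.
        System.out.println("U+FEFF is FORMAT: " + isFormat(0xFEFF));
        // An ordinary space is not FORMAT; the other checks in isBlankChar handle it.
        System.out.println("U+0020 is FORMAT: " + isFormat(' '));
    }
}
```

Keeping the explicit comparison is harmless at runtime, so removing it is purely a readability choice.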
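Strings from external data sources (such as Word documents) occasionally contain special characters such as \u200B, which should be treated as whitespace. The isBlankChar method in Utils has therefore been updated to handle them. See the source code and the unit tests for details.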
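For context, here is a minimal sketch of the problem described above: a string containing only U+200B and a space is not recognized as empty by String.trim(), while a string-level check built on the updated logic does treat it as blank. The class name and the isBlank helper are hypothetical and not taken from Utils; isBlankChar is a local copy of the logic in this PR's diff.

```java
public class ZeroWidthSpaceSketch {

    // Local copy of the generalized check (assumption: same logic as the updated Utils.isBlankChar).
    static boolean isBlankChar(int c) {
        return Character.isWhitespace(c)
                || Character.isSpaceChar(c)
                || c == 0xFEFF
                || Character.getType(c) == Character.FORMAT;
    }

    // Hypothetical string-level helper: blank when every code point is a blank character.
    public static boolean isBlank(CharSequence s) {
        return s.codePoints().allMatch(ZeroWidthSpaceSketch::isBlankChar);
    }

    public static void main(String[] args) {
        // Simulates a cell value pasted from a Word document: zero-width space, space, zero-width space.
        String fromWord = "\u200B \u200B";
        System.out.println("trim().isEmpty(): " + fromWord.trim().isEmpty());
        System.out.println("isBlank(): " + isBlank(fromWord));
    }
}
```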