test: add tests for tokenizer package#1046
Conversation
Signed-off-by: Anirudh <2410030013@klh.edu.in>
There was a problem hiding this comment.
Code Review
This pull request introduces unit tests for the SimpleEstimateTokenizer and TickToken implementations to verify token count calculations. The reviewer identified several typos in the documentation comments and suggested improvements to the test suite, such as adding test cases for multi-byte UTF-8 characters to clarify counting logic and using exact token counts instead of ranges for deterministic tests. Additionally, the reviewer noted that the wantErr field in the test structures is currently unused.
| want: 11, // ceil(43/4) | ||
| }, |
There was a problem hiding this comment.
The current implementation of SimpleEstimateTokenizer uses len(prompt), which counts bytes. It would be beneficial to add a test case with multi-byte UTF-8 characters to clarify whether the heuristic should be based on bytes or runes (characters).
want: 11, // ceil(43/4)
},
{
name: "multi-byte characters (UTF-8)",
prompt: "世界", // 6 bytes, 2 runes
want: 2, // ceil(6/4)
},| tests := []struct { | ||
| name string | ||
| prompt string | ||
| wantMin int | ||
| wantMax int | ||
| wantErr bool | ||
| }{ | ||
| { | ||
| name: "empty string returns 0 tokens", | ||
| prompt: "", | ||
| wantMin: 0, | ||
| wantMax: 0, | ||
| }, | ||
| { | ||
| name: "simple english word", | ||
| prompt: "hello", | ||
| wantMin: 1, | ||
| wantMax: 3, | ||
| }, | ||
| { | ||
| name: "sentence produces reasonable token count", | ||
| prompt: "The quick brown fox", | ||
| wantMin: 3, | ||
| wantMax: 8, | ||
| }, | ||
| { | ||
| name: "longer prompt produces more tokens than shorter", | ||
| prompt: "This is a much longer sentence that should produce more tokens than a single word", | ||
| wantMin: 10, | ||
| wantMax: 30, | ||
| }, | ||
| } | ||
|
|
||
| for _, tt := range tests { | ||
| t.Run(tt.name, func(t *testing.T) { | ||
| got, err := tokenizer.CalculateTokenNum(tt.prompt) | ||
| if tt.wantErr { | ||
| require.Error(t, err) | ||
| } else { | ||
| require.NoError(t, err) | ||
| assert.GreaterOrEqual(t, got, tt.wantMin) | ||
| assert.LessOrEqual(t, got, tt.wantMax) | ||
| } | ||
| }) | ||
| } |
There was a problem hiding this comment.
Since TickToken uses the deterministic cl100k_base encoding, the tests should assert exact token counts instead of ranges. This provides better protection against regressions. Additionally, the wantErr field is currently unused in the test cases; consider adding a case that exercises the error path if possible, or removing it if errors are not expected in this context.
tests := []struct {
name string
prompt string
want int
wantErr bool
}{
{
name: "empty string returns 0 tokens",
prompt: "",
want: 0,
},
{
name: "simple english word",
prompt: "hello",
want: 1,
},
{
name: "sentence produces reasonable token count",
prompt: "The quick brown fox",
want: 4,
},
{
name: "longer prompt produces more tokens than shorter",
prompt: "This is a much longer sentence that should produce more tokens than a single word",
want: 15,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got, err := tokenizer.CalculateTokenNum(tt.prompt)
if tt.wantErr {
require.Error(t, err)
} else {
require.NoError(t, err)
assert.Equal(t, tt.want, got)
}
})
}There was a problem hiding this comment.
Pull request overview
Adds initial unit test coverage for the pkg/kthena-router/filters/tokenizer package (issue #1045), focusing on validating the 4-chars-per-token heuristic (SimpleEstimateTokenizer) and basic sanity checks for the TickToken (cl100k_base, offline BPE) implementation.
Changes:
- Added table-driven tests for
SimpleEstimateTokenizer.CalculateTokenNumcovering rounding/ceiling behavior and basic inputs. - Added basic “reasonable range” tests for
TickToken.CalculateTokenNumusing offline tokenization. - Introduced new tokenizer test file to establish baseline coverage for a previously untested package.
Comments suppressed due to low confidence (3)
pkg/kthena-router/filters/tokenizer/tokenizer_test.go:28
- Spelling in test comment: "cieling divison" should be "ceiling division".
// including the cieling divison for non-multiples of 4
pkg/kthena-router/filters/tokenizer/tokenizer_test.go:88
- Spelling in test comment: "resonable" should be "reasonable".
// TestTickToken verifies that TickToken produces resonable token counts
pkg/kthena-router/filters/tokenizer/tokenizer_test.go:89
- Spelling/formatting in test comment: "cl100k_ base" should be "cl100k_base".
// using the cl100k_ base encoding via the offline BPE loader
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| require.NoError(t, err) | ||
| assert.GreaterOrEqual(t, got, tt.wantMin) | ||
| assert.LessOrEqual(t, got, tt.wantMax) | ||
| } |
Signed-off-by: Anirudh <2410030013@klh.edu.in>
|
Perhaps a more widely used dataset should be introduced to replace a few simple cases? |
right but i feel theyre slightly an overkill considering these are just unit tests where i could just use simple controlled inputs |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hzxuzhonghu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/kind enhancement
What this PR does / why we need it:
unit tests for the tokenizer package which had zero coverage. Covers SimpleEstimateTokenizer and TickToken.
Which issue(s) this PR fixes:
Fixes #1045
Special notes for your reviewer:
nothing
Does this PR introduce a user-facing change?: