Skip to content

test: add tests for tokenizer package#1046

Merged
volcano-sh-bot merged 2 commits into
volcano-sh:mainfrom
anirudh240:test/add-tests-tokenizer
May 18, 2026
Merged

test: add tests for tokenizer package#1046
volcano-sh-bot merged 2 commits into
volcano-sh:mainfrom
anirudh240:test/add-tests-tokenizer

Conversation

@anirudh240
Copy link
Copy Markdown
Contributor

@anirudh240 anirudh240 commented May 14, 2026

/kind enhancement

What this PR does / why we need it:
unit tests for the tokenizer package which had zero coverage. Covers SimpleEstimateTokenizer and TickToken.

Which issue(s) this PR fixes:
Fixes #1045

Special notes for your reviewer:
nothing

Does this PR introduce a user-facing change?:


Signed-off-by: Anirudh <2410030013@klh.edu.in>
Copilot AI review requested due to automatic review settings May 14, 2026 06:09
@volcano-sh-bot volcano-sh-bot added the kind/enhancement New feature or request label May 14, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces unit tests for the SimpleEstimateTokenizer and TickToken implementations to verify token count calculations. The reviewer identified several typos in the documentation comments and suggested improvements to the test suite, such as adding test cases for multi-byte UTF-8 characters to clarify counting logic and using exact token counts instead of ranges for deterministic tests. Additionally, the reviewer noted that the wantErr field in the test structures is currently unused.

Comment thread pkg/kthena-router/filters/tokenizer/tokenizer_test.go Outdated
Comment on lines +71 to +72
want: 11, // ceil(43/4)
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation of SimpleEstimateTokenizer uses len(prompt), which counts bytes. It would be beneficial to add a test case with multi-byte UTF-8 characters to clarify whether the heuristic should be based on bytes or runes (characters).

			want:   11, // ceil(43/4)
		},
		{
			name:   "multi-byte characters (UTF-8)",
			prompt: "世界", // 6 bytes, 2 runes
			want:   2,    // ceil(6/4)
		},

Comment thread pkg/kthena-router/filters/tokenizer/tokenizer_test.go Outdated
Comment on lines +94 to +138
tests := []struct {
name string
prompt string
wantMin int
wantMax int
wantErr bool
}{
{
name: "empty string returns 0 tokens",
prompt: "",
wantMin: 0,
wantMax: 0,
},
{
name: "simple english word",
prompt: "hello",
wantMin: 1,
wantMax: 3,
},
{
name: "sentence produces reasonable token count",
prompt: "The quick brown fox",
wantMin: 3,
wantMax: 8,
},
{
name: "longer prompt produces more tokens than shorter",
prompt: "This is a much longer sentence that should produce more tokens than a single word",
wantMin: 10,
wantMax: 30,
},
}

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got, err := tokenizer.CalculateTokenNum(tt.prompt)
if tt.wantErr {
require.Error(t, err)
} else {
require.NoError(t, err)
assert.GreaterOrEqual(t, got, tt.wantMin)
assert.LessOrEqual(t, got, tt.wantMax)
}
})
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since TickToken uses the deterministic cl100k_base encoding, the tests should assert exact token counts instead of ranges. This provides better protection against regressions. Additionally, the wantErr field is currently unused in the test cases; consider adding a case that exercises the error path if possible, or removing it if errors are not expected in this context.

	tests := []struct {
		name    string
		prompt  string
		want    int
		wantErr bool
	}{
		{
			name:    "empty string returns 0 tokens",
			prompt:  "",
			want:    0,
		},
		{
			name:    "simple english word",
			prompt:  "hello",
			want:    1,
		},
		{
			name:    "sentence produces reasonable token count",
			prompt:  "The quick brown fox",
			want:    4,
		},
		{
			name:    "longer prompt produces more tokens than shorter",
			prompt:  "This is a much longer sentence that should produce more tokens than a single word",
			want:    15,
		},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			got, err := tokenizer.CalculateTokenNum(tt.prompt)
			if tt.wantErr {
				require.Error(t, err)
			} else {
				require.NoError(t, err)
				assert.Equal(t, tt.want, got)
			}
		})
	}

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds initial unit test coverage for the pkg/kthena-router/filters/tokenizer package (issue #1045), focusing on validating the 4-chars-per-token heuristic (SimpleEstimateTokenizer) and basic sanity checks for the TickToken (cl100k_base, offline BPE) implementation.

Changes:

  • Added table-driven tests for SimpleEstimateTokenizer.CalculateTokenNum covering rounding/ceiling behavior and basic inputs.
  • Added basic “reasonable range” tests for TickToken.CalculateTokenNum using offline tokenization.
  • Introduced new tokenizer test file to establish baseline coverage for a previously untested package.
Comments suppressed due to low confidence (3)

pkg/kthena-router/filters/tokenizer/tokenizer_test.go:28

  • Spelling in test comment: "cieling divison" should be "ceiling division".
// including the cieling divison for non-multiples of 4

pkg/kthena-router/filters/tokenizer/tokenizer_test.go:88

  • Spelling in test comment: "resonable" should be "reasonable".
// TestTickToken verifies that TickToken produces resonable token counts

pkg/kthena-router/filters/tokenizer/tokenizer_test.go:89

  • Spelling/formatting in test comment: "cl100k_ base" should be "cl100k_base".
// using the cl100k_ base encoding via the offline BPE loader

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/kthena-router/filters/tokenizer/tokenizer_test.go Outdated
Comment on lines +133 to +136
require.NoError(t, err)
assert.GreaterOrEqual(t, got, tt.wantMin)
assert.LessOrEqual(t, got, tt.wantMax)
}
Signed-off-by: Anirudh <2410030013@klh.edu.in>
@StLeoX
Copy link
Copy Markdown
Contributor

StLeoX commented May 14, 2026

Perhaps a more widely used dataset should be introduced to replace a few simple cases?

@anirudh240
Copy link
Copy Markdown
Contributor Author

Perhaps a more widely used dataset should be introduced to replace a few simple cases?

right but i feel theyre slightly an overkill considering these are just unit tests where i could just use simple controlled inputs

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot merged commit b510621 into volcano-sh:main May 18, 2026
16 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test: add tests for tokenizer package

5 participants