test: add tests for tokenizer package by anirudh240 · Pull Request #1046 · volcano-sh/kthena

anirudh240 · 2026-05-14T06:09:35Z

/kind enhancement

What this PR does / why we need it:
Adds unit tests for the tokenizer package, which previously had zero coverage. Covers SimpleEstimateTokenizer and TickToken.

Which issue(s) this PR fixes:
Fixes #1045

Special notes for your reviewer:
nothing

Does this PR introduce a user-facing change?:

Signed-off-by: Anirudh <2410030013@klh.edu.in>

gemini-code-assist

Code Review

This pull request introduces unit tests for the SimpleEstimateTokenizer and TickToken implementations to verify token count calculations. The reviewer identified several typos in the documentation comments and suggested improvements to the test suite, such as adding test cases for multi-byte UTF-8 characters to clarify counting logic and using exact token counts instead of ranges for deterministic tests. Additionally, the reviewer noted that the wantErr field in the test structures is currently unused.

gemini-code-assist · 2026-05-14T06:11:14Z

+			want:   11, // ceil(43/4)
+		},


The current implementation of SimpleEstimateTokenizer uses len(prompt), which counts bytes. It would be beneficial to add a test case with multi-byte UTF-8 characters to clarify whether the heuristic should be based on bytes or runes (characters).

			want:   11, // ceil(43/4)
		},
		{
			name:   "multi-byte characters (UTF-8)",
			prompt: "世界", // 6 bytes, 2 runes
			want:   2,    // ceil(6/4)
		},
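The byte-versus-rune distinction raised in this comment can be checked in isolation. The following is a minimal sketch, not the package's code: `estimateByBytes` and `estimateByRunes` are hypothetical helpers that assume the heuristic is a ceiling division by 4, as the `ceil(43/4)` comment in the diff suggests.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// estimateByBytes is a hypothetical stand-in for the heuristic discussed in
// the review: ceil(len(prompt)/4), where len counts bytes, not characters.
func estimateByBytes(prompt string) int {
	return (len(prompt) + 3) / 4
}

// estimateByRunes is the rune-based alternative: it counts Unicode code
// points before applying the same ceiling division.
func estimateByRunes(prompt string) int {
	return (utf8.RuneCountInString(prompt) + 3) / 4
}

func main() {
	// For ASCII input both estimates agree; for multi-byte input they diverge.
	for _, p := range []string{"", "hello", "世界"} {
		fmt.Printf("%q: bytes=%d runes=%d byteEstimate=%d runeEstimate=%d\n",
			p, len(p), utf8.RuneCountInString(p),
			estimateByBytes(p), estimateByRunes(p))
	}
}
```

For "世界" the byte-based estimate is 2 while the rune-based estimate is 1, so a test case like the one suggested above pins down which interpretation the package intends.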

gemini-code-assist · 2026-05-14T06:11:14Z

+	tests := []struct {
+		name    string
+		prompt  string
+		wantMin int
+		wantMax int
+		wantErr bool
+	}{
+		{
+			name:    "empty string returns 0 tokens",
+			prompt:  "",
+			wantMin: 0,
+			wantMax: 0,
+		},
+		{
+			name:    "simple english word",
+			prompt:  "hello",
+			wantMin: 1,
+			wantMax: 3,
+		},
+		{
+			name:    "sentence produces reasonable token count",
+			prompt:  "The quick brown fox",
+			wantMin: 3,
+			wantMax: 8,
+		},
+		{
+			name:    "longer prompt produces more tokens than shorter",
+			prompt:  "This is a much longer sentence that should produce more tokens than a single word",
+			wantMin: 10,
+			wantMax: 30,
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			got, err := tokenizer.CalculateTokenNum(tt.prompt)
+			if tt.wantErr {
+				require.Error(t, err)
+			} else {
+				require.NoError(t, err)
+				assert.GreaterOrEqual(t, got, tt.wantMin)
+				assert.LessOrEqual(t, got, tt.wantMax)
+			}
+		})
+	}


Since TickToken uses the deterministic cl100k_base encoding, the tests should assert exact token counts instead of ranges. This provides better protection against regressions. Additionally, the wantErr field is currently unused in the test cases; consider adding a case that exercises the error path if possible, or removing it if errors are not expected in this context.

	tests := []struct {
		name    string
		prompt  string
		want    int
		wantErr bool
	}{
		{
			name:   "empty string returns 0 tokens",
			prompt: "",
			want:   0,
		},
		{
			name:   "simple english word",
			prompt: "hello",
			want:   1,
		},
		{
			name:   "sentence produces reasonable token count",
			prompt: "The quick brown fox",
			want:   4,
		},
		{
			name:   "longer prompt produces more tokens than shorter",
			prompt: "This is a much longer sentence that should produce more tokens than a single word",
			want:   15,
		},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			got, err := tokenizer.CalculateTokenNum(tt.prompt)
			if tt.wantErr {
				require.Error(t, err)
			} else {
				require.NoError(t, err)
				assert.Equal(t, tt.want, got)
			}
		})
	}
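The unused `wantErr` field could be exercised with a stub. This is a rough sketch only: the `Tokenizer` interface, `failingTokenizer`, `byteEstimator`, and `checkCase` are hypothetical names, with the method shape taken from the `CalculateTokenNum(prompt)` call in the diff. The thread does not show whether the real TickToken can be made to fail.

```go
package main

import (
	"errors"
	"fmt"
)

// Tokenizer mirrors the call shape seen in the test; the interface itself
// is an assumption made for this sketch.
type Tokenizer interface {
	CalculateTokenNum(prompt string) (int, error)
}

// failingTokenizer is a hypothetical stub that always returns an error.
type failingTokenizer struct{}

func (failingTokenizer) CalculateTokenNum(string) (int, error) {
	return 0, errors.New("encoding unavailable")
}

// byteEstimator is a hypothetical stub that never fails.
type byteEstimator struct{}

func (byteEstimator) CalculateTokenNum(prompt string) (int, error) {
	return (len(prompt) + 3) / 4, nil
}

// checkCase returns nil when the tokenizer behaves as the case expects.
func checkCase(tok Tokenizer, prompt string, want int, wantErr bool) error {
	got, err := tok.CalculateTokenNum(prompt)
	if wantErr {
		if err == nil {
			return errors.New("expected an error, got nil")
		}
		return nil
	}
	if err != nil {
		return fmt.Errorf("unexpected error: %w", err)
	}
	if got != want {
		return fmt.Errorf("got %d tokens, want %d", got, want)
	}
	return nil
}

func main() {
	fmt.Println(checkCase(byteEstimator{}, "hello", 2, false))
	fmt.Println(checkCase(failingTokenizer{}, "hello", 0, true))
}
```

If no input can trigger an error in the real implementation, dropping `wantErr` from the table, as the review also suggests, keeps the test simpler.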

Copilot

Pull request overview

Adds initial unit test coverage for the pkg/kthena-router/filters/tokenizer package (issue #1045), focusing on validating the 4-chars-per-token heuristic (SimpleEstimateTokenizer) and basic sanity checks for the TickToken (cl100k_base, offline BPE) implementation.

Changes:

Added table-driven tests for SimpleEstimateTokenizer.CalculateTokenNum covering rounding/ceiling behavior and basic inputs.
Added basic “reasonable range” tests for TickToken.CalculateTokenNum using offline tokenization.
Introduced new tokenizer test file to establish baseline coverage for a previously untested package.

Comments suppressed due to low confidence (3)

pkg/kthena-router/filters/tokenizer/tokenizer_test.go:28

Spelling in test comment: "cieling divison" should be "ceiling division".

// including the cieling divison for non-multiples of 4

pkg/kthena-router/filters/tokenizer/tokenizer_test.go:88

Spelling in test comment: "resonable" should be "reasonable".

// TestTickToken verifies that TickToken produces resonable token counts

pkg/kthena-router/filters/tokenizer/tokenizer_test.go:89

Spelling/formatting in test comment: "cl100k_ base" should be "cl100k_base".

// using the cl100k_ base encoding via the offline BPE loader

+				require.NoError(t, err)
+				assert.GreaterOrEqual(t, got, tt.wantMin)
+				assert.LessOrEqual(t, got, tt.wantMax)
+			}


Signed-off-by: Anirudh <2410030013@klh.edu.in>

StLeoX · 2026-05-14T08:49:55Z

Perhaps a more widely used dataset should be introduced to replace a few simple cases?

anirudh240 · 2026-05-14T09:38:05Z

> Perhaps a more widely used dataset should be introduced to replace a few simple cases?

Right, but I feel that would be slightly overkill, considering these are just unit tests where I can use simple, controlled inputs.

hzxuzhonghu

/lgtm

volcano-sh-bot · 2026-05-18T02:45:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu

Needs approval from an approver in each of these files:

~~pkg/OWNERS~~ [hzxuzhonghu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

test: add tests for tokenizer package

8dacac7

Signed-off-by: Anirudh <2410030013@klh.edu.in>

Copilot AI review requested due to automatic review settings May 14, 2026 06:09

volcano-sh-bot added the kind/enhancement New feature or request label May 14, 2026

volcano-sh-bot requested review from YaoZengzeng and hzxuzhonghu May 14, 2026 06:09

volcano-sh-bot added the size/L label May 14, 2026

gemini-code-assist Bot reviewed May 14, 2026

Copilot AI reviewed May 14, 2026

Comment thread pkg/kthena-router/filters/tokenizer/tokenizer_test.go Outdated

Comment thread pkg/kthena-router/filters/tokenizer/tokenizer_test.go

Comment on lines +133 to +136

				require.NoError(t, err)
				assert.GreaterOrEqual(t, got, tt.wantMin)
				assert.LessOrEqual(t, got, tt.wantMax)
			}

fixing spelling errors in comments

b575a70

Signed-off-by: Anirudh <2410030013@klh.edu.in>

hzxuzhonghu approved these changes May 18, 2026

volcano-sh-bot assigned hzxuzhonghu May 18, 2026

volcano-sh-bot added the lgtm label May 18, 2026

volcano-sh-bot added the approved label May 18, 2026

volcano-sh-bot merged commit b510621 into volcano-sh:main May 18, 2026
16 of 17 checks passed

volcano-sh-bot merged 2 commits into volcano-sh:main from anirudh240:test/add-tests-tokenizer

5 participants
