A collection of high-performance tokenizer implementations in Go with a unified CLI interface.
brew install agentstation/tap/tokenizer
Or tap the repository first:
brew tap agentstation/tap
brew install tokenizer
Download pre-built binaries from the releases page:
| Platform | Architecture | File |
|---|---|---|
| Linux | x86_64 | tokenizer_VERSION_linux_x86_64.tar.gz |
| Linux | ARM64 | tokenizer_VERSION_linux_arm64.tar.gz |
| Linux | ARMv6 | tokenizer_VERSION_linux_armv6.tar.gz |
| Linux | ARMv7 | tokenizer_VERSION_linux_armv7.tar.gz |
| macOS | Intel | tokenizer_VERSION_darwin_x86_64.tar.gz |
| macOS | Apple Silicon | tokenizer_VERSION_darwin_arm64.tar.gz |
| Windows | x86_64 | tokenizer_VERSION_windows_x86_64.zip |
| FreeBSD | x86_64 | tokenizer_VERSION_freebsd_x86_64.tar.gz |
| FreeBSD | ARM64 | tokenizer_VERSION_freebsd_arm64.tar.gz |
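For example, the asset name for a given release can be assembled from the pattern above. This is a sketch: the exact release version and the `v`-prefix on the download tag are assumptions based on common GitHub release conventions, so check the releases page for the real names.

```shell
# Assemble the asset name for a hypothetical 1.0.0 release on Linux x86_64.
VERSION=1.0.0
OS=linux
ARCH=x86_64
FILE="tokenizer_${VERSION}_${OS}_${ARCH}.tar.gz"
echo "$FILE"
# → tokenizer_1.0.0_linux_x86_64.tar.gz

# Then download and extract it (URL shape is an assumption):
# curl -fsSLO "https://github.com/agentstation/tokenizer/releases/download/v${VERSION}/${FILE}"
# tar -xzf "$FILE"
```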
go install github.com/agentstation/tokenizer/cmd/tokenizer@latest
# Clone the repository
git clone https://github.com/agentstation/tokenizer.git
cd tokenizer
# Build the binary
make build
# Or install directly
make install
Quick usage:
# Encode text (encode is the default subcommand, so it can be omitted)
tokenizer llama3 "Hello, world!"
# Decode tokens
tokenizer llama3 decode 128000 9906 11 1917 0 128001
# Process large files (automatic pipe detection)
cat document.txt | tokenizer llama3
# Get tokenizer information
tokenizer llama3 info
See cmd/tokenizer/README.md for full CLI documentation.
A Go implementation of the Llama 3 tokenizer, providing exact compatibility with the official Llama 3 tokenization.
Features:
- Byte-level BPE tokenization
- Support for all 256 special tokens
- UTF-8 handling for multilingual text
- Compatible with Llama 3, 3.1, 3.2, and 3.3 models
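Byte-level BPE is what makes the multilingual handling possible: every UTF-8 string decomposes into bytes in the range 0–255, so the base vocabulary covers any input with no unknown-token fallback. The sketch below illustrates that property only; it is not the library's implementation.

```go
package main

import "fmt"

func main() {
	// Any UTF-8 string, in any language, is a sequence of bytes 0-255.
	// A byte-level BPE builds its merges on top of these 256 base symbols,
	// so no character can ever be "out of vocabulary".
	for _, b := range []byte("héllo") {
		fmt.Printf("%d ", b)
	}
	fmt.Println()
	// Prints: 104 195 169 108 108 111
	// ("é" encodes as the two bytes 0xC3 0xA9)
}
```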
See llama3/README.md for detailed usage.
go get github.com/agentstation/tokenizer/llama3
# Install via Homebrew
brew install agentstation/tap/tokenizer
# Encode text (simple, intuitive)
tokenizer llama3 "Hello, world!"
# Output: 128000 9906 11 1917 0 128001
# Decode tokens
tokenizer llama3 decode 128000 9906 11 1917 0 128001
# Output: <|begin_of_text|>Hello, world!<|end_of_text|>
# Process from files (automatic)
cat document.txt | tokenizer llama3
# Get help
tokenizer llama3 help
package main

import (
	"fmt"

	"github.com/agentstation/tokenizer/llama3"
)

func main() {
	tokenizer, err := llama3.New()
	if err != nil {
		panic(err)
	}

	// Encode text to tokens
	tokens := tokenizer.Encode("Hello world!", nil)
	fmt.Printf("Tokens: %v\n", tokens)

	// Decode tokens back to text
	text := tokenizer.Decode(tokens)
	fmt.Printf("Text: %s\n", text)
}
- Go 1.24.5 or later
- Devbox (optional, for a consistent development environment)
# Using Devbox (recommended)
make devbox
# Or install dependencies manually
make deps
# Run tests
make test
# Run benchmarks
make bench
# Generate coverage report
make coverage
# Run linter
make lint
# Format code
make fmt
# Build for all platforms
make build-all
# Generate documentation
make generate
- Create and push a new tag:

  make tag VERSION=v1.0.0
  git push origin v1.0.0
- The GitHub Actions workflow will automatically:
  - Run tests
  - Build binaries for all platforms
  - Create a GitHub release with changelog
  - Upload binaries and checksums
Contributions are welcome! Please feel free to submit a Pull Request.
MIT
import "github.com/agentstation/tokenizer"
Package tokenizer provides a collection of high-performance tokenizer implementations.
Generated by gomarkdoc