Tokenizer

A collection of high-performance tokenizer implementations in Go with a unified CLI interface.

Installation

Using Homebrew (Recommended for macOS/Linux)

brew install agentstation/tap/tokenizer

Or tap the repository first:

brew tap agentstation/tap
brew install tokenizer

Download Binary

Download pre-built binaries from the releases page:

Platform	Architecture	File
Linux	x86_64	`tokenizer_VERSION_linux_x86_64.tar.gz`
Linux	ARM64	`tokenizer_VERSION_linux_arm64.tar.gz`
Linux	ARMv6	`tokenizer_VERSION_linux_armv6.tar.gz`
Linux	ARMv7	`tokenizer_VERSION_linux_armv7.tar.gz`
macOS	Intel	`tokenizer_VERSION_darwin_x86_64.tar.gz`
macOS	Apple Silicon	`tokenizer_VERSION_darwin_arm64.tar.gz`
Windows	x86_64	`tokenizer_VERSION_windows_x86_64.zip`
FreeBSD	x86_64	`tokenizer_VERSION_freebsd_x86_64.tar.gz`
FreeBSD	ARM64	`tokenizer_VERSION_freebsd_arm64.tar.gz`
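The naming pattern in the table can be derived from Go's own platform identifiers. The sketch below is illustrative only: `archiveName` is a hypothetical helper, not part of this repository, and it assumes that `VERSION` is substituted verbatim (whether the real files carry a `v` prefix is not stated here). 32-bit ARM is deliberately left unmapped because choosing between armv6 and armv7 needs information the `runtime` package does not expose.

```go
package main

import (
	"fmt"
	"runtime"
)

// archiveName builds a release file name following the table's pattern.
// goos and goarch use Go's naming (e.g. "linux", "amd64").
// Hypothetical helper for illustration; not provided by the project.
func archiveName(version, goos, goarch string) (string, error) {
	// Map Go architecture names to the names used in the release files.
	archNames := map[string]string{"amd64": "x86_64", "arm64": "arm64"}
	arch, ok := archNames[goarch]
	if !ok {
		return "", fmt.Errorf("no archive mapped for architecture %q", goarch)
	}

	ext := "tar.gz"
	switch goos {
	case "linux", "darwin", "freebsd":
		// These platforms list both x86_64 and ARM64 archives.
	case "windows":
		// The table lists only an x86_64 zip for Windows.
		if arch != "x86_64" {
			return "", fmt.Errorf("no archive listed for windows/%s", goarch)
		}
		ext = "zip"
	default:
		return "", fmt.Errorf("no archive listed for OS %q", goos)
	}
	return fmt.Sprintf("tokenizer_%s_%s_%s.%s", version, goos, arch, ext), nil
}

func main() {
	name, err := archiveName("VERSION", runtime.GOOS, runtime.GOARCH)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name)
}
```

Replace `VERSION` with the release you want before looking the file up on the releases page.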

Using Go

go install github.com/agentstation/tokenizer/cmd/tokenizer@latest

Build from Source

# Clone the repository
git clone https://github.com/agentstation/tokenizer.git
cd tokenizer

# Build the binary
make build

# Or install directly
make install

CLI Tool

Quick usage:

# Encode text (the encode command is implicit; recommended)
tokenizer llama3 "Hello, world!"

# Decode tokens
tokenizer llama3 decode 128000 9906 11 1917 0 128001

# Process large files (automatic pipe detection)
cat document.txt | tokenizer llama3

# Get tokenizer information
tokenizer llama3 info

See cmd/tokenizer/README.md for full CLI documentation.

Library Packages

llama3

A Go implementation of the Llama 3 tokenizer, providing exact compatibility with the official Llama 3 tokenization.

Features:

- Byte-level BPE tokenization
- Support for all 256 special tokens
- UTF-8 handling for multilingual text
- Compatible with Llama 3, 3.1, 3.2, and 3.3 models

See llama3/README.md for detailed usage.

Installation

go get github.com/agentstation/tokenizer/llama3

Quick Start

CLI Usage

# Install via Homebrew
brew install agentstation/tap/tokenizer

# Encode text (simple, intuitive)
tokenizer llama3 "Hello, world!"
# Output: 128000 9906 11 1917 0 128001

# Decode tokens
tokenizer llama3 decode 128000 9906 11 1917 0 128001
# Output: <|begin_of_text|>Hello, world!<|end_of_text|>

# Process a file via stdin (pipe is detected automatically)
cat document.txt | tokenizer llama3

# Get help
tokenizer llama3 help
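If you capture the CLI's encode output from another program, the token IDs can be parsed back into integers. This is a minimal sketch that assumes the output is whitespace-separated decimal IDs, as in the sample output above; `parseTokenIDs` is a hypothetical helper, and other output formats the CLI may offer are not covered.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTokenIDs converts whitespace-separated decimal token IDs into a slice.
// It returns an error on the first field that is not a valid integer.
func parseTokenIDs(output string) ([]int, error) {
	fields := strings.Fields(output)
	ids := make([]int, 0, len(fields))
	for _, f := range fields {
		id, err := strconv.Atoi(f)
		if err != nil {
			return nil, fmt.Errorf("invalid token ID %q: %w", f, err)
		}
		ids = append(ids, id)
	}
	return ids, nil
}

func main() {
	// Sample output copied from the encode example above.
	ids, err := parseTokenIDs("128000 9906 11 1917 0 128001")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(ids), "tokens:", ids)
}
```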

Library Usage

package main

import (
    "fmt"
    "github.com/agentstation/tokenizer/llama3"
)

func main() {
    tokenizer, err := llama3.New()
    if err != nil {
        panic(err)
    }
    
    // Encode text to tokens
    tokens := tokenizer.Encode("Hello world!", nil)
    fmt.Printf("Tokens: %v\n", tokens)
    
    // Decode tokens back to text
    text := tokenizer.Decode(tokens)
    fmt.Printf("Text: %s\n", text)
}

Development

Prerequisites

- Go 1.24.5 or later
- Devbox (optional, for consistent development environment)

Setup Development Environment

# Using Devbox (recommended)
make devbox

# Or install dependencies manually
make deps

Common Development Tasks

# Run tests
make test

# Run benchmarks
make bench

# Generate coverage report
make coverage

# Run linter
make lint

# Format code
make fmt

# Build for all platforms
make build-all

# Generate documentation
make generate

Release Process

Create and push a new tag:

make tag VERSION=v1.0.0
git push origin v1.0.0

The GitHub Actions workflow will automatically:
- Run tests
- Build binaries for all platforms
- Create a GitHub release with changelog
- Upload binaries and checksums

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT

tokenizer

import "github.com/agentstation/tokenizer"

Package tokenizer provides a collection of high-performance tokenizer implementations.

Index

Generated by gomarkdoc
