High-performance tokenizer implementations in Go with unified CLI. Features Llama 3 tokenizer with exact compatibility, streaming support, and comprehensive tooling.

agentstation/tokenizer

Tokenizer

A collection of high-performance tokenizer implementations in Go, exposed through a single unified CLI.

Installation

Using Homebrew (Recommended for macOS/Linux)

brew install agentstation/tap/tokenizer

Or tap the repository first:

brew tap agentstation/tap
brew install tokenizer

Download Binary

Download pre-built binaries from the releases page:

Platform  Architecture   File
Linux     x86_64         tokenizer_VERSION_linux_x86_64.tar.gz
Linux     ARM64          tokenizer_VERSION_linux_arm64.tar.gz
Linux     ARMv6          tokenizer_VERSION_linux_armv6.tar.gz
Linux     ARMv7          tokenizer_VERSION_linux_armv7.tar.gz
macOS     Intel          tokenizer_VERSION_darwin_x86_64.tar.gz
macOS     Apple Silicon  tokenizer_VERSION_darwin_arm64.tar.gz
Windows   x86_64         tokenizer_VERSION_windows_x86_64.zip
FreeBSD   x86_64         tokenizer_VERSION_freebsd_x86_64.tar.gz
FreeBSD   ARM64          tokenizer_VERSION_freebsd_arm64.tar.gz
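
If you script the download, the file name can be derived from the target platform. The following is a minimal sketch based only on the naming pattern in the table above. The helper archiveName and its parameters are hypothetical, not part of this project, and the exact form of VERSION (for example, whether it carries a leading "v") is not stated here, so it is passed through verbatim.

```go
package main

import (
	"fmt"
	"runtime"
)

// supported lists the OS/architecture pairs that appear in the table above.
var supported = map[string]bool{
	"linux_x86_64": true, "linux_arm64": true, "linux_armv6": true, "linux_armv7": true,
	"darwin_x86_64": true, "darwin_arm64": true,
	"windows_x86_64": true,
	"freebsd_x86_64": true, "freebsd_arm64": true,
}

// archiveName (hypothetical helper) builds a release archive name from Go-style
// platform identifiers. goarm is only consulted for 32-bit ARM ("6" or "7").
// version is inserted as given; its exact format is an assumption of the caller.
func archiveName(version, goos, goarch, goarm string) (string, error) {
	arch := goarch
	switch goarch {
	case "amd64":
		arch = "x86_64" // the release files use x86_64 rather than amd64
	case "arm":
		arch = "armv" + goarm
	}
	target := goos + "_" + arch
	if !supported[target] {
		return "", fmt.Errorf("no pre-built binary listed for %s/%s", goos, goarch)
	}
	ext := ".tar.gz"
	if goos == "windows" {
		ext = ".zip" // Windows is the only platform shipped as a zip archive
	}
	return "tokenizer_" + version + "_" + target + ext, nil
}

func main() {
	// "VERSION" is a placeholder, exactly as in the table.
	name, err := archiveName("VERSION", runtime.GOOS, runtime.GOARCH, "7")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name)
}
```

Platforms missing from the table return an error rather than a guessed file name.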

Using Go

go install github.com/agentstation/tokenizer/cmd/tokenizer@latest

Build from Source

# Clone the repository
git clone https://github.com/agentstation/tokenizer.git
cd tokenizer

# Build the binary
make build

# Or install directly
make install

CLI Tool

Quick usage:

# Encode text (implicit - recommended)
tokenizer llama3 "Hello, world!"

# Decode tokens
tokenizer llama3 decode 128000 9906 11 1917 0 128001

# Process large files (automatic pipe detection)
cat document.txt | tokenizer llama3

# Get tokenizer information
tokenizer llama3 info

See cmd/tokenizer/README.md for full CLI documentation.

Library Packages

llama3

A Go implementation of the Llama 3 tokenizer that is exactly compatible with the official Llama 3 tokenization.

Features:

  • Byte-level BPE tokenization
  • Support for all 256 special tokens
  • UTF-8 handling for multilingual text
  • Compatible with Llama 3, 3.1, 3.2, and 3.3 models

See llama3/README.md for detailed usage.

Installation

go get github.com/agentstation/tokenizer/llama3

Quick Start

CLI Usage

# Install via Homebrew
brew install agentstation/tap/tokenizer

# Encode text (simple, intuitive)
tokenizer llama3 "Hello, world!"
# Output: 128000 9906 11 1917 0 128001

# Decode tokens
tokenizer llama3 decode 128000 9906 11 1917 0 128001
# Output: <|begin_of_text|>Hello, world!<|end_of_text|>

# Process a file through a pipe (detected automatically)
cat document.txt | tokenizer llama3

# Get help
tokenizer llama3 help

Library Usage

package main

import (
    "fmt"
    "github.com/agentstation/tokenizer/llama3"
)

func main() {
    tokenizer, err := llama3.New()
    if err != nil {
        panic(err)
    }
    
    // Encode text to tokens
    tokens := tokenizer.Encode("Hello world!", nil)
    fmt.Printf("Tokens: %v\n", tokens)
    
    // Decode tokens back to text
    text := tokenizer.Decode(tokens)
    fmt.Printf("Text: %s\n", text)
}

Development

Prerequisites

  • Go 1.24.5 or later
  • Devbox (optional, for consistent development environment)

Setup Development Environment

# Using Devbox (recommended)
make devbox

# Or install dependencies manually
make deps

Common Development Tasks

# Run tests
make test

# Run benchmarks
make bench

# Generate coverage report
make coverage

# Run linter
make lint

# Format code
make fmt

# Build for all platforms
make build-all

# Generate documentation
make generate

Release Process

  1. Create and push a new tag:

    make tag VERSION=v1.0.0
    git push origin v1.0.0
  2. The GitHub Actions workflow will automatically:

    • Run tests
    • Build binaries for all platforms
    • Create a GitHub release with changelog
    • Upload binaries and checksums

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT

tokenizer

import "github.com/agentstation/tokenizer"

Package tokenizer provides a collection of high-performance tokenizer implementations.
