🎙️ Gemini ASR Transcription Tool

English | 简体中文 | 繁體中文

A Python tool that uses Google Gemini API to transcribe video or audio files into SRT subtitle files.

✨ Features

🎥 Supports various video (mp4, avi, mkv) and audio (mp3, wav) formats.
✂️ Automatically splits long files into smaller chunks for processing.
🧵 Uses multi-threading for parallel processing to speed up transcription.
🔄 Optionally rotates between multiple Google API Keys to improve request success rate.
⏱️ Generates SRT subtitle files with precise timestamps (millisecond accuracy).
🎬 Option to clip specific time segments of videos or audio for transcription.
📄 Option to save original transcription text returned by Gemini API.
💬 Supports passing additional prompts (text or file path) to guide the transcription model.
📁 Can process a single file or all supported files in an entire directory.
⏩ Provides --skip-existing option to avoid reprocessing files that already have subtitles.
🐞 Supports DEBUG mode for detailed logging output.
🌈 Uses colored logs for easy distinction between different message levels.
🔗 Supports proxy or custom server side like gemini-balance.
- If you want to use gemini-balance, you need to set the BASE_URL environment variable to https://your-custom-url.com/.
- Note: code execution should be closed if you use gemini-balance.
⚙️ TOML Configuration Support: Comprehensive configuration management system with multiple sources.
- 📝 Configuration file support with automatic search in multiple locations
- 🔄 Multi-source configuration merging (CLI > TOML > Environment > Defaults)
- 🎛️ Easy management of all settings in a single configuration file

🔧 Installation

🛠️ Environment Setup

Install Python: Python 3.8 or higher is recommended.
Install uv: If you haven't installed uv yet, please refer to the uv official documentation for installation. uv is an extremely fast Python package installer and manager.

📦 Dependencies

Simply use uv run to automatically install and run the script:

uv run gemini_asr.py -i video.mp4

This will automatically install all required dependencies and execute the script.

🔑 API Keys Configuration

Get Google API Key: Go to Google AI Studio to obtain your API key. You can get multiple keys to improve processing efficiency.

Configuration Methods (choose one):

Option A: TOML Configuration File (Recommended)

# Copy the example configuration file
cp config.example.toml config.toml

# Edit config.toml and add your API keys
nano config.toml

In config.toml:

[api]
google_api_keys = ["YOUR_API_KEY_1", "YOUR_API_KEY_2", "YOUR_API_KEY_3"]

Option B: Environment Variables

# Set environment variable with comma-separated keys
export GOOGLE_API_KEY=YOUR_API_KEY_1,YOUR_API_KEY_2,YOUR_API_KEY_3

Option C: .env File

# Create .env file in project root
echo "GOOGLE_API_KEY=YOUR_API_KEY_1,YOUR_API_KEY_2,YOUR_API_KEY_3" > .env

⚙️ Configuration System

GeminiASR supports a flexible configuration system with the following priority order:

Command-line arguments (highest priority)
TOML configuration file
Environment variables
Default values (lowest priority)

Configuration File Locations (searched in order):

./config.toml (current directory)
./.geminiasr/config.toml
~/.geminiasr/config.toml
~/.config/geminiasr/config.toml

Example Configuration (config.toml):

# Transcription Settings
[transcription]
duration = 900           # Segment duration in seconds
lang = "zh-TW"          # Language code
model = "gemini-2.5-flash"  # Gemini model
skip_existing = true     # Skip files with existing SRT
max_segment_retries = 3  # Max retries per segment

# Processing Settings
[processing]
max_workers = 24         # Max concurrent threads
ignore_keys_limit = true # Ignore API key limits

# Logging Settings
[logging]
debug = true            # Enable debug logging

# API Settings
[api]
google_api_keys = ["key1", "key2", "key3"]

# Advanced Settings
[advanced]
extra_prompt = "prompt.md"  # Path to prompt file
base_url = "https://generativelanguage.googleapis.com/"

📋 Usage

⌨️ Command Line Arguments

python gemini_asr.py [-h] -i INPUT [-d DURATION] [-l LANG] [-m MODEL]
                      [--start START] [--end END] [--save-raw]
                      [--skip-existing] [--debug]
                      [--max-workers MAX_WORKERS]
                      [--extra-prompt EXTRA_PROMPT]
                      [--ignore-keys-limit]

arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input video, audio file, or folder containing media files
  -d DURATION, --duration DURATION
                        Duration of each segment in seconds (default: 900)
  -l LANG, --lang LANG  Language code (default: zh-TW)
  -m MODEL, --model MODEL
                        Gemini model (default: gemini-2.5-pro-exp-03-25)
  --start START         Start time in seconds
  --end END             End time in seconds
  --save-raw            Save raw transcription results
  --skip-existing       Skip processing if SRT subtitle file already exists
  --debug               Enable DEBUG level logging
  --max-workers MAX_WORKERS
                        Maximum number of worker threads (default: cannot exceed the number of GOOGLE_API_KEYS)
  --extra-prompt EXTRA_PROMPT
                        Additional prompt or path to a file containing prompts
  --ignore-keys-limit   Ignore the API key quantity limit on maximum worker threads

💡 Usage Examples

Using TOML Configuration (Recommended):

# All settings from config.toml - just specify input
uv run gemini_asr.py -i video.mp4

# Process entire directory with TOML settings
uv run gemini_asr.py -i /path/to/media/folder

Traditional Command-Line Usage:

# Basic transcription
uv run gemini_asr.py -i video.mp4

# With custom settings
uv run gemini_asr.py -i video.mp4 -d 300 --debug

🔍 Technical Details About Audio Processing

Note

The old default model (gemini-2.5-pro) is free but has some limits. Now the default model is gemini-2.5-flash.

🧮 Token Usage: Gemini uses 32 tokens per second of audio (1,920 tokens/minute). For more details on audio processing capabilities, see Gemini Audio Documentation.
📈 Output Tokens: Gemini 2.5 Pro/Flash has a limit of 65,536 output tokens per request, which affects the maximum duration of processable audio. See Gemini Models Documentation for details.
📊 Rate Limits: The default model (gemini-2.5-pro) is free during the preview period but subject to specific limits: 250,000 TPM (tokens per minute), 5 RPM (requests per minute) and 100 RPD (requests per day). See Rate Limits Documentation for details.
💰 Pricing: Paid tier costs $1.25 per million tokens (≤200k tokens) or $2.50 per million tokens (>200k tokens). For audio longer than 2 hours, it is recommended to split the file to avoid excessive token usage and potential cost overruns. See Gemini Developer API Pricing for complete pricing information.

📝 Notes

🔑 Configuration Best Practices

TOML Configuration: Use config.toml for persistent settings. This is the recommended approach for regular use.
Command-line Override: Use CLI arguments to temporarily override TOML settings for specific runs.
Multiple API Keys: Configure multiple API keys in TOML or environment variables for better performance.

⚡ Performance Optimization

For optimal performance, consider:
- 🔑 Using multiple API keys (configured in config.toml or environment variables)
- ⏱️ Adjusting the segment duration based on content complexity—keep segments under 60 minutes to avoid output token limits and stay within TPM constraints for free tier users
- 🧵 Configuring max-workers appropriately in TOML based on your system's capabilities and the number of available API keys
- 🚫 The ignore_keys_limit setting should be used cautiously and primarily by paid tier users to avoid hitting the strict TPM limits of the free tier

🚨 Troubleshooting

⚠️ If you encounter 429 (Too Many Requests) errors, try reducing the max_workers setting in config.toml, adding more API keys, or upgrading to the paid tier
💲 Paid tier users should use the gemini-2.5-pro model and can safely enable the ignore_keys_limit option due to higher TPM allowances
🐛 Use debug = true in config.toml or --debug flag to see detailed configuration and processing information

💰 Cost Management

The script uses the Gemini API, which has usage limits and may incur costs beyond the free tier.
Monitor your token usage through the detailed configuration logging when debug mode is enabled.

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
utils		utils
.env.sample		.env.sample
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
README.zh-CN.md		README.zh-CN.md
README.zh-TW.md		README.zh-TW.md
config.example.toml		config.example.toml
gemini_asr.py		gemini_asr.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🎙️ Gemini ASR Transcription Tool

✨ Features

🔧 Installation

🛠️ Environment Setup

📦 Dependencies

🔑 API Keys Configuration

⚙️ Configuration System

📋 Usage

⌨️ Command Line Arguments

💡 Usage Examples

🔍 Technical Details About Audio Processing

📝 Notes

🔑 Configuration Best Practices

⚡ Performance Optimization

🚨 Troubleshooting

💰 Cost Management

About

Uh oh!

Uh oh!

Contributors 2

Uh oh!

Languages

cxyfer/GeminiASR

Folders and files

Latest commit

History

Repository files navigation

🎙️ Gemini ASR Transcription Tool

✨ Features

🔧 Installation

🛠️ Environment Setup

📦 Dependencies

🔑 API Keys Configuration

⚙️ Configuration System

📋 Usage

⌨️ Command Line Arguments

💡 Usage Examples

🔍 Technical Details About Audio Processing

📝 Notes

🔑 Configuration Best Practices

⚡ Performance Optimization

🚨 Troubleshooting

💰 Cost Management

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Uh oh!

Contributors 2

Uh oh!

Languages