A powerful Node.js tool that downloads audio from YouTube, Spotify, or podcast URLs, transcribes it using Google Gemini AI, and generates comprehensive summaries with speaker identification and tone analysis.
- 📥 Download audio from YouTube videos, Spotify episodes/podcasts, or direct MP3 URLs
- ✂️ Smart chunking - splits long audio into manageable 10-minute segments (see the sketch after this list)
- 🎯 Advanced transcription using Google Gemini AI with:
  - Speaker identification
  - Tone/emotion analysis
  - Timestamp preservation
- 🔄 Intelligent merging of transcription chunks
- ✨ Automatic extraction of:
  - Key highlights and themes
  - Comprehensive summary
  - Speaker statistics
- 📄 Multiple output formats:
  - Structured JSON
  - Formatted text transcript
  - Detailed metadata report
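For illustration, here is one way the 10-minute splitting can be done with ffmpeg's segment muxer from Node.js - a hedged sketch, not the project's actual audioChunker code (the function name and output pattern are invented):

```typescript
import { spawn } from "node:child_process";

// Split an audio file into fixed-length segments with ffmpeg's segment
// muxer. chunkDuration is in seconds (600 = the default 10 minutes).
function chunkAudio(input: string, chunkDuration = 600): Promise<void> {
  return new Promise((resolve, reject) => {
    const ffmpeg = spawn("ffmpeg", [
      "-i", input,
      "-f", "segment",                  // enable the segment muxer
      "-segment_time", String(chunkDuration),
      "-c", "copy",                     // cut without re-encoding
      "chunk_%03d.mp3",                 // chunk_000.mp3, chunk_001.mp3, ...
    ]);
    ffmpeg.on("error", reject);
    ffmpeg.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`ffmpeg exited with ${code}`)),
    );
  });
}
```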
Prerequisites:

- Node.js (v16 or higher)
- pnpm - fast, disk-space-efficient package manager

  npm install -g pnpm

- yt-dlp - for YouTube downloads

  # macOS
  brew install yt-dlp

  # Ubuntu/Debian
  sudo apt install yt-dlp

  # Windows
  # Download from https://github.com/yt-dlp/yt-dlp/releases

  Note: Spotify support is handled automatically through the integrated spotify-dl package - no additional installation required.

- ffmpeg - for audio processing

  # macOS
  brew install ffmpeg

  # Ubuntu/Debian
  sudo apt install ffmpeg

  # Windows
  # Download from https://ffmpeg.org/download.html

- Google Gemini API Key - get your API key from https://makersuite.google.com/app/apikey
# Clone the repository
git clone <repository-url>
cd audio-transcriber
# Install dependencies
pnpm install
# Copy environment file and add your API key
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY
YouTube
- Videos, playlists, channels
- Automatic audio extraction
- Metadata preservation

Spotify
- Episodes and podcasts
- Automatically finds matching content on YouTube
- Preserves original metadata and structure
- No authentication required for most content

Direct MP3 URLs
- Any publicly accessible MP3 file
- Direct download without conversion
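Since no conversion is involved, a direct MP3 download is conceptually just a streamed HTTP GET written to disk. A hedged sketch (Node 18+ for global fetch; the function name is invented, not the project's audioDownloader API):

```typescript
import { writeFile } from "node:fs/promises";

// Fetch a publicly accessible MP3 and write it to disk as-is.
async function downloadMp3(url: string, dest: string): Promise<void> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Download failed: HTTP ${res.status}`);
  // Buffering the whole file is fine for typical podcast-sized episodes.
  await writeFile(dest, Buffer.from(await res.arrayBuffer()));
}
```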
# Transcribe a YouTube video
pnpm dev "https://www.youtube.com/watch?v=VIDEO_ID"
# Transcribe a Spotify episode/podcast
pnpm dev "https://open.spotify.com/episode/EPISODE_ID"
# Transcribe a direct MP3 URL
pnpm dev "https://example.com/podcast.mp3"
# With custom output path
pnpm dev "https://www.youtube.com/watch?v=VIDEO_ID" -o ./my-transcript.json
audio-transcriber <url> [options]
Options:
-o, --output <path> Output file path (default: ./output/transcript_[timestamp].json)
-t, --temp-dir <path> Temporary directory for processing (default: ./temp)
-c, --chunk-duration <secs> Duration of each chunk in seconds (default: 600)
--concurrency <number> Number of chunks to process in parallel during transcription (default: 5)
-k, --keep-chunks Keep temporary audio chunks after processing
-s, --save-temp-files Keep all temporary files including raw audio, downsampled audio, chunks, and intermediate files
--no-text Skip generating text transcript file
--no-report Skip generating metadata report file
-h, --help Display help for command
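If it helps to see how such a CLI is typically wired, here is a hypothetical sketch using commander (assuming that library; the project's real cli.ts may differ):

```typescript
import { Command } from "commander";

// Illustrative option wiring mirroring the table above.
const program = new Command("audio-transcriber")
  .argument("<url>", "YouTube, Spotify, or direct MP3 URL")
  .option("-o, --output <path>", "Output file path")
  .option("-t, --temp-dir <path>", "Temporary directory", "./temp")
  .option("-c, --chunk-duration <secs>", "Chunk length in seconds", "600")
  .option("--concurrency <number>", "Parallel transcription chunks", "5")
  .option("-k, --keep-chunks", "Keep temporary audio chunks")
  .action((url: string, opts: Record<string, unknown>) => {
    console.log(url, opts); // hand off to the processor here
  });

program.parse();
```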
# Display dependency information
pnpm dev info
# Run a test transcription
pnpm dev test
# Build the project
pnpm build
# Clean temporary files
pnpm clean
The tool generates three types of output files: a structured JSON transcript, a formatted text transcript, and a metadata report. The JSON output looks like this:
{
"title": "Video/Audio Title",
"source_url": "https://...",
"full_transcript": [
{
"start": "00:00:00",
"end": "00:00:45",
"speaker": "Speaker 1",
"tone": "Excited",
"text": "Transcribed text..."
}
],
"highlights": ["Key point 1", "Key point 2"],
"summary": "Comprehensive summary..."
}
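In TypeScript terms, that JSON corresponds to interfaces like these (field names taken from the example above; the interface names are illustrative):

```typescript
interface TranscriptSegment {
  start: string;   // "HH:MM:SS"
  end: string;     // "HH:MM:SS"
  speaker: string; // e.g. "Speaker 1"
  tone: string;    // e.g. "Excited"
  text: string;
}

interface TranscriptOutput {
  title: string;
  source_url: string;
  full_transcript: TranscriptSegment[];
  highlights: string[];
  summary: string;
}
```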
Text transcript: a formatted, readable transcript with timestamps, speakers, and tone information.

Metadata report: a summary report containing:
- Title and source information
- Executive summary
- Key highlights
- Speaker statistics
- Tone distribution analysis
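Speaker statistics and tone distribution are straightforward reductions over the segment list. A sketch using the hypothetical TranscriptSegment type from above:

```typescript
// Count segments per speaker and per tone across a transcript.
function distributions(segments: TranscriptSegment[]) {
  const speakers: Record<string, number> = {};
  const tones: Record<string, number> = {};
  for (const s of segments) {
    speakers[s.speaker] = (speakers[s.speaker] ?? 0) + 1;
    tones[s.tone] = (tones[s.tone] ?? 0) + 1;
  }
  return { speakers, tones };
}
```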
You can run Interview Transcriber in a Docker container for easy, reproducible usage.
docker build -t interview-transcriber .
# YouTube video
docker run --rm \
-e GEMINI_API_KEY=your_actual_api_key \
-v $(pwd)/output:/output \
interview-transcriber "https://www.youtube.com/watch?v=VIDEO_ID"
# Spotify episode
docker run --rm \
-e GEMINI_API_KEY=your_actual_api_key \
-v $(pwd)/output:/output \
interview-transcriber "https://open.spotify.com/episode/EPISODE_ID"
- This will save the transcript in your local output/ directory.
- You can also specify a custom output file:
docker run --rm \
-e GEMINI_API_KEY=your_actual_api_key \
-v $(pwd)/output:/output \
interview-transcriber "https://www.youtube.com/watch?v=VIDEO_ID" /output/my-transcript.json
Instead of specifying the API key directly, you can store it in a .env file:
# .env
GEMINI_API_KEY=your_actual_api_key
Then run the container with:
# YouTube video
docker run --rm \
--env-file .env \
-v $(pwd)/output:/output \
interview-transcriber "https://www.youtube.com/watch?v=VIDEO_ID"
# Spotify episode
docker run --rm \
--env-file .env \
-v $(pwd)/output:/output \
interview-transcriber "https://open.spotify.com/episode/EPISODE_ID"
- The GEMINI_API_KEY environment variable is required for Google Gemini transcription.
- The /output directory inside the container should be mounted to a local directory to access results.
- All other CLI options are supported as in the native usage.
Running yt-dlp from Fly.io (or most cloud/DC IP ranges) can trigger YouTube anti-bot and consent checks. To improve reliability, this project supports the following env vars (set them as Fly secrets):
- YTDLP_FORCE_IPV4: set to true to force IPv4. Some IPv6 pools are scrutinized.
- YTDLP_PROXY: HTTP/SOCKS proxy (residential/backconnect recommended), e.g. http://user:pass@host:port.
- YTDLP_PLAYER_CLIENT: defaults to web (more reliable than android). Options: web, android, ios, tv.
- YTDLP_EXTRACTOR_ARGS: override extractor args (only if explicitly needed).
- YTDLP_USER_AGENT: override the user agent; defaults to a standard web browser UA.
- YTDLP_GEO_BYPASS_COUNTRY: e.g. US.
- YTDLP_RETRIES / YTDLP_FRAGMENT_RETRIES: retry counts (defaults 3 / 3).
- YTDLP_SLEEP_REQUESTS / YTDLP_MAX_SLEEP_REQUESTS: add randomized delays between requests.
- YTDLP_COOKIES_BASE64: base64-encoded Netscape cookie file. Written to /data/cookies.txt.
- YTDLP_DISABLE_COOKIE_REFRESH: set to true on Fly to disable Playwright-based cookie refresh inside the container.
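To make the mapping concrete, here is a hedged sketch of how such env vars might translate into yt-dlp flags (the flags shown are standard yt-dlp options; the project's actual mapping lives in its downloader module and may differ):

```typescript
// Build a yt-dlp argument list from the environment.
function ytdlpArgsFromEnv(env = process.env): string[] {
  const args: string[] = [];
  if (env.YTDLP_FORCE_IPV4 === "true") args.push("--force-ipv4");
  if (env.YTDLP_PROXY) args.push("--proxy", env.YTDLP_PROXY);
  if (env.YTDLP_USER_AGENT) args.push("--user-agent", env.YTDLP_USER_AGENT);
  if (env.YTDLP_GEO_BYPASS_COUNTRY)
    args.push("--geo-bypass-country", env.YTDLP_GEO_BYPASS_COUNTRY);
  args.push("--retries", env.YTDLP_RETRIES ?? "3");
  args.push("--fragment-retries", env.YTDLP_FRAGMENT_RETRIES ?? "3");
  // The player client is passed through extractor args.
  const client = env.YTDLP_PLAYER_CLIENT ?? "web";
  args.push("--extractor-args", `youtube:player_client=${client}`);
  return args;
}
```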
Option 1: Ultra-Simple Proxy-Only Approach (Recommended)

The residential proxy alone has been tested and successfully bypasses YouTube's bot detection. This is the minimal configuration that works:
# Set up residential proxy (tested and working)
fly secrets set YTDLP_PROXY=http://user:pass@host:port
That's it! No other configuration needed. yt-dlp handles format selection and quality automatically.
Option 2: Cookie-Based Approach

If you prefer to use cookies or need them for private/age-restricted content:
# Disable in-container cookie refresh and force IPv4:
fly secrets set YTDLP_DISABLE_COOKIE_REFRESH=true YTDLP_FORCE_IPV4=true
# Provide cookies from outside the container (export locally, base64, then set):
fly secrets set YTDLP_COOKIES_BASE64=$(base64 -i cookies.txt)
Option 3: Combined Approach

For maximum reliability, combine both proxy and cookies:
# Set both proxy and cookies
fly secrets set YTDLP_PROXY=http://user:pass@host:port
fly secrets set YTDLP_COOKIES_BASE64=$(base64 -i cookies.txt)
fly secrets set YTDLP_FORCE_IPV4=true YTDLP_RETRIES=3 YTDLP_FRAGMENT_RETRIES=3
- Proxy-Only Success: Testing shows that the residential proxy alone successfully bypasses YouTube's bot detection for public videos
- IP Reputation: Residential proxies provide better IP reputation than cloud/datacenter IPs
- Private Content: For private/age-restricted content, cookies from an authenticated session are still required
- Fallback: Consider offering a manual upload fallback if direct download fails repeatedly
You can also use the modules programmatically:
import { AudioProcessor } from "audio-transcriber";
const processor = new AudioProcessor();
const options = {
url: "https://www.youtube.com/watch?v=VIDEO_ID",
outputPath: "./output/my-transcript.json",
chunkDuration: 600, // 10 minutes
concurrency: 5, // Process 5 chunks in parallel
};
const result = await processor.processAudio(options);
console.log(result);
The system automatically manages YouTube cookies using Playwright to bypass bot detection. Here's how it works:
- Cookie Age Check: cookies are refreshed when they're older than 60 minutes (configurable via YTDLP_COOKIE_MAX_AGE_MINUTES)
- Playwright Automation: uses headless Chrome to visit YouTube and collect fresh cookies (see the sketch after this list)
- Logging: all cookie operations are logged with timestamps and age information
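A sketch of that refresh flow with Playwright - the staleness check plus cookie collection - written as an illustration; the real implementation may structure this differently:

```typescript
import { existsSync, statSync, writeFileSync } from "node:fs";
import { chromium } from "playwright";

const MAX_AGE_MIN = Number(process.env.YTDLP_COOKIE_MAX_AGE_MINUTES ?? 60);

// A cookie file is stale if missing or older than the configured age.
function isStale(path: string): boolean {
  if (!existsSync(path)) return true;
  const ageMinutes = (Date.now() - statSync(path).mtimeMs) / 60_000;
  return ageMinutes > MAX_AGE_MIN;
}

// Visit YouTube headlessly and persist cookies in Netscape format.
async function refreshCookies(path: string): Promise<void> {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto("https://www.youtube.com");
  const cookies = await context.cookies();
  // Netscape format: domain, include-subdomains, path, secure,
  // expiry, name, value (tab-separated; simplified here).
  const lines = cookies.map((c) =>
    [c.domain, "TRUE", c.path, c.secure ? "TRUE" : "FALSE",
     Math.floor(c.expires), c.name, c.value].join("\t"),
  );
  writeFileSync(path, "# Netscape HTTP Cookie File\n" + lines.join("\n") + "\n");
  await browser.close();
}
```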
When the system starts, you'll see logs like:
🍪 Checking cookie status...
📁 Cookie file: /data/cookies.txt (age: 45 minute(s))
✅ Cookie file /data/cookies.txt is fresh (45 minute(s))

Or if cookies need refreshing:

🍪 Found cookie file: /data/cookies.txt (age: 2 hour(s))
⚠️ Cookie file /data/cookies.txt is stale (2 hour(s)), refreshing...
🍪 Starting cookie refresh process...
🌐 Navigating to YouTube...
🔍 Performing search to trigger cookie collection...
🍪 Collected 15 cookies from YouTube
✅ Cookies written to /data/cookies.txt in Netscape format
# Cookie refresh interval (in minutes)
YTDLP_COOKIE_MAX_AGE_MINUTES=60
# Disable automatic cookie refresh (useful for Fly.io)
YTDLP_DISABLE_COOKIE_REFRESH=false
# Base64 encoded cookies (alternative to automatic refresh)
YTDLP_COOKIES_BASE64=
# Cookie file location
COOKIE_OUTPUT_PATH=/data/cookies.txt
You can test cookie functionality with:
node test-cookies.js
This will show:
- Cookie file location and age
- Environment variable configuration
- Whether cookies need refreshing
audio-transcriber/
├── src/
│   ├── modules/
│   │   ├── audioDownloader.ts    # YouTube/MP3 download logic
│   │   ├── spotifyDownloader.ts  # Spotify download logic
│   │   ├── audioChunker.ts       # Audio splitting with ffmpeg
│   │   ├── transcriber.ts        # Gemini AI transcription
│   │   ├── merger.ts             # Chunk merging logic
│   │   ├── highlights.ts         # Highlight extraction
│   │   └── outputBuilder.ts      # Output file generation
│   ├── utils/
│   │   ├── timeUtils.ts          # Timestamp utilities
│   │   └── fileUtils.ts          # File system utilities
│   ├── types/
│   │   └── index.ts              # TypeScript interfaces
│   ├── processor.ts              # Main orchestrator
│   ├── cli.ts                    # CLI interface
│   └── index.ts                  # Module exports
├── tests/                        # Test files
├── temp/                         # Temporary processing files
├── output/                       # Default output directory
└── package.json
The tool includes comprehensive error handling for:
- Network failures (with retry logic; a sketch follows this list)
- Invalid URLs
- API rate limiting
- File system errors
- Corrupted audio files
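The retry logic mentioned above typically amounts to exponential backoff around transient failures. A generic sketch (illustrative, not the tool's exact code):

```typescript
// Retry an async operation with exponential backoff: 1s, 2s, 4s, ...
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```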
- Chunk Duration: the default is 10 minutes. Shorter chunks mean more API calls but better per-chunk accuracy
- API Rate Limiting: the tool includes delays between API calls to avoid rate limiting
- Parallel Processing: chunks are processed in parallel with configurable concurrency (default: 5). Higher concurrency means faster processing but may hit API rate limits
- Concurrency Control: use the --concurrency option to adjust parallel processing; start with 5 for most use cases (see the sketch after this list)
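The concurrency cap can be pictured as a small worker pool: N workers pull chunks from a shared queue until it is empty. A minimal sketch (not the project's exact scheduler):

```typescript
// Map over items with at most `limit` promises in flight at once.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next item to claim
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claiming is safe: the event loop is single-threaded
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```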
- "GEMINI_API_KEY not found"
  - Make sure you've created a .env file with your API key
- "yt-dlp not found"
  - Install yt-dlp using the instructions above
- "ffmpeg not found"
  - Install ffmpeg using the instructions above
- Transcription fails
  - Check your Gemini API quota
  - Try reducing chunk duration
  - Ensure audio quality is sufficient
Contributions are welcome! Please feel free to submit a Pull Request.
MIT
- Google Gemini AI for transcription capabilities
- yt-dlp for YouTube download functionality
- spotify-dl by SwapnilSoni1999 for Spotify support
- ffmpeg for audio processing
- The open-source community for various dependencies