A TypeScript application that downloads Wikipedia pages, extracts content organized by headings (H1, H2, H3, etc.), and stores the structured content in Supabase.
- 🔍 Web Scraping: Downloads Wikipedia pages with proper headers and error handling
- 📝 HTML Parsing: Extracts content under each heading using Cheerio
- 🗄️ Supabase Storage: Stores structured content in PostgreSQL via Supabase
- 🛡️ Environment Security: Uses environment variables for sensitive configuration
- 📊 Modular Design: Separate components for downloading, parsing, and storage
- 🎯 Command Line Interface: Easy-to-use CLI for extracting Wikipedia content
Prerequisites:

- Node.js (v16 or higher)
- npm or yarn
- A Supabase project with PostgreSQL database
Installation:

- Clone or navigate to the project directory:

  ```bash
  cd vector-search
  ```

- Install dependencies:

  ```bash
  npm install
  ```

- Set up environment variables:

  ```bash
  cp env.example .env
  ```

  Edit `.env` and add your Supabase credentials (a configuration-loading sketch follows these steps):

  ```
  SUPABASE_URL=your_supabase_project_url_here
  SUPABASE_ANON_KEY=your_supabase_anon_key_here
  DB_TABLE_NAME=wikipedia_content
  ```

- Set up the database table:
  - Go to your Supabase dashboard
  - Navigate to the SQL Editor
  - Run the contents of `supabase-setup.sql` to create the required table
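The `src/config/environment.ts` module referenced below is not reproduced in this README; the following is a minimal sketch, assuming the `dotenv` package is used (an assumption), of how those three variables might be loaded and validated:

```typescript
// Hypothetical sketch of src/config/environment.ts -- not the actual project file.
import * as dotenv from 'dotenv';

dotenv.config(); // load variables from .env into process.env

export interface AppConfig {
  supabaseUrl: string;
  supabaseAnonKey: string;
  tableName: string;
}

export function loadConfig(): AppConfig {
  const { SUPABASE_URL, SUPABASE_ANON_KEY, DB_TABLE_NAME } = process.env;

  // Fail fast with a clear message when a required variable is missing
  if (!SUPABASE_URL || !SUPABASE_ANON_KEY) {
    throw new Error(
      'Missing required environment variables: SUPABASE_URL and/or SUPABASE_ANON_KEY'
    );
  }

  return {
    supabaseUrl: SUPABASE_URL,
    supabaseAnonKey: SUPABASE_ANON_KEY,
    tableName: DB_TABLE_NAME ?? 'wikipedia_content',
  };
}
```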
Extract content from a Wikipedia page:

```bash
npm start https://en.wikipedia.org/wiki/TypeScript
```

More examples:

```bash
# Extract TypeScript page
npm start https://en.wikipedia.org/wiki/TypeScript

# Extract Python page
npm start https://en.wikipedia.org/wiki/Python_(programming_language)

# Extract JavaScript page
npm start https://en.wikipedia.org/wiki/JavaScript
```

Run in watch mode for development:

```bash
npm run dev https://en.wikipedia.org/wiki/TypeScript
```

Project structure:

```
vector-search/
├── src/
│   ├── config/
│   │   └── environment.ts        # Environment configuration
│   ├── services/
│   │   ├── webDownloader.ts      # Web page downloader
│   │   ├── htmlParser.ts         # HTML content parser
│   │   ├── supabaseStorage.ts    # Supabase storage service
│   │   └── wikipediaExtractor.ts # Main orchestrator
│   └── index.ts                  # CLI entry point
├── supabase-setup.sql            # Database schema
├── env.example                   # Environment variables template
└── package.json
```
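The `src/index.ts` entry point listed above is not shown in this README; as a rough, hypothetical sketch (the `WikipediaExtractor` class name and `extract()` method are assumptions, not the project's confirmed API), it could look like this:

```typescript
// Hypothetical sketch of src/index.ts -- names and signatures are assumptions.
import { WikipediaExtractor } from './services/wikipediaExtractor';

async function main(): Promise<void> {
  const url = process.argv[2]; // first argument after `npm start`

  if (!url) {
    console.error('Usage: npm start <wikipedia-url>');
    process.exit(1);
  }

  // Orchestrates download -> parse -> store
  const extractor = new WikipediaExtractor();
  await extractor.extract(url);
  console.log(`Finished extracting ${url}`);
}

main().catch((err) => {
  console.error('Extraction failed:', err instanceof Error ? err.message : err);
  process.exit(1);
});
```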
How it works:

- Download: The `WebDownloader` fetches the Wikipedia page using axios with proper headers
- Parse: The `HtmlParser` uses Cheerio to extract content under each heading (H1-H6); a simplified parsing sketch follows this list
- Store: The `SupabaseStorage` saves the structured content to PostgreSQL
- Organize: Content is grouped by headings, with text content following each heading until the next heading of the same or higher level
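To make the Parse step concrete, here is a simplified sketch of heading-based extraction with Cheerio. It is not the project's actual `HtmlParser`, and real Wikipedia markup usually needs extra cleanup (edit links, navigation boxes, reference markers):

```typescript
import * as cheerio from 'cheerio';

interface Section {
  heading: string;
  headingLevel: number;   // 1-6
  headingId?: string;     // HTML id attribute, if present
  content: string;
}

// Collects the text between each heading and the next heading of the
// same or higher level (e.g. an H3 section ends at the next H3, H2, or H1).
export function extractSections(html: string): Section[] {
  const $ = cheerio.load(html);
  const sections: Section[] = [];

  $('h1, h2, h3, h4, h5, h6').each((_, el) => {
    const level = Number(el.tagName.slice(1)); // "h2" -> 2
    const parts: string[] = [];

    // Walk forward through siblings until a heading of the same or higher level
    let node = $(el).next();
    while (node.length > 0) {
      const tag = node.prop('tagName')?.toLowerCase() ?? '';
      if (/^h[1-6]$/.test(tag) && Number(tag.slice(1)) <= level) break;
      parts.push(node.text().trim());
      node = node.next();
    }

    sections.push({
      heading: $(el).text().trim(),
      headingLevel: level,
      headingId: $(el).attr('id'),
      content: parts.filter(Boolean).join('\n'),
    });
  });

  return sections;
}
```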
The application creates a `wikipedia_content` table with the following structure:
| Column | Type | Description |
|---|---|---|
| `id` | UUID | Primary key |
| `page_url` | TEXT | Wikipedia page URL |
| `page_title` | TEXT | Page title |
| `heading` | TEXT | Heading text |
| `heading_level` | INTEGER | Heading level (1-6) |
| `heading_id` | TEXT | HTML id attribute (optional) |
| `content` | TEXT | Content under the heading |
| `created_at` | TIMESTAMP | Creation timestamp |
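Modelled in TypeScript, a row of that table and a batch insert through `@supabase/supabase-js` might look like the sketch below; the actual `SupabaseStorage` service may be structured differently:

```typescript
import { createClient } from '@supabase/supabase-js';

// Shape of one wikipedia_content row, following the table above.
// id and created_at are generated by the database.
interface WikipediaContentRow {
  page_url: string;
  page_title: string;
  heading: string;
  heading_level: number;
  heading_id?: string | null;
  content: string;
}

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY!
);

async function storeRows(rows: WikipediaContentRow[]): Promise<void> {
  const { error } = await supabase
    .from(process.env.DB_TABLE_NAME ?? 'wikipedia_content')
    .insert(rows);

  if (error) {
    throw new Error(`Database error: ${error.message}`);
  }
}
```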
The application provides detailed error reporting for:
- Invalid Wikipedia URLs
- Network/download failures
- HTML parsing errors
- Database storage issues
- Missing environment variables
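As an illustration only (not the project's actual code), the error categories above could be distinguished roughly like this:

```typescript
import axios from 'axios';

// Runs a task and reports failures by category, e.g. runSafely(() => extractor.extract(url))
async function runSafely(task: () => Promise<void>): Promise<void> {
  try {
    await task();
  } catch (err) {
    if (axios.isAxiosError(err)) {
      // Network/download failures (timeouts, HTTP errors, DNS issues)
      console.error(`Network error: ${err.message}`);
    } else if (err instanceof Error) {
      // Validation, parsing, storage, or configuration problems
      console.error(err.message);
    } else {
      console.error('Unknown error', err);
    }
    process.exitCode = 1;
  }
}
```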
Security considerations (the URL validation and request headers are sketched after this list):

- Environment variables for sensitive configuration
- Proper user agent headers to avoid blocking
- Input validation for Wikipedia URLs
- Error handling without exposing sensitive information
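A minimal sketch of the URL validation and request headers mentioned above, assuming axios is used for the download (the exact User-Agent string is an assumption):

```typescript
import axios from 'axios';

// Accept only https URLs on a wikipedia.org host
export function isWikipediaUrl(raw: string): boolean {
  try {
    const url = new URL(raw);
    return url.protocol === 'https:' && url.hostname.endsWith('wikipedia.org');
  } catch {
    return false; // not a parseable URL at all
  }
}

export async function downloadPage(url: string): Promise<string> {
  if (!isWikipediaUrl(url)) {
    throw new Error(`Invalid Wikipedia URL: ${url}`);
  }

  const response = await axios.get<string>(url, {
    headers: {
      // A descriptive User-Agent; the real project may use a different value.
      'User-Agent': 'vector-search-wikipedia-extractor (educational project)',
    },
    timeout: 15_000,
  });

  return response.data;
}
```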
Troubleshooting common errors:

- "Missing required environment variables"
  - Ensure you've created a `.env` file with your Supabase credentials
  - Check that `SUPABASE_URL` and `SUPABASE_ANON_KEY` are set correctly

- "Database error"
  - Verify the `wikipedia_content` table exists in your Supabase database
  - Run the `supabase-setup.sql` script in your Supabase SQL editor
  - Check your Supabase project permissions

- "Invalid Wikipedia URL"
  - Ensure the URL points to a Wikipedia page (contains 'wikipedia.org')
  - Use the full URL including the protocol (https://)

- "Network error"
  - Check your internet connection
  - Verify the Wikipedia page is accessible
  - Some pages might be rate-limited; try again later
To get your Supabase credentials:

- Go to supabase.com and create a project
- Navigate to Settings → API
- Copy the "Project URL" and "anon public" key
- Add them to your `.env` file
The modular design can be easily extended (a storage-backend sketch follows this list):
- Add new parsers for different content types
- Implement additional storage backends
- Add content filtering or processing
- Extend the CLI with additional options
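For instance, "additional storage backends" could sit behind a small interface; the sketch below is purely a suggestion, not existing project code:

```typescript
// Hypothetical extension point: the existing Supabase storage and any new
// backend (file system, another database, a vector store) could implement it.
import { promises as fs } from 'fs';

interface ExtractedSection {
  heading: string;
  headingLevel: number;
  content: string;
}

interface StorageBackend {
  save(pageUrl: string, pageTitle: string, sections: ExtractedSection[]): Promise<void>;
}

// A trivial alternative backend that writes the structured content to a JSON file
class JsonFileStorage implements StorageBackend {
  constructor(private readonly path: string) {}

  async save(pageUrl: string, pageTitle: string, sections: ExtractedSection[]): Promise<void> {
    const payload = { pageUrl, pageTitle, sections, savedAt: new Date().toISOString() };
    await fs.writeFile(this.path, JSON.stringify(payload, null, 2), 'utf8');
  }
}
```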
This project is part of the vector-search repository.