Wikipedia Content Extractor

A TypeScript application that downloads Wikipedia pages, extracts content organized by headings (H1, H2, H3, etc.), and stores the structured content in Supabase.

Features

  • 🔍 Web Scraping: Downloads Wikipedia pages with proper headers and error handling
  • 📝 HTML Parsing: Extracts content under each heading using Cheerio
  • 🗄️ Supabase Storage: Stores structured content in PostgreSQL via Supabase
  • 🛡️ Environment Security: Uses environment variables for sensitive configuration
  • 📊 Modular Design: Separate components for downloading, parsing, and storage
  • 🎯 Command Line Interface: Easy-to-use CLI for extracting Wikipedia content

Prerequisites

  • Node.js (v16 or higher)
  • npm or yarn
  • A Supabase project with PostgreSQL database

Installation

  1. Clone or navigate to the project directory:

    cd vector-search
  2. Install dependencies:

    npm install
  3. Set up environment variables:

    cp env.example .env

    Edit .env and add your Supabase credentials:

    SUPABASE_URL=your_supabase_project_url_here
    SUPABASE_ANON_KEY=your_supabase_anon_key_here
    DB_TABLE_NAME=wikipedia_content
  4. Set up the database table:

    • Go to your Supabase dashboard
    • Navigate to the SQL Editor
    • Run the contents of supabase-setup.sql to create the required table
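The environment loading in src/config/environment.ts is not reproduced here, but a minimal sketch of how such validation might look is shown below. The function name, config shape, and defaulting behavior are illustrative assumptions, not the project's actual implementation:

```typescript
// Illustrative sketch of environment validation (names and shape are
// assumptions, not the actual contents of src/config/environment.ts).
interface AppConfig {
  supabaseUrl: string;
  supabaseAnonKey: string;
  tableName: string;
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  // Collect any required variables that are unset or empty.
  const missing = ["SUPABASE_URL", "SUPABASE_ANON_KEY"].filter((k) => !env[k]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    supabaseUrl: env.SUPABASE_URL!,
    supabaseAnonKey: env.SUPABASE_ANON_KEY!,
    // Fall back to the default table name from env.example.
    tableName: env.DB_TABLE_NAME ?? "wikipedia_content",
  };
}
```

Failing fast on missing credentials at startup produces the "Missing required environment variables" message described under Troubleshooting, rather than an opaque database error later.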

Usage

Basic Usage

Extract content from a Wikipedia page:

npm start https://en.wikipedia.org/wiki/TypeScript

Examples

# Extract TypeScript page
npm start https://en.wikipedia.org/wiki/TypeScript

# Extract Python page
npm start https://en.wikipedia.org/wiki/Python_(programming_language)

# Extract JavaScript page
npm start https://en.wikipedia.org/wiki/JavaScript

Development Mode

Run in watch mode for development:

npm run dev https://en.wikipedia.org/wiki/TypeScript

Project Structure

vector-search/
├── src/
│   ├── config/
│   │   └── environment.ts          # Environment configuration
│   ├── services/
│   │   ├── webDownloader.ts        # Web page downloader
│   │   ├── htmlParser.ts           # HTML content parser
│   │   ├── supabaseStorage.ts      # Supabase storage service
│   │   └── wikipediaExtractor.ts   # Main orchestrator
│   └── index.ts                    # CLI entry point
├── supabase-setup.sql              # Database schema
├── env.example                     # Environment variables template
└── package.json

How It Works

  1. Download: The WebDownloader fetches the Wikipedia page using axios with proper headers
  2. Parse: The HtmlParser uses Cheerio to extract content under each heading (H1-H6)
  3. Store: The SupabaseStorage saves the structured content to PostgreSQL
  4. Organize: Content is grouped by headings, with text content following each heading until the next heading of the same or higher level
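The grouping rule in step 4 can be sketched in plain TypeScript. The flat node representation below is an illustrative simplification of what the real Cheerio-based HtmlParser traverses; in a flat walk, attaching text to the most recent heading naturally stops each section at the next heading:

```typescript
// Illustrative sketch of the grouping rule: each heading collects the text
// that follows it until another heading begins. The Node/Section types are
// assumptions for this sketch, not the real HtmlParser's types.
type Node =
  | { kind: "heading"; level: number; text: string }
  | { kind: "text"; text: string };

interface Section {
  heading: string;
  level: number;
  content: string;
}

function groupByHeadings(nodes: Node[]): Section[] {
  const sections: Section[] = [];
  for (const node of nodes) {
    if (node.kind === "heading") {
      // A new heading starts a new, initially empty section.
      sections.push({ heading: node.text, level: node.level, content: "" });
    } else if (sections.length > 0) {
      // Text nodes accumulate under the most recent heading.
      const current = sections[sections.length - 1];
      current.content += (current.content ? " " : "") + node.text;
    }
  }
  return sections;
}
```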

Database Schema

The supabase-setup.sql script creates a wikipedia_content table with the following structure:

| Column        | Type      | Description                  |
|---------------|-----------|------------------------------|
| id            | UUID      | Primary key                  |
| page_url      | TEXT      | Wikipedia page URL           |
| page_title    | TEXT      | Page title                   |
| heading       | TEXT      | Heading text                 |
| heading_level | INTEGER   | Heading level (1-6)          |
| heading_id    | TEXT      | HTML id attribute (optional) |
| content       | TEXT      | Content under the heading    |
| created_at    | TIMESTAMP | Creation timestamp           |
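On the TypeScript side, a row of this table might be typed as follows. This interface is an illustrative assumption mirroring the column names above; the project itself may model rows differently:

```typescript
// Illustrative row type mirroring the wikipedia_content table columns.
// The interface name and field comments are assumptions for this sketch.
interface WikipediaContentRow {
  id: string;                // UUID primary key
  page_url: string;          // Wikipedia page URL
  page_title: string;        // Page title
  heading: string;           // Heading text
  heading_level: number;     // Heading level (1-6)
  heading_id: string | null; // HTML id attribute, if present
  content: string;           // Content under the heading
  created_at: string;        // Creation timestamp as returned by PostgreSQL
}
```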

Error Handling

The application provides detailed error reporting for:

  • Invalid Wikipedia URLs
  • Network/download failures
  • HTML parsing errors
  • Database storage issues
  • Missing environment variables

Security Features

  • Environment variables for sensitive configuration
  • Proper user agent headers to avoid blocking
  • Input validation for Wikipedia URLs
  • Error handling without exposing sensitive information
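The URL validation described above (a full URL with protocol on a wikipedia.org host) could look roughly like the sketch below. The function name and the extra /wiki/ path check are assumptions, not the project's exact rules:

```typescript
// Illustrative Wikipedia URL check: requires a protocol, a wikipedia.org
// hostname, and (as an extra assumption here) a /wiki/ article path.
function isValidWikipediaUrl(input: string): boolean {
  try {
    const url = new URL(input); // throws if the protocol is missing or malformed
    const onWikipedia =
      url.hostname === "wikipedia.org" || url.hostname.endsWith(".wikipedia.org");
    return (
      (url.protocol === "https:" || url.protocol === "http:") &&
      onWikipedia &&
      url.pathname.startsWith("/wiki/")
    );
  } catch {
    return false;
  }
}
```

Checking the hostname exactly (rather than substring-matching "wikipedia.org" anywhere in the string) avoids accepting look-alike domains.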

Troubleshooting

Common Issues

  1. "Missing required environment variables"

    • Ensure you've created a .env file with your Supabase credentials
    • Check that SUPABASE_URL and SUPABASE_ANON_KEY are set correctly
  2. "Database error"

    • Verify the wikipedia_content table exists in your Supabase database
    • Run the supabase-setup.sql script in your Supabase SQL editor
    • Check your Supabase project permissions
  3. "Invalid Wikipedia URL"

    • Ensure the URL points to a Wikipedia page (contains 'wikipedia.org')
    • Use the full URL including the protocol (https://)
  4. "Network error"

    • Check your internet connection
    • Verify the Wikipedia page is accessible
    • Some pages might be rate-limited; try again later

Getting Supabase Credentials

  1. Go to supabase.com and create a project
  2. Navigate to Settings → API
  3. Copy the "Project URL" and "anon public" key
  4. Add them to your .env file

Contributing

The project's modular design makes it easy to extend:

  • Add new parsers for different content types
  • Implement additional storage backends
  • Add content filtering or processing
  • Extend the CLI with additional options

License

This project is part of the vector-search repository.

About

A web search interface over web-page content stored in a vector database.
