Wikipedia Content Extractor

A TypeScript application that downloads Wikipedia pages, extracts content organized by headings (H1, H2, H3, etc.), and stores the structured content in Supabase.

Features

  • 🔍 Web Scraping: Downloads Wikipedia pages with proper headers and error handling
  • 📝 HTML Parsing: Extracts content under each heading using Cheerio
  • 🗄️ Supabase Storage: Stores structured content in PostgreSQL via Supabase
  • 🛡️ Environment Security: Uses environment variables for sensitive configuration
  • 📊 Modular Design: Separate components for downloading, parsing, and storage
  • 🎯 Command Line Interface: Easy-to-use CLI for extracting Wikipedia content

Prerequisites

  • Node.js (v16 or higher)
  • npm or yarn
  • A Supabase project with PostgreSQL database

Installation

  1. Clone or navigate to the project directory:

    cd vector-search
  2. Install dependencies:

    npm install
  3. Set up environment variables:

    cp env.example .env

    Edit .env and add your Supabase credentials:

    SUPABASE_URL=your_supabase_project_url_here
    SUPABASE_ANON_KEY=your_supabase_anon_key_here
    DB_TABLE_NAME=wikipedia_content
  4. Set up the database table:

    • Go to your Supabase dashboard
    • Navigate to the SQL Editor
    • Run the contents of supabase-setup.sql to create the required table
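The environment loading in src/config/environment.ts is not reproduced here, but a minimal sketch of how such validation might look is shown below. The function name, config shape, and defaulting behavior are illustrative assumptions, not the project's actual implementation:

```typescript
// Illustrative sketch of environment validation (names and shape are
// assumptions, not the actual contents of src/config/environment.ts).
interface AppConfig {
  supabaseUrl: string;
  supabaseAnonKey: string;
  tableName: string;
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  // Collect any required variables that are unset or empty.
  const missing = ["SUPABASE_URL", "SUPABASE_ANON_KEY"].filter((k) => !env[k]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    supabaseUrl: env.SUPABASE_URL!,
    supabaseAnonKey: env.SUPABASE_ANON_KEY!,
    // Fall back to the default table name from env.example.
    tableName: env.DB_TABLE_NAME ?? "wikipedia_content",
  };
}
```

Failing fast on missing credentials at startup produces the "Missing required environment variables" message described under Troubleshooting, rather than an opaque database error later.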

Usage

Basic Usage

Extract content from a Wikipedia page:

npm start https://en.wikipedia.org/wiki/TypeScript

Examples

# Extract TypeScript page
npm start https://en.wikipedia.org/wiki/TypeScript

# Extract Python page
npm start https://en.wikipedia.org/wiki/Python_(programming_language)

# Extract JavaScript page
npm start https://en.wikipedia.org/wiki/JavaScript

Development Mode

Run in watch mode for development:

npm run dev https://en.wikipedia.org/wiki/TypeScript

Project Structure

vector-search/
├── src/
│   ├── config/
│   │   └── environment.ts          # Environment configuration
│   ├── services/
│   │   ├── webDownloader.ts        # Web page downloader
│   │   ├── htmlParser.ts           # HTML content parser
│   │   ├── supabaseStorage.ts      # Supabase storage service
│   │   └── wikipediaExtractor.ts   # Main orchestrator
│   └── index.ts                    # CLI entry point
├── supabase-setup.sql              # Database schema
├── env.example                     # Environment variables template
└── package.json

How It Works

  1. Download: The WebDownloader fetches the Wikipedia page using axios with proper headers
  2. Parse: The HtmlParser uses Cheerio to extract content under each heading (H1-H6)
  3. Store: The SupabaseStorage saves the structured content to PostgreSQL
  4. Organize: Content is grouped by headings, with text content following each heading until the next heading of the same or higher level
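The grouping rule in step 4 can be sketched in plain TypeScript. The flat node representation below is an illustrative simplification of what the real Cheerio-based HtmlParser traverses; in a flat walk, attaching text to the most recent heading naturally stops each section at the next heading:

```typescript
// Illustrative sketch of the grouping rule: each heading collects the text
// that follows it until another heading begins. The Node/Section types are
// assumptions for this sketch, not the real HtmlParser's types.
type Node =
  | { kind: "heading"; level: number; text: string }
  | { kind: "text"; text: string };

interface Section {
  heading: string;
  level: number;
  content: string;
}

function groupByHeadings(nodes: Node[]): Section[] {
  const sections: Section[] = [];
  for (const node of nodes) {
    if (node.kind === "heading") {
      // A new heading starts a new, initially empty section.
      sections.push({ heading: node.text, level: node.level, content: "" });
    } else if (sections.length > 0) {
      // Text nodes accumulate under the most recent heading.
      const current = sections[sections.length - 1];
      current.content += (current.content ? " " : "") + node.text;
    }
  }
  return sections;
}
```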

Database Schema

The supabase-setup.sql script creates a wikipedia_content table with the following structure:

| Column        | Type      | Description                  |
|---------------|-----------|------------------------------|
| id            | UUID      | Primary key                  |
| page_url      | TEXT      | Wikipedia page URL           |
| page_title    | TEXT      | Page title                   |
| heading       | TEXT      | Heading text                 |
| heading_level | INTEGER   | Heading level (1-6)          |
| heading_id    | TEXT      | HTML id attribute (optional) |
| content       | TEXT      | Content under the heading    |
| created_at    | TIMESTAMP | Creation timestamp           |
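On the TypeScript side, a row of this table might be typed as follows. This interface is an illustrative assumption mirroring the column names above; the project itself may model rows differently:

```typescript
// Illustrative row type mirroring the wikipedia_content table columns.
// The interface name and field comments are assumptions for this sketch.
interface WikipediaContentRow {
  id: string;                // UUID primary key
  page_url: string;          // Wikipedia page URL
  page_title: string;        // Page title
  heading: string;           // Heading text
  heading_level: number;     // Heading level (1-6)
  heading_id: string | null; // HTML id attribute, if present
  content: string;           // Content under the heading
  created_at: string;        // Creation timestamp as returned by PostgreSQL
}
```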

Error Handling

The application provides detailed error reporting for:

  • Invalid Wikipedia URLs
  • Network/download failures
  • HTML parsing errors
  • Database storage issues
  • Missing environment variables

Security Features

  • Environment variables for sensitive configuration
  • Proper user agent headers to avoid blocking
  • Input validation for Wikipedia URLs
  • Error handling without exposing sensitive information
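The URL validation described above (a full URL with protocol on a wikipedia.org host) could look roughly like the sketch below. The function name and the extra /wiki/ path check are assumptions, not the project's exact rules:

```typescript
// Illustrative Wikipedia URL check: requires a protocol, a wikipedia.org
// hostname, and (as an extra assumption here) a /wiki/ article path.
function isValidWikipediaUrl(input: string): boolean {
  try {
    const url = new URL(input); // throws if the protocol is missing or malformed
    const onWikipedia =
      url.hostname === "wikipedia.org" || url.hostname.endsWith(".wikipedia.org");
    return (
      (url.protocol === "https:" || url.protocol === "http:") &&
      onWikipedia &&
      url.pathname.startsWith("/wiki/")
    );
  } catch {
    return false;
  }
}
```

Checking the hostname exactly (rather than substring-matching "wikipedia.org" anywhere in the string) avoids accepting look-alike domains.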

Troubleshooting

Common Issues

  1. "Missing required environment variables"

    • Ensure you've created a .env file with your Supabase credentials
    • Check that SUPABASE_URL and SUPABASE_ANON_KEY are set correctly
  2. "Database error"

    • Verify the wikipedia_content table exists in your Supabase database
    • Run the supabase-setup.sql script in your Supabase SQL editor
    • Check your Supabase project permissions
  3. "Invalid Wikipedia URL"

    • Ensure the URL points to a Wikipedia page (contains 'wikipedia.org')
    • Use the full URL including the protocol (https://)
  4. "Network error"

    • Check your internet connection
    • Verify the Wikipedia page is accessible
    • Some pages might be rate-limited; try again later

Getting Supabase Credentials

  1. Go to supabase.com and create a project
  2. Navigate to Settings → API
  3. Copy the "Project URL" and "anon public" key
  4. Add them to your .env file

Contributing

The project's modular design makes it easy to extend:

  • Add new parsers for different content types
  • Implement additional storage backends
  • Add content filtering or processing
  • Extend the CLI with additional options

License

This project is part of the vector-search repository.

About

A web search interface over web-page content stored in a vector database.
