copyleftdev/web2vec

Web2Vec: Web Content Vectorization Service

A highly performant web content tokenization and vectorization service built with Rust.

Features

  • Web Crawling: Crawl websites and extract textual content with depth control
  • Vectorization: Generate vector embeddings of web content using enhanced TF-IDF models
  • Storage: Store vectors in PostgreSQL with pgvector extension and efficient caching
  • Search: Find similar content based on vector similarity
  • REST API: Comprehensive API for crawling, vectorizing, searching, and site mapping
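
The README does not say which similarity metric the search feature uses. As a minimal sketch, assuming cosine similarity (which pgvector exposes as a distance through its `<=>` operator), ranking stored vectors against a query could look like this in plain Rust. The names `cosine_similarity` and `top_k` are hypothetical and not taken from the Web2Vec source.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns None when the lengths differ, the input is empty,
/// or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Rank stored (id, vector) pairs against a query, best match first,
/// and keep at most `k` results. Vectors that cannot be compared are skipped.
pub fn top_k<'a>(
    query: &[f32],
    docs: &'a [(String, Vec<f32>)],
    k: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&'a str, f32)> = docs
        .iter()
        .filter_map(|(id, v)| cosine_similarity(query, v).map(|s| (id.as_str(), s)))
        .collect();
    // Sort by descending similarity.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

In a deployment backed by pgvector, this ranking would normally be pushed into the database query rather than done in application code; the in-memory version only illustrates the idea.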

Requirements

  • Rust (latest stable version)
  • PostgreSQL with pgvector extension
  • Docker (for containerized deployment)

Enhanced Features

  • Intelligent Caching: Caching system for embeddings and content to improve performance
  • Depth-Limited Crawling: Control crawl depth for more targeted content collection
  • Site Mapping: Generate comprehensive sitemaps of crawled websites
  • Advanced Text Processing: Enhanced TF-IDF model with stopword filtering
  • Docker Integration: Complete containerization for easy deployment
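
The README does not spell out what "enhanced TF-IDF" means in Web2Vec. The following is a minimal sketch of plain TF-IDF with stopword filtering, using only the standard library. The function names `tokenize` and `tf_idf`, the unsmoothed `ln(N / df)` weighting, and the sparse map output are assumptions made for illustration, not the project's implementation.

```rust
use std::collections::{HashMap, HashSet};

/// Lowercase the text, split on non-alphanumeric characters, and drop stopwords.
pub fn tokenize(text: &str, stopwords: &HashSet<&str>) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .filter(|t| !stopwords.contains(t.as_str()))
        .collect()
}

/// Compute one sparse TF-IDF map per document.
/// tf = term count / document length; idf = ln(N / df) (assumed, unsmoothed).
pub fn tf_idf(docs: &[&str], stopwords: &HashSet<&str>) -> Vec<HashMap<String, f64>> {
    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d, stopwords)).collect();

    // Document frequency: number of documents that contain each term.
    let mut df: HashMap<&str, usize> = HashMap::new();
    for tokens in &tokenized {
        let unique: HashSet<&str> = tokens.iter().map(String::as_str).collect();
        for term in unique {
            *df.entry(term).or_insert(0) += 1;
        }
    }

    let n = docs.len() as f64;
    tokenized
        .iter()
        .map(|tokens| {
            // Raw term counts for this document.
            let mut counts: HashMap<&str, usize> = HashMap::new();
            for t in tokens {
                *counts.entry(t.as_str()).or_insert(0) += 1;
            }
            let len = tokens.len() as f64;
            counts
                .into_iter()
                .map(|(term, c)| {
                    let idf = (n / df[term] as f64).ln();
                    (term.to_string(), (c as f64 / len) * idf)
                })
                .collect::<HashMap<String, f64>>()
        })
        .collect()
}
```

With this weighting, a term that appears in every document scores zero. Projecting the sparse maps onto a fixed-size vector (the configuration mentions a `vector_dimension`) is a separate step that this sketch does not cover.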

Setup Instructions

Using Docker (Recommended)

The easiest way to run Web2Vec is using Docker:

# Clone the repository
git clone https://github.com/copyleftdev/web2vec.git
cd web2vec

# Start the services
docker-compose up -d

This will start both the PostgreSQL database with pgvector extension and the Web2Vec service.

Manual Setup

Database Setup

  1. Install PostgreSQL and pgvector extension:
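# Ubuntu/Debian
sudo apt install postgresql postgresql-contrib
# Building pgvector from source also needs a C toolchain and the PostgreSQL server headers
sudo apt install build-essential postgresql-server-dev-all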

# Install pgvector (https://github.com/pgvector/pgvector)
git clone https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install
  2. Create the database and enable pgvector:
CREATE DATABASE web2vec;
\c web2vec
CREATE EXTENSION vector;
  3. Set up the tables (see sql/init.sql for schema)

Application Setup

  1. Install Rust (if not already installed):
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
  2. Configure the database connection in .env:
DATABASE_URL="postgres://postgres:postgres@localhost:5432/web2vec"
  3. Build and run the application:
cargo build --release
./target/release/web2vec

API Usage

Vectorize URLs

curl -X POST http://localhost:8080/vectorize \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "max_depth": 2}'

Search Similar Content

curl -X GET "http://localhost:8080/search?query=your%20search%20query&limit=10"
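
The search text has to be percent-encoded before it goes into the query string, which is why the example uses `%20` for spaces. A small sketch of building that URL from Rust with only the standard library follows. The helper names `encode_query_value` and `search_url` are hypothetical; the endpoint path and the `query` and `limit` parameters are the ones shown in the curl call.

```rust
/// Percent-encode a string for use as a URL query value.
/// RFC 3986 unreserved characters are kept; every other byte becomes %XX.
pub fn encode_query_value(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Build the search URL for a given base address, query text, and result limit.
pub fn search_url(base: &str, query: &str, limit: usize) -> String {
    format!(
        "{}/search?query={}&limit={}",
        base,
        encode_query_value(query),
        limit
    )
}
```

Calling `search_url("http://localhost:8080", "your search query", 10)` yields the same URL as the curl example.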

Generate Site Map

curl -X GET "http://localhost:8080/sitemap?url=https://example.com&max_depth=3"
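
To illustrate what a `max_depth` limit means for site mapping, here is a sketch of a breadth-first traversal over an in-memory link graph. It makes no network requests. The function name `crawl_depths` is hypothetical, and treating the start URL as depth 0 with an inclusive limit is an assumption; the README does not define how Web2Vec counts depth.

```rust
use std::collections::{BTreeMap, HashMap, HashSet, VecDeque};

/// Breadth-first traversal over a link graph, stopping at `max_depth`.
/// Returns each reachable URL with the depth at which it was first seen.
pub fn crawl_depths(
    links: &HashMap<&str, Vec<&str>>,
    start: &str,
    max_depth: usize,
) -> BTreeMap<String, usize> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    let mut out = BTreeMap::new();

    seen.insert(start.to_string());
    queue.push_back((start.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        out.insert(url.clone(), depth);
        if depth == max_depth {
            continue; // do not follow links past the depth limit
        }
        // Pages without an entry in the graph simply have no outgoing links.
        for &next in links.get(url.as_str()).into_iter().flatten() {
            // `insert` returns false for URLs already queued or visited.
            if seen.insert(next.to_string()) {
                queue.push_back((next.to_string(), depth + 1));
            }
        }
    }
    out
}
```

Because the traversal is breadth-first, the recorded depth is the shortest link distance from the start page, and the `seen` set keeps cyclic links from being visited twice.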

Server Stats

curl -X GET http://localhost:8080/stats

Configuration

Configuration is handled via a YAML file (config.yml). Key options include:

crawler:
  max_depth: 3
  max_concurrent_requests: 10
  timeout_seconds: 30
  user_agent: "Web2Vec Crawler v1.0"

vectorizer:
  model: "enhanced-tfidf"
  vector_dimension: 100
  use_cache: true
  cache_ttl_seconds: 3600
  use_stopwords: true

database:
  pool_size: 5

logging:
  level: "info"
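
The `use_cache` and `cache_ttl_seconds` options suggest a time-to-live cache for embeddings. How Web2Vec implements its cache is not described here, so the following is only a sketch of the general idea. The type `TtlCache` and its methods are hypothetical, and the current time is passed in explicitly so the behaviour can be checked without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// A small time-to-live cache for embeddings, keyed by URL.
pub struct TtlCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, Vec<f32>)>,
}

impl TtlCache {
    /// `ttl_seconds` plays the role of `cache_ttl_seconds` in the config.
    pub fn new(ttl_seconds: u64) -> Self {
        Self {
            ttl: Duration::from_secs(ttl_seconds),
            entries: HashMap::new(),
        }
    }

    /// Store a vector, stamped with the caller-supplied time.
    pub fn insert_at(&mut self, key: &str, value: Vec<f32>, now: Instant) {
        self.entries.insert(key.to_string(), (now, value));
    }

    /// Return the vector if present and younger than the TTL; evict it otherwise.
    pub fn get_at(&mut self, key: &str, now: Instant) -> Option<&Vec<f32>> {
        let expired = match self.entries.get(key) {
            Some((stored, _)) => now.duration_since(*stored) >= self.ttl,
            None => return None,
        };
        if expired {
            self.entries.remove(key);
            return None;
        }
        self.entries.get(key).map(|(_, v)| v)
    }
}
```

In real use the caller would pass `Instant::now()` for `now`. Entries are evicted lazily on lookup, so a production cache would likely also need a size bound or periodic cleanup.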

Architecture

Web2Vec consists of several core modules:

  1. Crawler: Handles web page fetching and content extraction
  2. Vectorizer: Transforms text into vector embeddings using TF-IDF techniques
  3. Database: Manages persistence and retrieval of vectors and content
  4. API: Provides HTTP interfaces to the service

License

Web2Vec is licensed under the DBAD-X ("Don't Be A Dick eXtreme") License. See the LICENSE file for details.
