A high-performance web content tokenization and vectorization service built in Rust.
Features:

- Web Crawling: Crawl websites and extract textual content with depth control
- Vectorization: Generate vector embeddings of web content using enhanced TF-IDF models
- Storage: Store vectors in PostgreSQL with pgvector extension and efficient caching
- Search: Find similar content based on vector similarity
- REST API: Comprehensive API for crawling, vectorizing, searching, and site mapping
Requirements:

- Rust (latest stable version)
- PostgreSQL with pgvector extension
- Docker (for containerized deployment)
Key features:

- Intelligent Caching: Caching system for embeddings and content to improve performance
- Depth-Limited Crawling: Control crawl depth for more targeted content collection
- Site Mapping: Generate comprehensive sitemaps of crawled websites
- Advanced Text Processing: Enhanced TF-IDF model with stopword filtering
- Docker Integration: Complete containerization for easy deployment
The easiest way to run Web2Vec is using Docker:
```bash
# Clone the repository
git clone https://github.com/copyleftdev/web2vec.git
cd web2vec

# Start the services
docker-compose up -d
```
This starts both the PostgreSQL database (with the pgvector extension) and the Web2Vec service.
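If you need to customize the stack, the compose file looks broadly like the sketch below. This is illustrative only: the image, ports, and credentials shown here are assumptions, and the repository's docker-compose.yml is authoritative.

```yaml
# Illustrative sketch -- the repository's docker-compose.yml is the source of truth.
services:
  db:
    image: pgvector/pgvector:pg16   # PostgreSQL image with pgvector preinstalled
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: web2vec
    volumes:
      - ./sql/init.sql:/docker-entrypoint-initdb.d/init.sql  # apply schema on first start
  web2vec:
    build: .
    depends_on:
      - db
    ports:
      - "8080:8080"                 # REST API port used in the curl examples below
    environment:
      DATABASE_URL: postgres://postgres:postgres@db:5432/web2vec
```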
To install and run manually:

- Install PostgreSQL and the pgvector extension:

```bash
# Ubuntu/Debian
sudo apt install postgresql postgresql-contrib

# Install pgvector (https://github.com/pgvector/pgvector)
git clone https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install
```
- Create the database and enable pgvector:

```sql
CREATE DATABASE web2vec;
\c web2vec
CREATE EXTENSION vector;
```
- Set up the tables (see `sql/init.sql` for the schema; an illustrative sketch follows these steps)
- Install Rust (if not already installed):

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
- Configure the database connection in `.env`:

```env
DATABASE_URL="postgres://postgres:postgres@localhost:5432/web2vec"
```
- Build and run the application:

```bash
cargo build --release
./target/release/web2vec
```
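As referenced above, `sql/init.sql` defines the schema. The sketch below shows the general shape such a schema takes with pgvector; the table and column names are illustrative assumptions (the `vector(100)` dimension is taken from the default configuration below), while the `<=>` cosine-distance operator is standard pgvector syntax:

```sql
-- Illustrative sketch only; sql/init.sql in the repository is authoritative.
CREATE TABLE IF NOT EXISTS documents (
    id        BIGSERIAL PRIMARY KEY,
    url       TEXT NOT NULL UNIQUE,
    content   TEXT NOT NULL,
    embedding vector(100)  -- dimension matches vectorizer.vector_dimension
);

-- Nearest-neighbour search with pgvector's cosine-distance operator (<=>):
SELECT url, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;
```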
With the service running, you can exercise the REST API with curl.

Crawl and vectorize one or more URLs:

```bash
curl -X POST http://localhost:8080/vectorize \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "max_depth": 2}'
```
curl -X GET "http://localhost:8080/search?query=your%20search%20query&limit=10"
curl -X GET "http://localhost:8080/sitemap?url=https://example.com&max_depth=3"
Retrieve service statistics:

```bash
curl -X GET http://localhost:8080/stats
```
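The same calls can of course be made from Rust. Below is a minimal sketch of querying the search endpoint with the reqwest crate; the `SearchHit` struct is hypothetical, since the actual response schema is defined by the service:

```rust
// Assumed dependencies: reqwest (with "json" feature), tokio ("full"), serde ("derive").
use serde::Deserialize;

/// Hypothetical shape of one search result; check the service's actual response.
#[derive(Debug, Deserialize)]
struct SearchHit {
    url: String,
    similarity: f32,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // GET /search?query=...&limit=10, mirroring the curl example above.
    let hits: Vec<SearchHit> = reqwest::Client::new()
        .get("http://localhost:8080/search")
        .query(&[("query", "your search query"), ("limit", "10")])
        .send()
        .await?
        .json()
        .await?;

    for hit in &hits {
        println!("{}\t{:.3}", hit.url, hit.similarity);
    }
    Ok(())
}
```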
Configuration is handled via a YAML file (`config.yml`). Key options include:
```yaml
crawler:
  max_depth: 3
  max_concurrent_requests: 10
  timeout_seconds: 30
  user_agent: "Web2Vec Crawler v1.0"

vectorizer:
  model: "enhanced-tfidf"
  vector_dimension: 100
  use_cache: true
  cache_ttl_seconds: 3600
  use_stopwords: true

database:
  pool_size: 5

logging:
  level: "info"
```
Web2Vec consists of several core modules:
- Crawler: Handles web page fetching and content extraction
- Vectorizer: Transforms text into vector embeddings using TF-IDF techniques
- Database: Manages persistence and retrieval of vectors and content
- API: Provides HTTP interfaces to the service
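To make the Vectorizer and Search modules concrete, the sketch below shows the core ideas in miniature: turning text into a term-frequency vector and comparing two vectors by cosine similarity. It is a toy version of the technique, not Web2Vec's actual implementation, which additionally applies IDF weighting, stopword filtering, and caching as described above.

```rust
use std::collections::HashMap;

/// Toy term-frequency vector. Web2Vec's enhanced TF-IDF also applies IDF
/// weighting and stopword filtering, omitted here for brevity.
fn term_frequencies(text: &str) -> HashMap<String, f64> {
    let tokens: Vec<String> = text
        .split_whitespace()
        .map(|t| t.to_lowercase())
        .collect();
    let mut counts: HashMap<String, f64> = HashMap::new();
    for token in &tokens {
        *counts.entry(token.clone()).or_insert(0.0) += 1.0;
    }
    let total = tokens.len() as f64;
    counts.values_mut().for_each(|v| *v /= total);
    counts
}

/// Cosine similarity between two sparse vectors: dot(a, b) / (|a| * |b|).
fn cosine_similarity(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a.iter().filter_map(|(k, v)| b.get(k).map(|w| v * w)).sum();
    let norm = |m: &HashMap<String, f64>| m.values().map(|v| v * v).sum::<f64>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 { 0.0 } else { dot / denom }
}

fn main() {
    let a = term_frequencies("rust web crawler extracts textual content");
    let b = term_frequencies("a crawler for web content written in rust");
    println!("similarity: {:.3}", cosine_similarity(&a, &b));
}
```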
Web2Vec is licensed under the DBAD-X ("Don't Be A Dick eXtreme") License. See the LICENSE file for details.