A high-performance web content tokenization and vectorization service built in Rust.
Features:

- Web Crawling: Crawl websites and extract textual content with depth control
- Vectorization: Generate vector embeddings of web content using enhanced TF-IDF models
- Storage: Store vectors in PostgreSQL with pgvector extension and efficient caching
- Search: Find similar content based on vector similarity
- REST API: Comprehensive API for crawling, vectorizing, searching, and site mapping
Requirements:

- Rust (latest stable version)
- PostgreSQL with pgvector extension
- Docker (for containerized deployment)
Key features:

- Intelligent Caching: Caching system for embeddings and content to improve performance
- Depth-Limited Crawling: Control crawl depth for more targeted content collection
- Site Mapping: Generate comprehensive sitemaps of crawled websites
- Advanced Text Processing: Enhanced TF-IDF model with stopword filtering
- Docker Integration: Complete containerization for easy deployment
The easiest way to run Web2Vec is using Docker:
```bash
# Clone the repository
git clone https://github.com/copyleftdev/web2vec.git
cd web2vec

# Start the services
docker-compose up -d
```
This starts both the PostgreSQL database (with the pgvector extension) and the Web2Vec service.
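If you need to customize the stack, the compose file looks broadly like the sketch below. This is illustrative only: the image, ports, and credentials shown here are assumptions, and the repository's docker-compose.yml is authoritative.

```yaml
# Illustrative sketch -- the repository's docker-compose.yml is the source of truth.
services:
  db:
    image: pgvector/pgvector:pg16   # PostgreSQL image with pgvector preinstalled
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: web2vec
    volumes:
      - ./sql/init.sql:/docker-entrypoint-initdb.d/init.sql  # apply schema on first start
  web2vec:
    build: .
    depends_on:
      - db
    ports:
      - "8080:8080"                 # REST API port used in the curl examples below
    environment:
      DATABASE_URL: postgres://postgres:postgres@db:5432/web2vec
```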
To install and run manually:

- Install PostgreSQL and the pgvector extension:

```bash
# Ubuntu/Debian
sudo apt install postgresql postgresql-contrib

# Install pgvector (https://github.com/pgvector/pgvector)
git clone https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install
```
- Create the database and enable pgvector:

```sql
CREATE DATABASE web2vec;
\c web2vec
CREATE EXTENSION vector;
```
- Set up the tables (see `sql/init.sql` for the schema; an illustrative sketch follows these steps)
- Install Rust (if not already installed):

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
- Configure the database connection in `.env`:

```env
DATABASE_URL="postgres://postgres:postgres@localhost:5432/web2vec"
```
- Build and run the application:

```bash
cargo build --release
./target/release/web2vec
```
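As referenced above, `sql/init.sql` defines the schema. The sketch below shows the general shape such a schema takes with pgvector; the table and column names are illustrative assumptions (the `vector(100)` dimension is taken from the default configuration below), while the `<=>` cosine-distance operator is standard pgvector syntax:

```sql
-- Illustrative sketch only; sql/init.sql in the repository is authoritative.
CREATE TABLE IF NOT EXISTS documents (
    id        BIGSERIAL PRIMARY KEY,
    url       TEXT NOT NULL UNIQUE,
    content   TEXT NOT NULL,
    embedding vector(100)  -- dimension matches vectorizer.vector_dimension
);

-- Nearest-neighbour search with pgvector's cosine-distance operator (<=>):
SELECT url, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;
```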
With the service running, you can exercise the REST API with curl.

Crawl and vectorize one or more URLs:

```bash
curl -X POST http://localhost:8080/vectorize \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "max_depth": 2}'
```
curl -X GET "http://localhost:8080/search?query=your%20search%20query&limit=10"
curl -X GET "http://localhost:8080/sitemap?url=https://example.com&max_depth=3"
Retrieve service statistics:

```bash
curl -X GET http://localhost:8080/stats
```
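The same calls can of course be made from Rust. Below is a minimal sketch of querying the search endpoint with the reqwest crate; the `SearchHit` struct is hypothetical, since the actual response schema is defined by the service:

```rust
// Assumed dependencies: reqwest (with "json" feature), tokio ("full"), serde ("derive").
use serde::Deserialize;

/// Hypothetical shape of one search result; check the service's actual response.
#[derive(Debug, Deserialize)]
struct SearchHit {
    url: String,
    similarity: f32,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // GET /search?query=...&limit=10, mirroring the curl example above.
    let hits: Vec<SearchHit> = reqwest::Client::new()
        .get("http://localhost:8080/search")
        .query(&[("query", "your search query"), ("limit", "10")])
        .send()
        .await?
        .json()
        .await?;

    for hit in &hits {
        println!("{}\t{:.3}", hit.url, hit.similarity);
    }
    Ok(())
}
```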
Configuration is handled via a YAML file (`config.yml`). Key options include:
```yaml
crawler:
  max_depth: 3
  max_concurrent_requests: 10
  timeout_seconds: 30
  user_agent: "Web2Vec Crawler v1.0"

vectorizer:
  model: "enhanced-tfidf"
  vector_dimension: 100
  use_cache: true
  cache_ttl_seconds: 3600
  use_stopwords: true

database:
  pool_size: 5

logging:
  level: "info"
```
Web2Vec consists of several core modules:
- Crawler: Handles web page fetching and content extraction
- Vectorizer: Transforms text into vector embeddings using TF-IDF techniques
- Database: Manages persistence and retrieval of vectors and content
- API: Provides HTTP interfaces to the service
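To make the Vectorizer and Search modules concrete, the sketch below shows the core ideas in miniature: turning text into a term-frequency vector and comparing two vectors by cosine similarity. It is a toy version of the technique, not Web2Vec's actual implementation, which additionally applies IDF weighting, stopword filtering, and caching as described above.

```rust
use std::collections::HashMap;

/// Toy term-frequency vector. Web2Vec's enhanced TF-IDF also applies IDF
/// weighting and stopword filtering, omitted here for brevity.
fn term_frequencies(text: &str) -> HashMap<String, f64> {
    let tokens: Vec<String> = text
        .split_whitespace()
        .map(|t| t.to_lowercase())
        .collect();
    let mut counts: HashMap<String, f64> = HashMap::new();
    for token in &tokens {
        *counts.entry(token.clone()).or_insert(0.0) += 1.0;
    }
    let total = tokens.len() as f64;
    counts.values_mut().for_each(|v| *v /= total);
    counts
}

/// Cosine similarity between two sparse vectors: dot(a, b) / (|a| * |b|).
fn cosine_similarity(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a.iter().filter_map(|(k, v)| b.get(k).map(|w| v * w)).sum();
    let norm = |m: &HashMap<String, f64>| m.values().map(|v| v * v).sum::<f64>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 { 0.0 } else { dot / denom }
}

fn main() {
    let a = term_frequencies("rust web crawler extracts textual content");
    let b = term_frequencies("a crawler for web content written in rust");
    println!("similarity: {:.3}", cosine_similarity(&a, &b));
}
```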
Web2Vec is licensed under the DBAD-X ("Don't Be A Dick eXtreme") License. See the LICENSE file for details.