CertStream Arrow

A Rust client that collects Certificate Transparency log entries from CertStream and stores them in Apache Arrow format for efficient analysis.

Overview

CertStream Arrow connects to the CertStream WebSocket service, which provides real-time updates from the Certificate Transparency Log network. It processes these certificate updates and stores them in the Apache Arrow Feather format, enabling high-performance analysis with tools like Pandas, DuckDB, or other Arrow-compatible systems.

Features

  • Real-time Certificate Monitoring: Connects to the CertStream WebSocket for live certificate updates
  • Efficient Storage: Converts certificate data to optimized Arrow columnar format
  • Robust Error Handling: Gracefully handles connection issues and parsing errors
  • Path Traversal Protection: Securely validates file paths
  • Configurable Batch Size: Control memory usage and write frequency
  • Periodic Writes: Ensures data is regularly persisted even with low volume
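The path check mentioned above could work along the following lines. This is a minimal Python sketch, not the tool's Rust implementation: the function name validate_output_path and the rule "the output must stay inside a base directory" are assumptions made for illustration, and the real validation may accept or reject different paths.

```python
from pathlib import Path


def validate_output_path(raw: str, base_dir: Path) -> Path:
    """Resolve `raw` against `base_dir` and reject paths that escape it.

    Hypothetical rule for illustration only; the actual tool may differ.
    """
    base = base_dir.resolve()
    # Joining with an absolute `raw` discards `base`, so absolute paths
    # outside the base directory are rejected by the check below as well.
    candidate = (base / raw).resolve()
    # Path.is_relative_to requires Python 3.9 or newer.
    if not candidate.is_relative_to(base):
        raise ValueError(f"output path escapes {base}: {raw}")
    return candidate
```

Resolving before comparing matters: a purely textual prefix check would accept data/../../elsewhere, whereas the resolved path makes the escape visible.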

Installation

Prerequisites

  • Rust 1.54+ and Cargo

Building from Source

# Clone the repository
git clone https://github.com/yourusername/certstream-arrow.git
cd certstream-arrow

# Build the project
cargo build --release

# The binary will be available at target/release/certstream-arrow

Usage Examples

Basic Usage

# Start collecting certificates with default settings
# (run from target/release/, or adjust the path to the binary)
./certstream-arrow

This will:

  • Connect to wss://certstream.calidog.io
  • Store certificates in certstream_data.feather in the current directory
  • Use a batch size of 1000 certificates
  • Log at INFO level

Custom Output File

# Specify a custom output file
./certstream-arrow --output /path/to/certificates.feather

Adjust Batch Size

For higher throughput or lower memory usage:

# Process in larger batches for higher throughput
./certstream-arrow --buffer-size 5000

# Process in smaller batches for lower memory usage
./certstream-arrow --buffer-size 200

Custom CertStream Endpoint

# Use an alternative CertStream provider
./certstream-arrow --url wss://alternative-certstream-provider.example.com

Debug Logging

# Enable debug logging for troubleshooting
./certstream-arrow --log-level debug

Complete Example with All Options

./certstream-arrow \
  --url wss://certstream.calidog.io \
  --output certificate_data.feather \
  --buffer-size 2000 \
  --log-level debug

Command Line Options

Option            | Default                     | Description
------------------|-----------------------------|------------------------------------------------
--url, -u         | wss://certstream.calidog.io | CertStream WebSocket URL
--output, -o      | certstream_data.feather     | Output file path
--buffer-size, -b | 1000                        | Number of certificates to buffer before writing
--log-level, -l   | info                        | Log level (trace, debug, info, warn, error)
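When the collector is launched from a script, the options above can be assembled programmatically. The sketch below is a hypothetical Python helper (build_command is not part of the project) that uses only the flags documented in the table and omits any flag left unset, so the tool's own defaults apply.

```python
from typing import List, Optional

# Log levels as listed in the options table.
LOG_LEVELS = ("trace", "debug", "info", "warn", "error")


def build_command(
    binary: str = "./certstream-arrow",
    url: Optional[str] = None,
    output: Optional[str] = None,
    buffer_size: Optional[int] = None,
    log_level: Optional[str] = None,
) -> List[str]:
    """Build an argument list for the collector (hypothetical helper)."""
    cmd = [binary]
    if url is not None:
        cmd += ["--url", url]
    if output is not None:
        cmd += ["--output", output]
    if buffer_size is not None:
        if buffer_size <= 0:
            raise ValueError("buffer_size must be positive")
        cmd += ["--buffer-size", str(buffer_size)]
    if log_level is not None:
        if log_level not in LOG_LEVELS:
            raise ValueError(f"unknown log level: {log_level}")
        cmd += ["--log-level", log_level]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True); building a list rather than a shell string avoids quoting problems in file paths.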

Analyzing the Data

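The resulting Feather file can be analyzed with any Arrow-compatible tool. The examples below assume the file contains the columns domains (a list of names per certificate), issuer, and cert_index.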

Python with Pandas

import pandas as pd

# Load the data
certs = pd.read_feather("certstream_data.feather")

# Basic exploration
print(f"Total certificates: {len(certs)}")
print(certs.head())

# Common analysis
top_domains = certs.explode('domains').value_counts('domains').head(10)
print(f"Top 10 domains: {top_domains}")

cert_by_issuer = certs.groupby('issuer')['cert_index'].count().sort_values(ascending=False)
print(f"Certificates by issuer: {cert_by_issuer.head()}")

DuckDB

-- Note: the parquet extension does not read Feather (Arrow IPC) files, so it is not loaded here.
-- These queries assume the Feather data is visible to DuckDB as a table or view named certs,
-- for example a pyarrow table read with pyarrow.feather.read_table and queried through DuckDB's Python API.

-- Overall counts
SELECT
  count(*) AS total_certs,
  count(DISTINCT issuer) AS unique_issuers
FROM certs;

-- Finding suspicious domains
SELECT
  domain,
  count(*) AS cert_count
FROM certs,
  UNNEST(domains) AS d(domain)
WHERE domain LIKE '%login%' OR domain LIKE '%secure%' OR domain LIKE '%bank%'
GROUP BY domain
ORDER BY cert_count DESC
LIMIT 20;

Architecture

CertStream Arrow follows a modular design with these main components:

  1. CertStreamClient: Handles WebSocket connections and message parsing
  2. ArrowWriter: Converts certificate data to Arrow format and writes to disk
  3. Config: Manages CLI arguments and application settings
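To make the message-parsing role of CertStreamClient more concrete, here is a rough Python sketch of turning one WebSocket message into a typed record. The project itself is written in Rust, so this is only an illustration: the CertRecord type and parse_message function are hypothetical, and the JSON field names (message_type, data, leaf_cert, all_domains, cert_index, seen, and an issuer object with a CN key) are assumptions about the CertStream message layout that should be checked against the live feed.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CertRecord:
    """Hypothetical flattened record; field names mirror the analysis examples."""
    cert_index: int
    seen: float
    issuer: str
    domains: List[str]


def parse_message(raw: str) -> Optional[CertRecord]:
    """Return a CertRecord for certificate updates, or None for anything else."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed input is skipped rather than raised
    if not isinstance(msg, dict) or msg.get("message_type") != "certificate_update":
        return None  # e.g. heartbeat messages
    try:
        data = msg["data"]
        leaf = data["leaf_cert"]
        return CertRecord(
            cert_index=int(data["cert_index"]),
            seen=float(data["seen"]),
            issuer=str(leaf["issuer"].get("CN") or ""),
            domains=list(leaf["all_domains"]),
        )
    except (KeyError, TypeError, ValueError, AttributeError):
        return None  # missing or unexpectedly shaped fields
```

Returning None instead of raising keeps one bad message from stopping a long-running collector, which matches the graceful error handling described in the feature list.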

Data flows through the system as follows:

  1. Data is received from the CertStream WebSocket
  2. Messages are deserialized into strongly-typed Rust structs
  3. Certificate data is buffered in memory
  4. When the buffer fills or the write timeout elapses, the buffered data is converted to Arrow format and written to disk
  5. The process continues until interrupted
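The buffering in steps 3 and 4 can be sketched as a small size-or-age buffer. This Python version is an illustration under stated assumptions, not the Rust implementation: the BatchBuffer class, its method names, and the 60-second default age are all hypothetical (the README does not state the actual timeout), and the flush callback stands in for the Arrow write.

```python
import time
from typing import Any, Callable, List


class BatchBuffer:
    """Collect items and hand them to `flush` when full or too old (sketch)."""

    def __init__(
        self,
        flush: Callable[[List[Any]], None],
        max_items: int = 1000,         # mirrors the documented --buffer-size default
        max_age_secs: float = 60.0,    # hypothetical timeout, not from the README
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_items = max_items
        self._max_age = max_age_secs
        self._clock = clock
        self._items: List[Any] = []
        self._last_flush = clock()

    def push(self, item: Any) -> None:
        self._items.append(item)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        """Flush if the buffer is full, or non-empty and older than max_age."""
        full = len(self._items) >= self._max_items
        stale = bool(self._items) and self._clock() - self._last_flush >= self._max_age
        if full or stale:
            self.flush()

    def flush(self) -> None:
        if self._items:
            batch, self._items = self._items, []
            self._flush(batch)
        self._last_flush = self._clock()
```

In a real loop, maybe_flush would also be called on a timer so that a quiet stream still gets written, and flush would be called once more on shutdown so the final partial batch is not lost.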

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.