A Rust client that collects Certificate Transparency log updates from CertStream and stores them in Apache Arrow format for efficient analysis.
CertStream Arrow connects to the CertStream WebSocket service, which provides real-time updates from the Certificate Transparency Log network. It processes these certificate updates and stores them in the Apache Arrow Feather format, enabling high-performance analysis with tools like Pandas, DuckDB, or other Arrow-compatible systems.
- Real-time Certificate Monitoring: Connect to CertStream WebSocket for live certificate updates
- Efficient Storage: Converts certificate data to optimized Arrow columnar format
- Robust Error Handling: Gracefully handles connection issues and parsing errors
- Path Traversal Protection: Validates file paths to block directory traversal
- Configurable Batch Size: Control memory usage and write frequency
- Periodic Writes: Flushes buffered data on a timer so it is persisted even when certificate volume is low
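
As an illustration of the path traversal protection listed above, here is a minimal sketch using only the standard library. The function name `validate_output_path` and its exact rules (reject empty paths, any `..` component, and paths without a file name) are assumptions made for this example; the project's actual validation may differ, for instance in how it treats absolute paths.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical helper, not the project's actual API.
/// Rejects empty paths, paths containing `..`, and paths with no file name.
fn validate_output_path(raw: &str) -> Result<PathBuf, String> {
    if raw.is_empty() {
        return Err("output path is empty".to_string());
    }
    let path = Path::new(raw);
    // Any `..` component could escape the intended directory.
    if path.components().any(|c| matches!(c, Component::ParentDir)) {
        return Err(format!("path traversal rejected: {}", raw));
    }
    // A bare root such as `/` has no file name to write to.
    if path.file_name().is_none() {
        return Err(format!("no file name in path: {}", raw));
    }
    Ok(path.to_path_buf())
}

fn main() {
    for candidate in ["certstream_data.feather", "data/certs.feather", "../../etc/passwd"] {
        match validate_output_path(candidate) {
            Ok(p) => println!("accepted: {}", p.display()),
            Err(e) => println!("rejected: {}", e),
        }
    }
}
```

Checking path components rather than searching the raw string for `..` avoids false positives on file names that merely contain two dots, such as `certs..feather`.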
- Rust 1.54+ and Cargo
# Clone the repository
git clone https://github.com/yourusername/certstream-arrow.git
cd certstream-arrow
# Build the project
cargo build --release
# The binary will be available at target/release/certstream-arrow
# Start collecting certificates with default settings
./certstream-arrow
This will:
- Connect to wss://certstream.calidog.io
- Store certificates in certstream_data.feather in the current directory
- Use a batch size of 1000 certificates
- Log at INFO level
# Specify a custom output file
./certstream-arrow --output /path/to/certificates.feather
For higher throughput or lower memory usage:
# Process in larger batches for higher throughput
./certstream-arrow --buffer-size 5000
# Process in smaller batches for lower memory usage
./certstream-arrow --buffer-size 200
# Use an alternative CertStream provider
./certstream-arrow --url wss://alternative-certstream-provider.example.com
# Enable debug logging for troubleshooting
./certstream-arrow --log-level debug
./certstream-arrow \
--url wss://certstream.calidog.io \
--output certificate_data.feather \
--buffer-size 2000 \
--log-level debug
| Option | Default | Description |
|---|---|---|
| --url, -u | wss://certstream.calidog.io | CertStream WebSocket URL |
| --output, -o | certstream_data.feather | Output file path |
| --buffer-size, -b | 1000 | Number of certificates to buffer before writing |
| --log-level, -l | info | Log level (trace, debug, info, warn, error) |
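
To make the defaults in the table concrete, the sketch below models them as a struct with a hand-rolled parser built on the standard library only. The `Config` field names and the `parse_args` function are assumptions for illustration; the real project may well use an argument-parsing crate and a different layout.

```rust
/// Hypothetical settings struct mirroring the options table.
#[derive(Debug, Clone, PartialEq)]
struct Config {
    url: String,
    output: String,
    buffer_size: usize,
    log_level: String,
}

impl Default for Config {
    fn default() -> Self {
        Config {
            url: "wss://certstream.calidog.io".to_string(),
            output: "certstream_data.feather".to_string(),
            buffer_size: 1000,
            log_level: "info".to_string(),
        }
    }
}

/// Parses `flag value` pairs, starting from the documented defaults.
fn parse_args<I: IntoIterator<Item = String>>(args: I) -> Result<Config, String> {
    const LEVELS: [&str; 5] = ["trace", "debug", "info", "warn", "error"];
    let mut config = Config::default();
    let mut iter = args.into_iter();
    while let Some(flag) = iter.next() {
        let value = iter
            .next()
            .ok_or_else(|| format!("missing value for {}", flag))?;
        match flag.as_str() {
            "--url" | "-u" => config.url = value,
            "--output" | "-o" => config.output = value,
            "--buffer-size" | "-b" => {
                config.buffer_size = value
                    .parse::<usize>()
                    .map_err(|_| format!("invalid buffer size: {}", value))?;
            }
            "--log-level" | "-l" => {
                if !LEVELS.contains(&value.as_str()) {
                    return Err(format!("invalid log level: {}", value));
                }
                config.log_level = value;
            }
            other => return Err(format!("unknown option: {}", other)),
        }
    }
    Ok(config)
}

fn main() {
    // Skip the program name, then parse whatever flags were passed.
    match parse_args(std::env::args().skip(1)) {
        Ok(config) => println!("{:?}", config),
        Err(e) => eprintln!("error: {}", e),
    }
}
```

Starting from `Config::default()` means any option left off the command line keeps the value shown in the Default column.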
The resulting Arrow Feather file can be analyzed with Arrow-compatible tools such as Pandas and DuckDB:
import pandas as pd
# Load the data
certs = pd.read_feather("certstream_data.feather")
# Basic exploration
print(f"Total certificates: {len(certs)}")
print(certs.head())
# Common analysis
top_domains = certs.explode('domains').value_counts('domains').head(10)
print(f"Top 10 domains: {top_domains}")
cert_by_issuer = certs.groupby('issuer')['cert_index'].count().sort_values(ascending=False)
print(f"Certificates by issuer: {cert_by_issuer.head()}")
-- Load the extension
LOAD 'parquet';
-- Query directly from the file
SELECT
count(*) as total_certs,
count(DISTINCT issuer) as unique_issuers
FROM 'certstream_data.feather';
-- Finding suspicious domains
SELECT
domain,
count(*) as cert_count
FROM 'certstream_data.feather',
UNNEST(domains) as d(domain)
WHERE domain LIKE '%login%' OR domain LIKE '%secure%' OR domain LIKE '%bank%'
GROUP BY domain
ORDER BY cert_count DESC
LIMIT 20;
CertStream Arrow follows a modular design with these main components:
- CertStreamClient: Handles WebSocket connections and message parsing
- ArrowWriter: Converts certificate data to Arrow format and writes to disk
- Config: Manages CLI arguments and application settings
Data flows through the system as follows:
- Data is received from the CertStream WebSocket
- Messages are deserialized into strongly-typed Rust structs
- Certificate data is buffered in memory
- When the buffer fills or the periodic write timeout elapses, the buffered data is written to disk in Arrow format
- The process continues until interrupted
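
The buffering steps above can be sketched with a standard-library channel standing in for the WebSocket feed. Everything named here (`CertRecord`, `run_buffer_loop`, the `write_batch` callback) is hypothetical and only illustrates the flush-on-size-or-timeout idea; the actual Arrow write is replaced by a closure.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Placeholder for a parsed certificate update (assumed shape).
#[derive(Debug, Clone)]
struct CertRecord {
    cert_index: u64,
    domains: Vec<String>,
}

/// Drains `rx`, calling `write_batch` when `buffer_size` records have
/// accumulated, when `flush_interval` has passed since the last flush,
/// or when the sender goes away. Returns the number of batches written.
fn run_buffer_loop<F>(
    rx: Receiver<CertRecord>,
    buffer_size: usize,
    flush_interval: Duration,
    mut write_batch: F,
) -> usize
where
    F: FnMut(&[CertRecord]),
{
    let mut buffer: Vec<CertRecord> = Vec::with_capacity(buffer_size);
    let mut last_flush = Instant::now();
    let mut batches = 0;
    loop {
        // Wait no longer than the time left until the next periodic flush.
        let remaining = flush_interval.saturating_sub(last_flush.elapsed());
        let disconnected = match rx.recv_timeout(remaining) {
            Ok(record) => {
                buffer.push(record);
                false
            }
            Err(RecvTimeoutError::Timeout) => false,
            Err(RecvTimeoutError::Disconnected) => true,
        };
        let due = last_flush.elapsed() >= flush_interval;
        if buffer.len() >= buffer_size || due || disconnected {
            if !buffer.is_empty() {
                write_batch(&buffer);
                buffer.clear();
                batches += 1;
            }
            last_flush = Instant::now();
        }
        if disconnected {
            return batches;
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for i in 0..5u64 {
        let record = CertRecord {
            cert_index: i,
            domains: vec![format!("host{}.example.com", i)],
        };
        tx.send(record).unwrap();
    }
    drop(tx); // closing the channel stands in for an interrupt
    let batches = run_buffer_loop(rx, 2, Duration::from_secs(30), |batch| {
        println!(
            "writing {} records starting at index {} ({} domains in first)",
            batch.len(),
            batch[0].cert_index,
            batch[0].domains.len()
        );
    });
    println!("{} batches written", batches);
}
```

Bounding each receive by the time left until the next flush is what lets a quiet stream still trigger a write, and flushing on disconnect keeps the final partial batch from being lost.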
Contributions are welcome! Please feel free to submit a Pull Request.