copyleftdev/certstream-arrow

CertStream Arrow

A robust Rust client that collects Certificate Transparency log entries from CertStream and stores them in Apache Arrow format for efficient analysis.

Overview

CertStream Arrow connects to the CertStream WebSocket service, which provides real-time updates from the Certificate Transparency Log network. It processes these certificate updates and stores them in the Apache Arrow Feather format, enabling high-performance analysis with tools like Pandas, DuckDB, or other Arrow-compatible systems.

Features

  • Real-time Certificate Monitoring: Connect to CertStream WebSocket for live certificate updates
  • Efficient Storage: Converts certificate data to optimized Arrow columnar format
  • Robust Error Handling: Gracefully handles connection issues and parsing errors
  • Path Traversal Protection: Securely validates file paths
  • Configurable Batch Size: Control memory usage and write frequency
  • Periodic Writes: Ensures data is regularly persisted even with low volume
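
The README does not describe how path validation is implemented. As an illustration only, one common approach is to reject any output path that contains a `..` component. The sketch below is written in Python for brevity (the project itself is Rust), and `validate_output_path` is a hypothetical name, not a function from this repository.

```python
from pathlib import PurePosixPath


def validate_output_path(raw: str) -> PurePosixPath:
    """Reject empty paths, NUL bytes, and any '..' component.

    Hypothetical helper: illustrates one way to guard against path
    traversal; the project's actual checks may differ.
    """
    if not raw or "\x00" in raw:
        raise ValueError("output path is empty or contains a NUL byte")
    path = PurePosixPath(raw)
    # pathlib keeps '..' as its own component, so this catches
    # 'a/../b' as well as '../b'.
    if ".." in path.parts:
        raise ValueError(f"path traversal component in {raw!r}")
    return path
```

Absolute paths such as `/path/to/certificates.feather` pass this check, which matches the usage examples below; a stricter policy could also require the resolved path to stay inside a chosen base directory.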

Installation

Prerequisites

  • Rust 1.54+ and Cargo

Building from Source

# Clone the repository
git clone https://github.com/copyleftdev/certstream-arrow.git
cd certstream-arrow

# Build the project
cargo build --release

# The binary will be available at target/release/certstream-arrow

Usage Examples

Basic Usage

# Start collecting certificates with default settings
./certstream-arrow

This will:

  • Connect to wss://certstream.calidog.io
  • Store certificates in certstream_data.feather in the current directory
  • Use a batch size of 1000 certificates
  • Log at INFO level

Custom Output File

# Specify a custom output file
./certstream-arrow --output /path/to/certificates.feather

Adjust Batch Size

For higher throughput or lower memory usage:

# Process in larger batches for higher throughput
./certstream-arrow --buffer-size 5000

# Process in smaller batches for lower memory usage
./certstream-arrow --buffer-size 200

Custom CertStream Endpoint

# Use an alternative CertStream provider
./certstream-arrow --url wss://alternative-certstream-provider.example.com

Debug Logging

# Enable debug logging for troubleshooting
./certstream-arrow --log-level debug

Complete Example with All Options

./certstream-arrow \
  --url wss://certstream.calidog.io \
  --output certificate_data.feather \
  --buffer-size 2000 \
  --log-level debug

Command Line Options

Option             | Default                     | Description
--url, -u          | wss://certstream.calidog.io | CertStream WebSocket URL
--output, -o       | certstream_data.feather     | Output file path
--buffer-size, -b  | 1000                        | Number of certificates to buffer before writing
--log-level, -l    | info                        | Log level (trace, debug, info, warn, error)
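
When launching the collector from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch that uses only the flags documented above; `build_command` is a hypothetical helper, and the positive-integer check on the buffer size is an assumption rather than documented behavior.

```python
from typing import List, Optional

# Log levels listed in the options table.
LOG_LEVELS = ("trace", "debug", "info", "warn", "error")


def build_command(
    binary: str = "./certstream-arrow",
    url: Optional[str] = None,
    output: Optional[str] = None,
    buffer_size: Optional[int] = None,
    log_level: Optional[str] = None,
) -> List[str]:
    """Build an argv list; options left as None fall back to the CLI defaults."""
    args = [binary]
    if url is not None:
        args += ["--url", url]
    if output is not None:
        args += ["--output", output]
    if buffer_size is not None:
        if buffer_size <= 0:  # assumption: the buffer must hold at least one item
            raise ValueError("buffer_size must be positive")
        args += ["--buffer-size", str(buffer_size)]
    if log_level is not None:
        if log_level not in LOG_LEVELS:
            raise ValueError(f"unknown log level: {log_level!r}")
        args += ["--log-level", log_level]
    return args
```

The resulting list can be passed to `subprocess.run(..., check=True)` without going through a shell, which avoids quoting problems in file paths.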

Analyzing the Data

The resulting Arrow Feather file can be easily analyzed using various tools:

Python with Pandas

import pandas as pd

# Load the data
certs = pd.read_feather("certstream_data.feather")

# Basic exploration
print(f"Total certificates: {len(certs)}")
print(certs.head())

# Common analysis
top_domains = certs.explode('domains').value_counts('domains').head(10)
print(f"Top 10 domains: {top_domains}")

cert_by_issuer = certs.groupby('issuer')['cert_index'].count().sort_values(ascending=False)
print(f"Certificates by issuer: {cert_by_issuer.head()}")

DuckDB

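-- Note: the parquet extension does not read Feather (Arrow IPC) files.
-- The queries below assume your DuckDB setup has an Arrow IPC reader available.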

-- Query directly from the file
SELECT 
  count(*) as total_certs,
  count(DISTINCT issuer) as unique_issuers
FROM 'certstream_data.feather';

-- Finding suspicious domains
SELECT 
  domain,
  count(*) as cert_count
FROM 'certstream_data.feather',
  UNNEST(domains) as d(domain)
WHERE domain LIKE '%login%' OR domain LIKE '%secure%' OR domain LIKE '%bank%'
GROUP BY domain
ORDER BY cert_count DESC
LIMIT 20;

Architecture

CertStream Arrow follows a modular design with these main components:

  1. CertStreamClient: Handles WebSocket connections and message parsing
  2. ArrowWriter: Converts certificate data to Arrow format and writes to disk
  3. Config: Manages CLI arguments and application settings
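
To make the message-parsing step concrete, here is a rough Python sketch (the project itself is Rust) of turning one CertStream JSON message into a typed record. The JSON field names (`message_type`, `data.cert_index`, `data.seen`, `data.leaf_cert.all_domains`, `data.leaf_cert.issuer.aggregated`) follow the CertStream message format as commonly documented; the `CertRecord` type and its fields are assumptions chosen to match the column names used in the analysis examples, not the project's actual structs.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CertRecord:
    """Hypothetical typed record for one certificate update."""
    cert_index: int
    seen: float
    issuer: str
    domains: List[str]


def parse_message(raw: str) -> Optional[CertRecord]:
    """Return a CertRecord, or None for heartbeats and malformed input."""
    try:
        msg = json.loads(raw)
        if msg.get("message_type") != "certificate_update":
            return None  # e.g. heartbeat messages carry no certificate
        data = msg.get("data") or {}
        leaf = data.get("leaf_cert") or {}
        issuer = (leaf.get("issuer") or {}).get("aggregated", "")
        return CertRecord(
            cert_index=int(data["cert_index"]),
            seen=float(data["seen"]),
            issuer=str(issuer),
            domains=list(leaf.get("all_domains") or []),
        )
    except (json.JSONDecodeError, KeyError, TypeError, ValueError, AttributeError):
        # Skip the message instead of crashing the stream consumer.
        return None
```

Returning `None` on bad input mirrors the "gracefully handles parsing errors" feature: a single malformed message is dropped and the stream keeps running.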

Data flows through the system as follows:

  1. Data is received from the CertStream WebSocket
  2. Messages are deserialized into strongly-typed Rust structs
  3. Certificate data is buffered in memory
  4. When the buffer fills or the periodic write timeout elapses, the buffered data is written to disk in Arrow format
  5. The process continues until interrupted
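
The buffering and flush steps above can be sketched as follows. This is a Python illustration of the size-or-timeout policy, not the project's Rust code; `BatchBuffer` is a hypothetical name, and the 30-second default interval is an assumption, since the README does not state the periodic write interval.

```python
import time
from typing import Any, Callable, List


class BatchBuffer:
    """Collect items and hand them to `flush` when full or stale."""

    def __init__(
        self,
        flush: Callable[[List[Any]], None],
        max_items: int = 1000,          # matches the documented default batch size
        max_age_seconds: float = 30.0,  # assumed interval, not from the README
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_items = max_items
        self._max_age = max_age_seconds
        self._clock = clock
        self._items: List[Any] = []
        self._last_flush = clock()

    def push(self, item: Any) -> None:
        self._items.append(item)
        self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Flush if the buffer is full, or non-empty and older than max_age."""
        full = len(self._items) >= self._max_items
        stale = bool(self._items) and (self._clock() - self._last_flush) >= self._max_age
        if full or stale:
            self.flush()
            return True
        return False

    def flush(self) -> None:
        if self._items:
            batch, self._items = self._items, []
            self._flush(batch)
        self._last_flush = self._clock()
```

Calling `maybe_flush()` on a timer, as well as after every `push`, is what keeps low-volume periods from leaving data in memory indefinitely. A final `flush()` on shutdown writes whatever remains in the buffer.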

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
