Version: 1.0
Purpose: Efficient storage of multiple files in single chunks with fast random access retrieval
Target Use Case: Streaming files from object storage (S3) into archive chunks for distributed storage (Jackal)
The Chunk Archive Format (CAF) is designed to package multiple files into single archive files (chunks) while maintaining the ability to quickly extract individual files without reading the entire archive. This format optimizes for:
- Storage Efficiency: Reducing the number of individual files stored on distributed systems
- Fast Retrieval: Random access to individual files using byte ranges
- Streaming Support: Files can be written to the archive as they are received
- Size Management: Configurable maximum chunk size (default: ~30GB, max: 32GB)
A CAF file consists of three main sections:
┌─────────────────────────────────────┐
│          File Data Section          │
│  ┌─────────────┐  ┌─────────────┐   │
│  │   File 1    │  │   File 2    │   │
│  │    Data     │  │    Data     │   │
│  └─────────────┘  └─────────────┘   │
│         ... more files ...          │
├─────────────────────────────────────┤
│         File Index Section          │
│         (JSON-encoded map)          │
├─────────────────────────────────────┤
│          Footer (4 bytes)           │
│  ┌───────────────────────────────┐  │
│  │          Index Size           │  │
│  │           (4 bytes)           │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
Files are stored sequentially in their original binary form. No compression or encoding is applied at this level - files are stored as-is to maintain integrity and simplify streaming operations.
Properties:
- Files are concatenated directly without padding or separators
- Original file content is preserved byte-for-byte
- Files are written in the order they are received
A JSON-encoded map that provides metadata for fast file location and retrieval.
Structure:
{
  "format_version": "1.0",
  "files": {
    "path/to/file1.jpg": {
      "start_byte": 0,
      "end_byte": 1048576
    },
    "documents/report.pdf": {
      "start_byte": 1048576,
      "end_byte": 2097152
    }
  }
}
Field Descriptions:
- format_version: CAF format version for future compatibility
- files: Map of filename to file metadata
- start_byte: Byte offset where file data begins (0-indexed)
- end_byte: Byte offset where file data ends (exclusive), so a file's length is end_byte - start_byte and each file's start_byte equals the previous file's end_byte
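Because files are concatenated without padding and end_byte is described as exclusive, a parsed index can be sanity-checked for gaps and overlaps. The following is a minimal sketch under those assumptions; `validate_index` and its `data_size` parameter are hypothetical names, not part of the format.

```python
def validate_index(index: dict, data_size: int) -> None:
    """Check that index entries tile the file data section exactly.

    Assumes end_byte is exclusive and files are stored back to back,
    as described in this spec. `data_size` is the length in bytes of
    the file data section (everything before the JSON index).
    """
    entries = sorted(
        index["files"].items(),
        key=lambda item: (item[1]["start_byte"], item[1]["end_byte"]),
    )
    expected_start = 0
    for name, entry in entries:
        start, end = entry["start_byte"], entry["end_byte"]
        if end < start:
            raise ValueError(f"{name}: end_byte {end} precedes start_byte {start}")
        if start != expected_start:
            raise ValueError(f"{name}: expected start_byte {expected_start}, got {start}")
        expected_start = end
    if expected_start != data_size:
        raise ValueError(
            f"index covers {expected_start} bytes, data section has {data_size}"
        )
```

A check like this is optional; it is mainly useful when debugging a writer or verifying a chunk after upload.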
The footer enables fast parsing by providing the index size.
Structure:
Bytes 0-3: Index Size (uint32, little-endian)
Details:
- Index Size: Size of the JSON index in bytes (excluding footer)
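As an illustration of the footer layout, the 4-byte little-endian uint32 can be packed and unpacked with Python's `struct` module. This is a sketch only; the helper names `encode_footer` and `decode_footer` are assumptions made for the example.

```python
import struct

FOOTER_FORMAT = "<I"  # uint32, little-endian
FOOTER_SIZE = struct.calcsize(FOOTER_FORMAT)  # 4 bytes


def encode_footer(index_size: int) -> bytes:
    """Pack the JSON index size into the 4-byte footer."""
    if not 0 <= index_size <= 0xFFFFFFFF:
        raise ValueError("index size does not fit in a uint32")
    return struct.pack(FOOTER_FORMAT, index_size)


def decode_footer(footer: bytes) -> int:
    """Unpack the JSON index size from the 4-byte footer."""
    if len(footer) != FOOTER_SIZE:
        raise ValueError("footer must be exactly 4 bytes")
    (index_size,) = struct.unpack(FOOTER_FORMAT, footer)
    return index_size
```

For example, an index of 258 bytes (0x0102) is stored as the bytes 02 01 00 00.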
Writing Process:
- Initialize: Open output stream/file for writing
- Stream Files: For each input file:
  - Record current byte position as start_byte
  - Stream file data directly to output
  - Record final byte position as end_byte
  - Add entry to in-memory file index
  - Check if chunk size limit (~30GB) would be exceeded by next file
- Finalize: When chunk is complete:
  - Serialize file index to JSON
  - Write index to output stream
  - Calculate index size
  - Write footer (index size)
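The writing steps above can be sketched as a single function. This is a minimal illustration, not a reference implementation: `write_chunk`, its parameters, and the `(name, stream, size)` input tuples are assumptions made for the example, and end_byte is treated as exclusive.

```python
import io
import json
import struct
from typing import BinaryIO, Sequence, Tuple

TARGET_CHUNK_SIZE = 30 * 1024**3  # ~30GB target chunk size from this spec


def write_chunk(
    out: BinaryIO,
    files: Sequence[Tuple[str, BinaryIO, int]],
    target_size: int = TARGET_CHUNK_SIZE,
    buffer_size: int = 1024 * 1024,
) -> int:
    """Write files into one chunk and return how many were written.

    `files` holds (name, readable stream, size in bytes) tuples. Files
    that would push the chunk past `target_size` are left for the next
    chunk; the caller can continue with files[written:].
    """
    index = {"format_version": "1.0", "files": {}}
    position = 0  # tracked manually so non-seekable outputs also work
    written = 0
    for name, stream, size in files:
        # Always accept the first file, even if it alone exceeds the target.
        if position > 0 and position + size > target_size:
            break
        start_byte = position
        while True:
            block = stream.read(buffer_size)
            if not block:
                break
            out.write(block)
            position += len(block)
        index["files"][name] = {"start_byte": start_byte, "end_byte": position}
        written += 1
    # Finalize: JSON index followed by the 4-byte little-endian footer.
    index_bytes = json.dumps(index).encode("utf-8")
    out.write(index_bytes)
    out.write(struct.pack("<I", len(index_bytes)))
    return written
```

The size check here ignores the index and footer; the gap between the ~30GB target and the 32GB hard limit leaves room for them.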
Fast Lookup:
- Read Footer: Read last 4 bytes of file
- Get Index Size: Extract index size from footer
- Read Index: Read index_size bytes starting at offset file_size - 4 - index_size
- Parse Index: JSON decode to get file map
- Lookup File: Find target filename in files map
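The lookup steps above might look like this for a seekable local file. It is a sketch under that assumption; `read_index` and `locate` are hypothetical helper names.

```python
import json
import os
import struct
from typing import BinaryIO, Tuple

FOOTER_SIZE = 4  # uint32 index size, little-endian


def read_index(caf: BinaryIO) -> dict:
    """Read the footer, then read and parse the JSON index it points to."""
    caf.seek(0, os.SEEK_END)
    file_size = caf.tell()
    if file_size < FOOTER_SIZE:
        raise ValueError("file is too small to contain a footer")
    caf.seek(file_size - FOOTER_SIZE)
    (index_size,) = struct.unpack("<I", caf.read(FOOTER_SIZE))
    index_start = file_size - FOOTER_SIZE - index_size
    if index_start < 0:
        raise ValueError("footer reports an index larger than the file")
    caf.seek(index_start)
    return json.loads(caf.read(index_size).decode("utf-8"))


def locate(index: dict, filename: str) -> Tuple[int, int]:
    """Return (start_byte, end_byte) for a filename; raises KeyError if absent."""
    entry = index["files"][filename]
    return entry["start_byte"], entry["end_byte"]
```

Against object storage, the same two reads would typically become two ranged requests: one for the last 4 bytes and one for the index itself.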
File Extraction:
- Perform Fast Lookup (above)
- Range Read: Read bytes from start_byte to end_byte
- Return File Data: File is ready for use
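A range read for one index entry can be sketched as follows, assuming end_byte is exclusive as stated in the field descriptions. `extract_file` and `http_range_header` are hypothetical helpers; the second shows how the same entry could be turned into an HTTP Range header value, where both bounds are inclusive.

```python
from typing import BinaryIO


def extract_file(caf: BinaryIO, entry: dict) -> bytes:
    """Read one file's bytes from a seekable CAF file."""
    start, end = entry["start_byte"], entry["end_byte"]
    length = end - start  # end_byte is exclusive
    caf.seek(start)
    data = caf.read(length)
    if len(data) != length:
        raise EOFError("archive ended before the file's end_byte")
    return data


def http_range_header(entry: dict) -> str:
    """Build a Range header value for a ranged GET of one file."""
    start, end = entry["start_byte"], entry["end_byte"]
    if end <= start:
        raise ValueError("empty files have no byte range to request")
    # HTTP byte ranges include the last byte, so subtract 1 from end_byte.
    return f"bytes={start}-{end - 1}"
```

An entry with start_byte 5 and end_byte 10 therefore maps to the header value `bytes=5-9`.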
- Maximum Chunk Size: 32GB (hard limit for compatibility)
- Target Chunk Size: ~30GB (recommended for optimal performance)
- Maximum Files per Chunk: No hard limit (limited by JSON parsing and memory)
- Maximum Filename Length: No hard limit (limited by JSON and filesystem)
- Index Size: Typically <1MB for thousands of files
When storing data, keep a record of which CAF file holds each stored file. A database table that maps filename -> caf_file_id lets a reader identify the right chunk with a single keyed lookup, without opening chunk indexes to search for the file.
Example database structure:
CREATE TABLE file_locations (
    file_path VARCHAR(512) PRIMARY KEY,
    caf_file_id VARCHAR(50) NOT NULL
);
- Overhead: ~1MB index per chunk (for typical file counts)
- Compression: No built-in compression (can be added at storage layer)
- Deduplication: No built-in deduplication (handled at application layer)
- Index Lookup: O(1) average case for file location
- File Extraction: Single range read operation
- Chunk Creation: O(n) where n is number of files
- Memory Usage: Index size (~1MB) + streaming buffers
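The file_path to caf_file_id mapping described earlier could be maintained and queried along these lines. This sketch uses Python's built-in `sqlite3` purely for illustration; the helper names and the chunk ID `chunk-0001` are hypothetical, and a production deployment would likely use a different database.

```python
import sqlite3
from typing import Optional


def create_mapping_table(conn: sqlite3.Connection) -> None:
    """Create the file_locations table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS file_locations (
            file_path VARCHAR(512) PRIMARY KEY,
            caf_file_id VARCHAR(50) NOT NULL
        )
        """
    )


def record_file(conn: sqlite3.Connection, file_path: str, caf_file_id: str) -> None:
    """Record (or update) which CAF file holds a given path."""
    conn.execute(
        "INSERT OR REPLACE INTO file_locations (file_path, caf_file_id) VALUES (?, ?)",
        (file_path, caf_file_id),
    )


def find_chunk(conn: sqlite3.Connection, file_path: str) -> Optional[str]:
    """Return the CAF file ID for a path, or None if it is not recorded."""
    row = conn.execute(
        "SELECT caf_file_id FROM file_locations WHERE file_path = ?",
        (file_path,),
    ).fetchone()
    return row[0] if row else None
```

A reader would call `find_chunk(conn, "documents/report.pdf")` to get the chunk ID, then perform the footer and index reads against that chunk.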