Chunk Archive Format (CAF) Specification

Version: 1.0
Purpose: Efficient storage of multiple files in single chunks with fast random access retrieval
Target Use Case: Streaming files from object storage (S3) into archive chunks for distributed storage (Jackal)

Overview

The Chunk Archive Format (CAF) is designed to package multiple files into single archive files (chunks) while maintaining the ability to quickly extract individual files without reading the entire archive. This format optimizes for:

Storage Efficiency: Reducing the number of individual files stored on distributed systems
Fast Retrieval: Random access to individual files using byte ranges
Streaming Support: Files can be written to the archive as they are received
Size Management: Configurable maximum chunk size (default: ~30GB, max: 32GB)

File Structure

A CAF file consists of three main sections:

┌─────────────────────────────────────┐
│           File Data Section         │
│  ┌─────────────┐ ┌─────────────┐    │
│  │   File 1    │ │   File 2    │    │
│  │   Data      │ │   Data      │    │
│  └─────────────┘ └─────────────┘    │
│            ... more files ...       │
├─────────────────────────────────────┤
│         File Index Section          │
│        (JSON-encoded map)           │
├─────────────────────────────────────┤
│       Footer (4 bytes)              │
│  ┌───────────────────────────────┐  │
│  │         Index Size            │  │
│  │         (4 bytes)             │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘

Section Details

1. File Data Section

Files are stored sequentially in their original binary form. No compression or encoding is applied at this level - files are stored as-is to maintain integrity and simplify streaming operations.

Properties:

Files are concatenated directly without padding or separators
Original file content is preserved byte-for-byte
Files are written in the order they are received

2. File Index Section

A JSON-encoded map that provides metadata for fast file location and retrieval.

Structure:

{
  "format_version": "1.0",
  "files": {
    "path/to/file1.jpg": {
      "start_byte": 0,
      "end_byte": 1048576
    },
    "documents/report.pdf": {
      "start_byte": 1048576,
      "end_byte": 2097152
    }
  }
}

Field Descriptions:

format_version: CAF format version for future compatibility
files: Map of filename to file metadata
- start_byte: Byte offset where file data begins (0-indexed, inclusive)
- end_byte: Byte offset one past the last byte of file data (exclusive), so the file length is end_byte - start_byte
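Because end_byte is exclusive and files are concatenated without padding, each entry's length is a simple subtraction, and the entries should tile the data section without gaps. The sketch below illustrates both points; the helper names `file_sizes` and `check_contiguous` are hypothetical and not part of the format.

```python
def file_sizes(index: dict) -> dict:
    """Return the length in bytes of every file listed in a CAF index."""
    return {
        name: entry["end_byte"] - entry["start_byte"]
        for name, entry in index["files"].items()
    }


def check_contiguous(index: dict) -> int:
    """Verify that entries cover the data section with no gaps or overlaps.

    Returns the total size of the data section. Raises ValueError otherwise.
    """
    expected_start = 0
    entries = sorted(
        index["files"].values(),
        key=lambda e: (e["start_byte"], e["end_byte"]),
    )
    for entry in entries:
        if entry["end_byte"] < entry["start_byte"]:
            raise ValueError("end_byte precedes start_byte")
        if entry["start_byte"] != expected_start:
            raise ValueError("gap or overlap in file data section")
        expected_start = entry["end_byte"]
    return expected_start
```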

3. Footer Section (4 bytes)

The footer enables fast parsing by providing the index size.

Structure:

Bytes 0-3: Index Size (uint32, little-endian)

Details:

Index Size: Size of the JSON index in bytes (excluding footer)
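As an illustration of the footer layout, the following sketch packs and unpacks the index size with Python's `struct` module. The function names `encode_footer` and `decode_footer` are assumptions made for this example.

```python
import struct

FOOTER_SIZE = 4  # bytes, per the footer layout described above


def encode_footer(index_size: int) -> bytes:
    """Pack the index size as an unsigned 32-bit little-endian integer."""
    return struct.pack("<I", index_size)


def decode_footer(footer: bytes) -> int:
    """Unpack the index size from the last 4 bytes of a CAF file."""
    if len(footer) != FOOTER_SIZE:
        raise ValueError("footer must be exactly 4 bytes")
    (index_size,) = struct.unpack("<I", footer)
    return index_size
```

For example, an index of 258 bytes (0x0102) is stored as the byte sequence 02 01 00 00.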

Implementation Guidelines

Creating a CAF File

Initialize: Open output stream/file for writing
Stream Files: For each input file:
- Record current byte position as start_byte
- Stream file data directly to output
- Record final byte position as end_byte
- Add entry to in-memory file index
- Check if chunk size limit (~30GB) would be exceeded by next file
Finalize: When chunk is complete:
- Serialize file index to JSON
- Write index to output stream
- Calculate index size (byte length of the serialized index)
- Write footer (index size as uint32, little-endian)
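The creation steps above can be sketched as a single streaming function. This is a minimal sketch, not a reference implementation: the name `write_caf`, the `buffer_size` parameter, and the choice of UTF-8 for the serialized index are assumptions, and the chunk size check is left to the caller.

```python
import json
import struct
from typing import BinaryIO, Iterable, Tuple


def write_caf(
    out: BinaryIO,
    files: Iterable[Tuple[str, BinaryIO]],
    buffer_size: int = 1 << 20,
) -> dict:
    """Stream (name, readable) pairs into `out` and append index and footer."""
    index = {"format_version": "1.0", "files": {}}
    position = 0  # tracked manually so `out` does not need to be seekable

    for name, source in files:
        start_byte = position
        while True:
            block = source.read(buffer_size)
            if not block:
                break
            out.write(block)
            position += len(block)
        # end_byte is the position after the last byte written (exclusive)
        index["files"][name] = {"start_byte": start_byte, "end_byte": position}

    # Assumption: the index is encoded as UTF-8
    index_bytes = json.dumps(index).encode("utf-8")
    out.write(index_bytes)
    out.write(struct.pack("<I", len(index_bytes)))  # 4-byte footer
    return index
```

Tracking the write position in a counter rather than calling `out.tell()` keeps the sketch usable with non-seekable output streams, such as an upload in progress.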

Reading from a CAF File

Fast File Lookup

Read Footer: Read last 4 bytes of file
Get Index Size: Extract index size from footer
Read Index: Read index_size bytes starting at offset file_size - 4 - index_size
Parse Index: JSON decode to get file map
Lookup File: Find target filename in files map
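The lookup steps above might look like the following for a seekable local file. The function name `read_index` and the UTF-8 decoding are assumptions of this sketch. Against object storage, the same offsets would be fetched with ranged reads instead of seeks.

```python
import json
import os
import struct
from typing import BinaryIO

FOOTER_SIZE = 4


def read_index(archive: BinaryIO) -> dict:
    """Read the footer, then the JSON index, without touching file data."""
    archive.seek(0, os.SEEK_END)
    file_size = archive.tell()
    if file_size < FOOTER_SIZE:
        raise ValueError("file is too small to contain a CAF footer")

    # Footer: last 4 bytes hold the index size (uint32, little-endian)
    archive.seek(file_size - FOOTER_SIZE)
    (index_size,) = struct.unpack("<I", archive.read(FOOTER_SIZE))

    index_start = file_size - FOOTER_SIZE - index_size
    if index_start < 0:
        raise ValueError("index size in footer exceeds file size")

    archive.seek(index_start)
    # Assumption: the index is encoded as UTF-8
    return json.loads(archive.read(index_size).decode("utf-8"))
```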

Extract Specific File

Perform Fast Lookup (above)
Range Read: Read bytes from start_byte up to, but not including, end_byte (end_byte - start_byte bytes in total)
Return File Data: File is ready for use
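A sketch of the extraction step, assuming the index has already been parsed into a dictionary. The names `extract_file` and `http_range_header` are hypothetical. The second helper is included because HTTP Range headers are inclusive at both ends, whereas end_byte is exclusive, so the last byte requested is end_byte - 1.

```python
from typing import BinaryIO


def extract_file(archive: BinaryIO, index: dict, name: str) -> bytes:
    """Return the bytes of one file from a seekable CAF archive."""
    entry = index["files"][name]  # raises KeyError if the file is not listed
    start, end = entry["start_byte"], entry["end_byte"]
    archive.seek(start)
    data = archive.read(end - start)
    if len(data) != end - start:
        raise EOFError("archive ended before the file's end_byte")
    return data


def http_range_header(entry: dict) -> str:
    """Build an HTTP Range header value for one index entry."""
    start, end = entry["start_byte"], entry["end_byte"]
    if end <= start:
        # A zero-length file has no bytes to request
        raise ValueError("cannot build a byte range for an empty file")
    return f"bytes={start}-{end - 1}"
```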

Size Limits and Constraints

Maximum Chunk Size: 32GB (hard limit for compatibility)
Target Chunk Size: ~30GB (recommended for optimal performance)
Maximum Files per Chunk: No hard limit (limited by JSON parsing and memory)
Maximum Filename Length: No hard limit (limited by JSON and filesystem)
Index Size: Typically <1MB for thousands of files; the uint32 footer field caps it at 4,294,967,295 bytes
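One way a writer could apply the target size is to decide, before each file, whether to close the current chunk. The sketch below assumes decimal gigabytes (the document does not say whether GB means 10^9 or 2^30 bytes) and assumes that a file larger than the target is still written to an empty chunk. The name `should_roll_over` is hypothetical.

```python
# Assumption: "GB" is interpreted as 10**9 bytes
TARGET_CHUNK_SIZE = 30 * 10**9
MAX_CHUNK_SIZE = 32 * 10**9


def should_roll_over(
    current_size: int,
    next_file_size: int,
    target: int = TARGET_CHUNK_SIZE,
) -> bool:
    """Return True if the current chunk should be finalized before the next file."""
    # An empty chunk always accepts the next file (assumption of this sketch)
    return current_size > 0 and current_size + next_file_size > target
```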

Database Integration

Keep a record of which CAF file holds each stored file. A database table that maps each file path to its caf_file_id lets a reader find the right chunk with a single primary-key lookup.

Example database structure:

CREATE TABLE file_locations (
    file_path VARCHAR(512) PRIMARY KEY,
    caf_file_id VARCHAR(50) NOT NULL
);
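To illustrate the mapping, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The helper names `register_chunk` and `find_chunk` and the chunk identifier `chunk-0001` are made up for the example. A production deployment would likely use a different database.

```python
import sqlite3
from typing import Optional


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the file_locations table shown above."""
    conn.execute(
        "CREATE TABLE file_locations ("
        " file_path VARCHAR(512) PRIMARY KEY,"
        " caf_file_id VARCHAR(50) NOT NULL)"
    )


def register_chunk(conn: sqlite3.Connection, caf_file_id: str, index: dict) -> None:
    """Record every file path in a chunk's index against its CAF file id."""
    rows = [(path, caf_file_id) for path in index["files"]]
    conn.executemany(
        "INSERT INTO file_locations (file_path, caf_file_id) VALUES (?, ?)", rows
    )
    conn.commit()


def find_chunk(conn: sqlite3.Connection, file_path: str) -> Optional[str]:
    """Return the CAF file id holding `file_path`, or None if unknown."""
    row = conn.execute(
        "SELECT caf_file_id FROM file_locations WHERE file_path = ?", (file_path,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
create_schema(conn)
register_chunk(
    conn,
    "chunk-0001",  # hypothetical identifier
    {"files": {"path/to/file1.jpg": {}, "documents/report.pdf": {}}},
)
```

Once the chunk is known, the reader opens that CAF file and uses its index to locate the byte range.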
