Methodology and knowledge base for creating Analysis Ready Datasets—enhanced, materialized datasets that absorb the compute debt of research communities.
Analysis Ready Datasets (ARDs) transform raw catalog data into queryable products where common derived quantities have been pre-computed. Rather than distributing raw measurements and expecting each researcher to independently derive what they need, an ARD front-loads the computational cost of "high-friction metrics"—the calculations that appear repeatedly in published research but require significant compute to generate.
This section provides context for the ARD concept and methodology. If you're already familiar with the layer model, skip to Case Studies.
The ARD concept addresses a fundamental inefficiency in data-intensive research: every research group repeats the same foundational processing steps—fitting continua, measuring indices, calculating local densities. This redundancy wastes collective compute resources and creates barriers for researchers without HPC access.
ARDs extend beyond traditional Value-Added Catalogs (VACs) through three distinguishing characteristics:
- Literature-driven materialization: Derived quantities identified through systematic review of what researchers actually compute
- ML modality layers: Enhancement with embeddings and graph structures that enable new query patterns (similarity search, topological analysis)
- Community-scoped compute absorption: Expensive operations performed once with high rigor, converting processor time into storage space
This documentation serves three audiences:

| Audience | Use Case |
|---|---|
| ARD Implementers | Methodology guidance for building domain-specific ARDs |
| Data Scientists | Patterns for ML-enhanced dataset design |
| Researchers | Understanding what ARDs provide and how to use them |
Current status of the project's major work areas:

| Area | Status | Description |
|---|---|---|
| Framework Documentation | 🔄 In Progress | Core methodology being extracted from case studies |
| DESI Case Study | 🔄 In Progress | Full layer stack implementation |
| Steam Case Study | ⬜ Planned | Retrospective documentation of proto-ARD |
| Publication | ⬜ Planned | Methodology paper with case studies |
ARDs are built in progressive layers, each enabling query patterns impossible at the layer below.
| Layer | Contents | Query Patterns Enabled |
|---|---|---|
| 0: Raw | Source measurements as released | Basic filtering, aggregation |
| 1: Scalars | Literature-validated derived quantities | Physics-based selection, diagnostics |
| 2: Vectors | Embedding representations | Similarity search, anomaly detection |
| 3: Graphs | Relationship structures | Topological queries, network analysis |
Each layer builds on the one below—embeddings over enriched feature spaces capture more meaning than embeddings over raw measurements alone.
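The vector layer's similarity-search pattern can be sketched with plain NumPy. This uses random placeholder vectors; a real ARD would ship model-derived embeddings (e.g., from the spectral foundation models named in the DESI case study):

```python
import numpy as np

# Stand-in for a Layer-2 embedding table: one unit-normalized 64-d vector per object.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def most_similar(query_idx: int, k: int = 5) -> np.ndarray:
    """Return indices of the k nearest neighbours by cosine similarity."""
    sims = embeddings @ embeddings[query_idx]  # unit vectors -> cosine similarity
    order = np.argsort(-sims)                  # descending similarity
    return order[order != query_idx][:k]       # drop the query object itself

neighbours = most_similar(42)
print(neighbours)
```

This is the kind of query Layer 0 cannot answer at all: without the embedding column, "find objects like this one" has no well-defined meaning over raw measurements.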
```
analysis-ready-datasets/
├── 📂 docs/
│   ├── 📂 framework/                # ARD concept, layer model, methodology
│   ├── 📂 case-studies/             # Domain-specific implementations
│   │   ├── 📂 desi/                 # DESI spectroscopic survey ARD
│   │   └── 📂 steam/                # Steam Dataset 2025 (proto-ARD)
│   └── 📂 documentation-standards/
├── 📂 research/                     # GDR outputs, discovery artifacts
├── 📂 assets/                       # Figures, diagrams
├── 📂 work-logs/                    # Development history
├── 📄 LICENSE
├── 📄 LICENSE-DATA
└── 📄 README.md                     # This file
```
| Case Study | Domain | Status | Description |
|---|---|---|---|
| DESI ARD | Astronomy | 🔄 Active | ~6.4M extragalactic objects with full layer stack |
| Steam Dataset 2025 | Gaming | ⬜ Retrospective | Proto-ARD demonstrating cross-domain applicability |
The primary case study driving methodology development. The Dark Energy Spectroscopic Instrument provides ~18M spectra; this ARD implements scalar materializations (stellar mass, SFR, environmental metrics), spectral embeddings (Universal Spectrum Tokenizer, AstroCLIP), and cosmic web topology.
Implementation: desi-cosmic-void-galaxies
A gaming catalog enhanced with BGE-M3 embeddings and graph-ready relationships. It was developed before the ARD concept was formalized, and serves as retrospective validation that the pattern has practical value across domains.
Implementation: steam-dataset-2025
We practice open science and open methodology—our version of "showing your work":
- Research methodologies are fully documented and repeatable
- Discovery processes (GDR sessions, literature reviews) are preserved as artifacts
- Implementation patterns are published so others can adapt them
- Learning processes are captured for community benefit
To build an ARD for your own domain:

- Review the Framework Documentation for methodology
- Study the DESI Case Study for a complete example
- Adapt the discovery approaches (literature-driven, schema-driven) to your domain
To use an existing ARD:

- Find the ARD for your domain in Case Studies
- Access the implementation repository for data products
- Query pre-computed columns instead of deriving them yourself
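In practice that last step is ordinary SQL against the ARD's tables. A minimal sketch using an in-memory SQLite stand-in (table and column names are hypothetical, not a published schema):

```python
import sqlite3

# In-memory stand-in for an ARD table of Layer-1 scalars.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE galaxies (target_id INTEGER, stellar_mass REAL, local_density REAL)"
)
con.executemany(
    "INSERT INTO galaxies VALUES (?, ?, ?)",
    [(1, 1e10, 0.2), (2, 5e9, 1.5), (3, 2e11, 0.8)],
)

# Select on the pre-computed environmental metric directly,
# instead of downloading raw data and deriving local_density yourself.
rows = con.execute(
    "SELECT target_id FROM galaxies WHERE local_density > 0.5 ORDER BY target_id"
).fetchall()
print([r[0] for r in rows])  # prints [2, 3]
```

The selection itself is cheap; the point is that the expensive part (computing `local_density` for every object) was absorbed once by the ARD.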
Documentation: MIT License — see LICENSE for details.
Data products: See LICENSE-DATA for dataset-specific terms.
- DESI Collaboration — Source data for primary case study
- Open source ML community — Foundation models enabling the vector layer
Last Updated: 2025-01-06 | Active Development


