Methodology and knowledge base for creating Analysis Ready Datasets—enhanced, materialized datasets that absorb the compute debt of research communities.
Analysis Ready Datasets (ARDs) transform raw catalog data into queryable products where common derived quantities have been pre-computed. Rather than distributing raw measurements and expecting each researcher to independently derive what they need, an ARD front-loads the computational cost of "high-friction metrics"—the calculations that appear repeatedly in published research but require significant compute to generate.
This section provides context for the ARD concept and methodology. If you're already familiar with the layer model, skip to Case Studies.
The ARD concept addresses a fundamental inefficiency in data-intensive research: every research group repeats the same foundational processing steps—fitting continua, measuring indices, calculating local densities. This redundancy wastes collective compute resources and creates barriers for researchers without HPC access.
ARDs extend beyond traditional Value-Added Catalogs (VACs) through three distinguishing characteristics:
- Literature-driven materialization: Derived quantities identified through systematic review of what researchers actually compute
- ML modality layers: Enhancement with embeddings and graph structures that enable new query patterns (similarity search, topological analysis)
- Community-scoped compute absorption: Expensive operations performed once with high rigor, converting processor time into storage space
This documentation serves three audiences:

| Audience | Use Case |
|---|---|
| ARD Implementers | Methodology guidance for building domain-specific ARDs |
| Data Scientists | Patterns for ML-enhanced dataset design |
| Researchers | Understanding what ARDs provide and how to use them |
Current status of the project's major work areas:

| Area | Status | Description |
|---|---|---|
| Framework Documentation | 🔄 In Progress | Core methodology being extracted from case studies |
| DESI Case Study | 🔄 In Progress | Full layer stack implementation |
| Steam Case Study | ⬜ Planned | Retrospective documentation of proto-ARD |
| Publication | ⬜ Planned | Methodology paper with case studies |
ARDs are built in progressive layers, each enabling query patterns impossible at the layer below.
| Layer | Contents | Query Patterns Enabled |
|---|---|---|
| 0: Raw | Source measurements as released | Basic filtering, aggregation |
| 1: Scalars | Literature-validated derived quantities | Physics-based selection, diagnostics |
| 2: Vectors | Embedding representations | Similarity search, anomaly detection |
| 3: Graphs | Relationship structures | Topological queries, network analysis |
Each layer builds on the one below—embeddings over enriched feature spaces capture more meaning than embeddings over raw measurements alone.
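The vector layer's similarity-search pattern can be sketched with plain NumPy. This uses random placeholder vectors; a real ARD would ship model-derived embeddings (e.g., from the spectral foundation models named in the DESI case study):

```python
import numpy as np

# Stand-in for a Layer-2 embedding table: one unit-normalized 64-d vector per object.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def most_similar(query_idx: int, k: int = 5) -> np.ndarray:
    """Return indices of the k nearest neighbours by cosine similarity."""
    sims = embeddings @ embeddings[query_idx]  # unit vectors -> cosine similarity
    order = np.argsort(-sims)                  # descending similarity
    return order[order != query_idx][:k]       # drop the query object itself

neighbours = most_similar(42)
print(neighbours)
```

This is the kind of query Layer 0 cannot answer at all: without the embedding column, "find objects like this one" has no well-defined meaning over raw measurements.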
```
analysis-ready-datasets/
├── 📂 docs/
│   ├── 📂 framework/                # ARD concept, layer model, methodology
│   ├── 📂 case-studies/             # Domain-specific implementations
│   │   ├── 📂 desi/                 # DESI spectroscopic survey ARD
│   │   └── 📂 steam/                # Steam Dataset 2025 (proto-ARD)
│   └── 📂 documentation-standards/
├── 📂 research/                     # GDR outputs, discovery artifacts
├── 📂 assets/                       # Figures, diagrams
├── 📂 work-logs/                    # Development history
├── 📄 LICENSE
├── 📄 LICENSE-DATA
└── 📄 README.md                     # This file
```
| Case Study | Domain | Status | Description |
|---|---|---|---|
| DESI ARD | Astronomy | 🔄 Active | ~6.4M extragalactic objects with full layer stack |
| Steam Dataset 2025 | Gaming | ⬜ Retrospective | Proto-ARD demonstrating cross-domain applicability |
The primary case study driving methodology development. The Dark Energy Spectroscopic Instrument provides ~18M spectra; this ARD implements scalar materializations (stellar mass, SFR, environmental metrics), spectral embeddings (Universal Spectrum Tokenizer, AstroCLIP), and cosmic web topology.
Implementation: desi-cosmic-void-galaxies
A gaming catalog enhanced with BGE-M3 embeddings and graph-ready relationships. It was developed before the ARD concept was formalized, and serves as retrospective validation that the pattern has practical value across domains.
Implementation: steam-dataset-2025
We practice open science and open methodology—our version of "showing your work":
- Research methodologies are fully documented and repeatable
- Discovery processes (GDR sessions, literature reviews) are preserved as artifacts
- Implementation patterns are published so others can adapt them
- Learning processes are captured for community benefit
To build an ARD for your own domain:

- Review the Framework Documentation for methodology
- Study the DESI Case Study for a complete example
- Adapt the discovery approaches (literature-driven, schema-driven) to your domain
To use an existing ARD:

- Find the ARD for your domain in Case Studies
- Access the implementation repository for data products
- Query pre-computed columns instead of deriving them yourself
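In practice that last step is ordinary SQL against the ARD's tables. A minimal sketch using an in-memory SQLite stand-in (table and column names are hypothetical, not a published schema):

```python
import sqlite3

# In-memory stand-in for an ARD table of Layer-1 scalars.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE galaxies (target_id INTEGER, stellar_mass REAL, local_density REAL)"
)
con.executemany(
    "INSERT INTO galaxies VALUES (?, ?, ?)",
    [(1, 1e10, 0.2), (2, 5e9, 1.5), (3, 2e11, 0.8)],
)

# Select on the pre-computed environmental metric directly,
# instead of downloading raw data and deriving local_density yourself.
rows = con.execute(
    "SELECT target_id FROM galaxies WHERE local_density > 0.5 ORDER BY target_id"
).fetchall()
print([r[0] for r in rows])  # prints [2, 3]
```

The selection itself is cheap; the point is that the expensive part (computing `local_density` for every object) was absorbed once by the ARD.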
Documentation: MIT License — see LICENSE for details.
Data products: See LICENSE-DATA for dataset-specific terms.
- DESI Collaboration — Source data for primary case study
- Open source ML community — Foundation models enabling the vector layer
Last Updated: 2025-01-06 | Active Development


