ml-agent-skills

Recommended repository description:

Agent Skills that help LLM coding agents run reproducible machine learning workflows.

ML Agent Skills helps LLM coding agents like Codex and Claude Code run reproducible machine learning workflows. This repository packages portable Agent Skills that make coding agents follow fixed workflows instead of improvising one-off scripts.

The flagship skill, tabular-ml-lab, helps agents run CSV binary classification workflows with data profiling, leakage checks, baseline training, metrics, model cards, and reports from a task.yaml file.

ml-agent-skills is not a production AutoML platform. It does not provide a SaaS backend, Web UI, notebook control, model deployment, monitoring, or production-readiness guarantees.

If you are looking for a way to use LLM agents for machine learning, this repo shows how Agent Skills can turn Codex or Claude Code into a workflow agent for reproducible tabular machine learning experiments.

Why this exists

LLM coding agents can write machine learning code quickly, but they can also train misleading models if they skip data profiling or leakage checks.

In the credit_risk dogfood run, the first task included a post-outcome field named defaulted_reason_code. Both baseline models scored an apparent 1.0, which looked impressive but was not a result to trust.

The workflow flagged defaulted_reason_code as high leakage before the result was trusted. After adding it to ignore_columns, the metrics returned to a more realistic baseline. That is the core idea of this project: let agents move quickly, but force the machine learning workflow through reviewable checks before anyone trusts the result.
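To make the failure mode concrete, here is a minimal sketch of one way a leakage screen could flag a column like defaulted_reason_code. It is not the tabular-ml-lab implementation: the function name single_feature_purity, the 0.98 cutoff, the income_band column, and the toy rows are all assumptions for illustration. The idea is to measure how well a single column predicts the target on its own, by guessing the majority label within each of its values.

```python
from collections import Counter, defaultdict


def single_feature_purity(rows, feature, target):
    """Accuracy of predicting `target` from `feature` alone.

    For each distinct feature value we guess the most common label seen
    with that value. A score near 1.0 means the column almost determines
    the outcome, which is suspicious for a post-outcome field.
    """
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[target]] += 1
    correct = sum(max(counts.values()) for counts in by_value.values())
    return correct / len(rows)


# Toy rows (invented for this sketch, not taken from the credit_risk example).
rows = [
    {"income_band": "low", "defaulted_reason_code": "MISSED_PAYMENT", "defaulted": 1},
    {"income_band": "low", "defaulted_reason_code": "NONE", "defaulted": 0},
    {"income_band": "high", "defaulted_reason_code": "NONE", "defaulted": 0},
    {"income_band": "high", "defaulted_reason_code": "BANKRUPTCY", "defaulted": 1},
    {"income_band": "low", "defaulted_reason_code": "NONE", "defaulted": 0},
    {"income_band": "high", "defaulted_reason_code": "NONE", "defaulted": 0},
]

HIGH_LEAKAGE_CUTOFF = 0.98  # hypothetical cutoff, chosen only for the demo

flagged = [
    feature
    for feature in ("income_band", "defaulted_reason_code")
    if single_feature_purity(rows, feature, "defaulted") >= HIGH_LEAKAGE_CUTOFF
]
print(flagged)
```

Under this heuristic a unique identifier column would also score 1.0, which is one reason columns such as customer_id belong in ignore_columns.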

Repository Layout

skills/
  tabular-ml-lab/
    SKILL.md
    references/
    scripts/
    assets/
src/
  agentic_tabular_ml/
examples/
docs/
adapters/
tests/

Current Skill

tabular-ml-lab supports:

CSV input only
Binary classification only
Seeded local train/test split
Baselines: Logistic Regression and RandomForestClassifier
Leakage screening with leakage_policy: warn | fail
Class balance warnings when accuracy may be misleading
Holdout metrics, average precision, threshold analysis, feature importance, model card, and final report
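As a rough illustration of what a seeded, stratified train/test split means, the sketch below uses only the standard library. It is an assumption-laden stand-in, not the code the skill runs: stratified_split and its rounding rule are hypothetical, and the real workflow may rely on a library splitter instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with the class mix preserved in the test set.

    A dedicated random.Random(seed) instance makes the split repeatable
    without touching the global random state.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    test_idx = []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Take the same fraction from every class, and at least one row.
        n_test = max(1, round(len(members) * test_size))
        test_idx.extend(members[:n_test])

    test_set = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in test_set]
    return train_idx, sorted(test_idx)


labels = [1] * 8 + [0] * 32  # 20% positives
train_a, test_a = stratified_split(labels, test_size=0.25, seed=42)
train_b, test_b = stratified_split(labels, test_size=0.25, seed=42)
print(test_a == test_b, len(train_a), len(test_a))
```

Running the split twice with the same seed yields the same row indices, which is the property that lets two agents (or two reruns) compare metrics on identical holdout data.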

Using LLM Agents for Machine Learning

LLMs should not directly replace traditional machine learning models for tabular prediction tasks. They are not a reliable substitute for trained estimators, holdout validation, leakage checks, or metric review.

LLMs are more useful as workflow agents. Codex, Claude Code, and similar coding agents can read a task file, run deterministic scripts, inspect generated artifacts, summarize warnings, and help users iterate on a reproducible machine learning workflow.

ml-agent-skills gives those agents a fixed process. Instead of asking an LLM to invent a machine learning experiment from scratch, you can ask it to use tabular-ml-lab to run data profiling, leakage checks, baseline training, evaluation, model cards, and reports in a repeatable way.

Ask your agent

Copy one of these prompts into Codex or Claude Code:

Use the tabular-ml-lab skill to run a reproducible baseline ML workflow on examples/churn/task.yaml. Generate artifacts under outputs/churn and summarize final_report.md.

I want to use an LLM agent for machine learning on a CSV dataset. Create a task.yaml, inspect the data, check leakage risks, train baseline models, and generate a model card and final report.

In Codex, use the tabular-ml-lab Agent Skill to run examples/credit_risk/task.yaml, then explain why accuracy may be misleading and which threshold_report.csv trade-offs matter.

In Claude Code, use tabular-ml-lab to train Logistic Regression and RandomForest baselines from my CSV task.yaml. Do not add new model backends; use the existing reproducible workflow and report artifacts.

Install

This project uses uv to manage the Python environment.

uv sync

Run the test suite:

uv run pytest

Quick Start

Run the bundled churn example from the repository root:

uv run atm run examples/churn/task.yaml --output outputs/churn

You can also use the project command name:

uv run ml-agent-skills run examples/churn/task.yaml --output outputs/churn

Artifacts are written to the exact output directory passed with --output.

Artifacts

Every complete tabular-ml-lab run must generate:

data_profile.md
leakage_report.md
model_comparison.csv
metrics.json
threshold_report.csv
feature_importance.csv
model_card.md
final_report.md
model.pkl
run_manifest.json
resolved_task.yaml

See docs/artifact-contract.md for the full artifact contract.
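An agent or CI job can check the artifact list above mechanically. The helper below is a small sketch, assuming all artifacts sit directly in the output directory; missing_artifacts is a hypothetical name and is not part of the project's CLI.

```python
from pathlib import Path

# File names copied from the artifact list in this README.
REQUIRED_ARTIFACTS = [
    "data_profile.md",
    "leakage_report.md",
    "model_comparison.csv",
    "metrics.json",
    "threshold_report.csv",
    "feature_importance.csv",
    "model_card.md",
    "final_report.md",
    "model.pkl",
    "run_manifest.json",
    "resolved_task.yaml",
]


def missing_artifacts(output_dir):
    """Return the required artifact names that are not files in output_dir."""
    root = Path(output_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (root / name).is_file()]


if __name__ == "__main__":
    # Example: check the Quick Start output directory.
    print(missing_artifacts("outputs/churn"))
```

An empty list means the run produced every file the contract names; it says nothing about whether the contents are valid.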

threshold_report.csv reports precision, recall, f1, and predicted positive rate at the default set of probability thresholds, from 0.1 to 0.9. It is intended to make threshold trade-offs easier to review, especially for imbalanced binary classification tasks.
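For readers who want to see what such a report contains, the sketch below computes the same four quantities for a list of labels and predicted probabilities. It is an illustration under assumptions: the function name threshold_rows, the dictionary keys, the 0.1 step, and the choice to treat a probability equal to the threshold as positive are not confirmed by the project.

```python
def threshold_rows(y_true, y_prob, thresholds=None):
    """Compute precision, recall, f1, and predicted positive rate per threshold."""
    if thresholds is None:
        # 0.1, 0.2, ..., 0.9 (rounded to avoid float noise such as 0.30000000000000004)
        thresholds = [round(0.1 * k, 1) for k in range(1, 10)]

    rows = []
    for t in thresholds:
        y_pred = [1 if p >= t else 0 for p in y_prob]
        tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
        fp = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 1)
        fn = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 0)

        # Define the ratio as 0.0 when its denominator is zero.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if (precision + recall)
            else 0.0
        )
        rows.append(
            {
                "threshold": t,
                "precision": precision,
                "recall": recall,
                "f1": f1,
                "predicted_positive_rate": sum(y_pred) / len(y_pred),
            }
        )
    return rows


# Toy scores invented for the sketch: 3 positives out of 10 rows.
y_true = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
y_prob = [0.92, 0.15, 0.55, 0.35, 0.05, 0.62, 0.22, 0.45, 0.08, 0.31]

for row in threshold_rows(y_true, y_prob):
    print(row)
```

On this toy data, lowering the threshold from 0.5 to 0.3 raises recall from 2/3 to 1.0 while precision drops from 2/3 to 0.5, which is the kind of trade-off the report is meant to surface.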

Task Configuration

Dataset paths are resolved relative to the task file. Output paths are resolved relative to the current working directory unless run.output_dir is absolute. CLI --output is treated as the exact output directory and overrides run.output_dir/run.name.

run:
  name: churn_baseline
  output_dir: outputs
  random_seed: 42

data:
  path: churn.csv
  format: csv
  target: churn
  ignore_columns:
    - customer_id

task:
  type: binary_classification

leakage_policy: warn

split:
  test_size: 0.25
  stratify: true

modeling:
  positive_label: 1
  selection_metric: roc_auc
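The path rules stated before this example task file can be sketched as follows. This is one reading of those rules, not the project's code: both function names are hypothetical, the default output location of run.output_dir/run.name is inferred from the override wording, and treating a relative --output as relative to the working directory is an assumption. PurePosixPath is used so the example behaves the same on every platform.

```python
from pathlib import PurePosixPath


def resolve_dataset_path(task_file, data_path):
    """Resolve data.path against the directory that contains the task file."""
    return PurePosixPath(task_file).parent / data_path


def resolve_output_dir(cwd, output_dir, run_name, cli_output=None):
    """Pick the artifact directory.

    Joining with an absolute right-hand path discards the left side, so an
    absolute run.output_dir (or --output) is used as-is, while a relative
    one is anchored at the current working directory.
    """
    if cli_output is not None:
        # --output is the exact directory; run.name is not appended.
        return PurePosixPath(cwd) / cli_output
    return PurePosixPath(cwd) / output_dir / run_name


print(resolve_dataset_path("examples/churn/task.yaml", "churn.csv"))
print(resolve_output_dir("/work/repo", "outputs", "churn_baseline"))
print(resolve_output_dir("/work/repo", "outputs", "churn_baseline", cli_output="outputs/churn"))
print(resolve_output_dir("/work/repo", "/tmp/runs", "churn_baseline"))
```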

leakage_policy defaults to warn. With warn, high-severity leakage candidates are shown in the CLI summary, leakage_report.md, and final_report.md, but the workflow continues. With fail, high-severity leakage candidates stop the workflow before model training. Medium- and low-severity leakage candidates are reported for review but do not trigger a policy failure.
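The warn and fail behaviour can be summarized in a few lines. The sketch below is illustrative only: LeakagePolicyError, enforce_leakage_policy, and the shape of the candidate records are hypothetical names, not the project's API.

```python
class LeakagePolicyError(RuntimeError):
    """Raised when leakage_policy is 'fail' and a high-severity candidate exists."""


def enforce_leakage_policy(candidates, policy="warn"):
    """Return the high-severity column names, or raise under the 'fail' policy.

    Medium and low severity candidates never trigger a failure; they are
    only reported for review.
    """
    if policy not in ("warn", "fail"):
        raise ValueError(f"unknown leakage_policy: {policy!r}")

    high = [c["column"] for c in candidates if c["severity"] == "high"]
    if high and policy == "fail":
        # Stop before any model training happens.
        raise LeakagePolicyError(f"high severity leakage candidates: {high}")
    return high


candidates = [
    {"column": "defaulted_reason_code", "severity": "high"},
    {"column": "last_contact_days", "severity": "medium"},  # invented column
]

# Under 'warn' the workflow continues and surfaces the column for review.
print(enforce_leakage_policy(candidates, policy="warn"))
```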

The workflow reports class balance using positive_count, negative_count, positive_rate, and majority_class_baseline_accuracy. When the positive rate is below 0.2 or above 0.8, the CLI summary and reports warn that accuracy may be misleading.
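A short sketch shows why the warning matters. The four field names match the ones listed above; the function name class_balance and the accuracy_may_mislead key are assumptions made for this example.

```python
def class_balance(labels, positive_label=1):
    """Summarize class balance and flag when accuracy is a weak metric."""
    positive_count = sum(1 for label in labels if label == positive_label)
    negative_count = len(labels) - positive_count
    positive_rate = positive_count / len(labels)
    return {
        "positive_count": positive_count,
        "negative_count": negative_count,
        "positive_rate": positive_rate,
        # Accuracy of a model that always predicts the larger class.
        "majority_class_baseline_accuracy": max(positive_rate, 1 - positive_rate),
        # Thresholds from the README: below 0.2 or above 0.8.
        "accuracy_may_mislead": positive_rate < 0.2 or positive_rate > 0.8,
    }


labels = [1] * 5 + [0] * 95  # 5% positives
summary = class_balance(labels)
print(summary)
```

With 5 positives in 100 rows, a model that always predicts the negative class already reaches 0.95 accuracy, so an accuracy of 0.95 on this dataset would carry no evidence of skill.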

Skill Installation

Install with gh skill

If your GitHub CLI has the gh skill command available, you can install and pin the public skill release directly. gh skill may still be in preview or unavailable in some environments; if it is not available, use the adapter scripts below.

gh skill install howardxie-dev/ml-agent-skills tabular-ml-lab --agent codex --pin v0.2.1
gh skill install howardxie-dev/ml-agent-skills tabular-ml-lab --agent claude-code --pin v0.2.1

Pinning to v0.2.1 is recommended for repeatable agent behavior. Avoid blindly tracking main for shared or long-running workflows.

Install with adapters

Install the tabular-ml-lab skill into a Codex-compatible workspace:

bash adapters/codex/install.sh /path/to/workspace

Install the skill into a Claude Code-compatible workspace:

bash adapters/claude-code/install.sh /path/to/workspace

PowerShell installers are also available:

powershell -ExecutionPolicy Bypass -File adapters\codex\install.ps1 -WorkspaceRoot C:\path\to\workspace
powershell -ExecutionPolicy Bypass -File adapters\claude-code\install.ps1 -WorkspaceRoot C:\path\to\workspace

The adapter scripts copy only skills/tabular-ml-lab/. Python dependencies are managed by this repository's pyproject.toml and uv.lock.

Security Note

This project runs local Python scripts. Review the skill files and scripts before installing or executing them in your workspace.

Do not use sensitive data unless you understand your agent, shell, and tool data-sharing settings. The workflow is local-first, but agents and connected tools may have their own logging, telemetry, or sharing behavior depending on your environment.

For repeatable installs, prefer a pinned release tag such as v0.2.1 instead of tracking main.

Non-Goals

No Web UI
No browser or Jupyter control
No full AutoML search
No deployment pipeline
No monitoring
No claim that generated models are production-ready

License

MIT
