howardxie-dev/ml-agent-skills

ml-agent-skills

CI License: MIT Python 3.10+

Languages: English | 简体中文 | 日本語

Recommended repository description:

Agent Skills that help LLM coding agents run reproducible machine learning workflows.

ML Agent Skills helps LLM coding agents like Codex and Claude Code run reproducible machine learning workflows. This repository packages portable Agent Skills that make coding agents follow fixed workflows instead of improvising one-off scripts.

The flagship skill, tabular-ml-lab, helps agents run CSV binary classification workflows with data profiling, leakage checks, baseline training, metrics, model cards, and reports from a task.yaml file.

ml-agent-skills is not a production AutoML platform. It does not provide a SaaS backend, Web UI, notebook control, model deployment, monitoring, or production-readiness guarantees.

If you are looking for a way to use LLM agents for machine learning, this repo shows how Agent Skills can turn Codex or Claude Code into a workflow agent for reproducible tabular machine learning experiments.

Why this exists

LLM coding agents can write machine learning code quickly, but they can also train misleading models if they skip data profiling or leakage checks.

In the credit_risk dogfood run, the first task included a post-outcome field named defaulted_reason_code. Both baseline models appeared to score 1.0, which looked impressive but could not be trusted.

The workflow flagged defaulted_reason_code as high leakage before the result was trusted. After adding it to ignore_columns, the metrics returned to a more realistic baseline. That is the core idea of this project: let agents move quickly, but force the machine learning workflow through reviewable checks before anyone trusts the result.
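
To make the idea concrete, here is a minimal sketch of one way a leakage screen could catch a field like defaulted_reason_code. It is not the heuristic tabular-ml-lab actually uses: the function names, the 0.99 threshold, the sample rows, and the target name defaulted are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def single_feature_accuracy(values, targets):
    """Accuracy of predicting the target from one column alone,
    using the majority target observed for each distinct value."""
    by_value = defaultdict(Counter)
    for value, target in zip(values, targets):
        by_value[value][target] += 1
    correct = sum(max(counts.values()) for counts in by_value.values())
    return correct / len(targets)


def screen_columns(rows, target, high_threshold=0.99):
    """Return columns that predict the target almost perfectly on their own.
    The 0.99 threshold is an assumed value, not the project's setting."""
    targets = [row[target] for row in rows]
    flagged = []
    for column in rows[0]:
        if column == target:
            continue
        accuracy = single_feature_accuracy([row[column] for row in rows], targets)
        if accuracy >= high_threshold:
            flagged.append(column)
    return flagged


# Made-up rows; "defaulted" is a hypothetical target name.
rows = [
    {"income_band": "low", "defaulted_reason_code": "missed_payment", "defaulted": 1},
    {"income_band": "low", "defaulted_reason_code": "none", "defaulted": 0},
    {"income_band": "high", "defaulted_reason_code": "none", "defaulted": 0},
    {"income_band": "high", "defaulted_reason_code": "bankruptcy", "defaulted": 1},
]
print(screen_columns(rows, target="defaulted"))
```

A check this simple would also flag unique identifiers such as customer_id, since every distinct value maps to exactly one target. A real screen needs more signals than this one.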

Repository Layout

skills/
  tabular-ml-lab/
    SKILL.md
    references/
    scripts/
    assets/
src/
  agentic_tabular_ml/
examples/
docs/
adapters/
tests/

Current Skill

tabular-ml-lab supports:

  • CSV input only
  • Binary classification only
  • Seeded local train/test split
  • Baselines: Logistic Regression and RandomForestClassifier
  • Leakage screening with leakage_policy: warn | fail
  • Class balance warnings when accuracy may be misleading
  • Holdout metrics, average precision, threshold analysis, feature importance, model card, and final report

Using LLM Agents for Machine Learning

LLMs should not directly replace traditional machine learning models for tabular prediction tasks. They are not a reliable substitute for trained estimators, holdout validation, leakage checks, or metric review.

LLMs are more useful as workflow agents. Codex, Claude Code, and similar coding agents can read a task file, run deterministic scripts, inspect generated artifacts, summarize warnings, and help users iterate on a reproducible machine learning workflow.

ml-agent-skills gives those agents a fixed process. Instead of asking an LLM to invent a machine learning experiment from scratch, you can ask it to use tabular-ml-lab to run data profiling, leakage checks, baseline training, evaluation, model cards, and reports in a repeatable way.

Ask your agent

Copy one of these prompts into Codex or Claude Code:

Use the tabular-ml-lab skill to run a reproducible baseline ML workflow on examples/churn/task.yaml. Generate artifacts under outputs/churn and summarize final_report.md.
I want to use an LLM agent for machine learning on a CSV dataset. Create a task.yaml, inspect the data, check leakage risks, train baseline models, and generate a model card and final report.
In Codex, use the tabular-ml-lab Agent Skill to run examples/credit_risk/task.yaml, then explain why accuracy may be misleading and which threshold_report.csv trade-offs matter.
In Claude Code, use tabular-ml-lab to train Logistic Regression and RandomForest baselines from my CSV task.yaml. Do not add new model backends; use the existing reproducible workflow and report artifacts.

Install

This project uses uv to manage the Python environment.

uv sync

Run the test suite:

uv run pytest

Quick Start

Run the bundled churn example from the repository root:

uv run atm run examples/churn/task.yaml --output outputs/churn

You can also use the project command name:

uv run ml-agent-skills run examples/churn/task.yaml --output outputs/churn

Artifacts are written to the exact output directory passed with --output.

Artifacts

Every complete tabular-ml-lab run must generate:

  • data_profile.md
  • leakage_report.md
  • model_comparison.csv
  • metrics.json
  • threshold_report.csv
  • feature_importance.csv
  • model_card.md
  • final_report.md
  • model.pkl
  • run_manifest.json
  • resolved_task.yaml

See docs/artifact-contract.md for the full artifact contract.
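
An agent or a CI step can verify the list above after a run. The following is a small sketch, not part of the project: missing_artifacts is a hypothetical helper, and it only checks that each file exists, not that its contents are valid.

```python
from pathlib import Path

# File names taken from the artifact list above.
REQUIRED_ARTIFACTS = [
    "data_profile.md",
    "leakage_report.md",
    "model_comparison.csv",
    "metrics.json",
    "threshold_report.csv",
    "feature_importance.csv",
    "model_card.md",
    "final_report.md",
    "model.pkl",
    "run_manifest.json",
    "resolved_task.yaml",
]


def missing_artifacts(output_dir):
    """Return the required artifact names that are not files in output_dir."""
    out = Path(output_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (out / name).is_file()]
```

For example, missing_artifacts("outputs/churn") returns an empty list when the Quick Start run completed and wrote every file.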

threshold_report.csv reports precision, recall, f1, and predicted positive rate at probability thresholds from 0.1 to 0.9 (the default range). It is intended to make decision-threshold trade-offs easier to review, especially for imbalanced binary classification tasks.
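
The metrics in that report can be computed as in the sketch below. This illustrates the calculation and is not the project's code: the 0.1 step, the >= comparison, the 0.0 fallback when a denominator is zero, and the dictionary keys are assumptions.

```python
def threshold_report(y_true, y_prob, thresholds=None):
    """Compute precision, recall, f1, and predicted positive rate per threshold.
    y_true holds 0/1 labels; y_prob holds predicted positive-class probabilities."""
    if thresholds is None:
        # Assumed grid: 0.1, 0.2, ..., 0.9
        thresholds = [round(0.1 * i, 1) for i in range(1, 10)]
    n = len(y_true)
    rows = []
    for t in thresholds:
        y_pred = [1 if p >= t else 0 for p in y_prob]
        tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append({
            "threshold": t,
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "predicted_positive_rate": (tp + fp) / n,
        })
    return rows
```

Raising the threshold usually trades recall for precision, which is why reading the whole table is more informative than a single accuracy number.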

Task Configuration

Dataset paths are resolved relative to the task file. Output paths are resolved relative to the current working directory unless run.output_dir is absolute. CLI --output is treated as the exact output directory and overrides run.output_dir/run.name.

run:
  name: churn_baseline
  output_dir: outputs
  random_seed: 42

data:
  path: churn.csv
  format: csv
  target: churn
  ignore_columns:
    - customer_id

task:
  type: binary_classification

leakage_policy: warn

split:
  test_size: 0.25
  stratify: true

modeling:
  positive_label: 1
  selection_metric: roc_auc
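
The path rules stated before the configuration can be sketched as follows. This reflects one reading of those rules, not the project's implementation: both function names are hypothetical, the default output location is assumed to be run.output_dir joined with run.name, and a relative --output is assumed to resolve against the current working directory.

```python
from pathlib import Path


def resolve_dataset_path(task_file, data_path):
    """Dataset paths are resolved relative to the directory holding the task file."""
    return Path(task_file).parent / data_path


def resolve_output_dir(cwd, output_dir, run_name, cli_output=None):
    """Pick the directory where artifacts are written.

    --output (cli_output) wins and is used as the exact directory.
    Otherwise the run writes to output_dir/run_name under cwd; joining with
    an absolute output_dir discards cwd, which is standard pathlib behavior."""
    if cli_output is not None:
        return Path(cwd) / cli_output
    return Path(cwd) / output_dir / run_name
```

With the example task above, resolve_dataset_path("examples/churn/task.yaml", "churn.csv") points at examples/churn/churn.csv, and a run without --output would land in outputs/churn_baseline under the working directory.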

leakage_policy defaults to warn. With warn, high severity leakage candidates are shown in the CLI summary, leakage_report.md, and final_report.md, but the workflow continues. With fail, high severity leakage candidates stop the workflow before model training. Medium and low severity leakage candidates are reported for review but do not trigger a policy failure.
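
The policy behavior described above amounts to a small gate that runs before model training. Here is a minimal sketch under stated assumptions: LeakagePolicyError, apply_leakage_policy, and the candidate dictionary shape are hypothetical names, not the project's API.

```python
class LeakagePolicyError(Exception):
    """Raised when leakage_policy is 'fail' and a high severity candidate exists."""


def apply_leakage_policy(candidates, policy="warn"):
    """Return warning messages for high severity candidates, or stop the run.

    candidates is a list of dicts like {"column": ..., "severity": ...}.
    Medium and low severity candidates never trigger a failure."""
    if policy not in ("warn", "fail"):
        raise ValueError(f"unknown leakage_policy: {policy!r}")
    high = [c for c in candidates if c["severity"] == "high"]
    if high and policy == "fail":
        columns = ", ".join(c["column"] for c in high)
        raise LeakagePolicyError(f"high severity leakage candidates: {columns}")
    return [f"high severity leakage candidate: {c['column']}" for c in high]
```

Calling this gate before any estimator is fitted means a fail policy never produces a model trained on a leaked column.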

The workflow reports class balance using positive_count, negative_count, positive_rate, and majority_class_baseline_accuracy. When the positive rate is below 0.2 or above 0.8, the CLI summary and reports warn that accuracy may be misleading.
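
The class balance fields can be derived directly from the target column. The sketch below shows the arithmetic. The function name and the accuracy_may_be_misleading key are assumptions; the four reported field names and the 0.2 and 0.8 bounds come from the description above.

```python
def class_balance(targets, positive_label=1, low=0.2, high=0.8):
    """Summarize class balance for a binary target column."""
    n = len(targets)
    positive_count = sum(1 for t in targets if t == positive_label)
    negative_count = n - positive_count
    positive_rate = positive_count / n
    return {
        "positive_count": positive_count,
        "negative_count": negative_count,
        "positive_rate": positive_rate,
        # Accuracy of always predicting the more common class.
        "majority_class_baseline_accuracy": max(positive_count, negative_count) / n,
        # Hypothetical flag: warn when the positive rate is outside [low, high].
        "accuracy_may_be_misleading": positive_rate < low or positive_rate > high,
    }
```

With one positive in ten rows, the majority class baseline already reaches 0.9 accuracy, so a model reporting 0.9 accuracy has learned nothing useful.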

Skill Installation

Install with gh skill

If your GitHub CLI has the gh skill command available, you can install and pin the public skill release directly. gh skill may still be in preview or unavailable in some environments; if it is not available, use the adapter scripts below.

gh skill install howardxie-dev/ml-agent-skills tabular-ml-lab --agent codex --pin v0.2.1
gh skill install howardxie-dev/ml-agent-skills tabular-ml-lab --agent claude-code --pin v0.2.1

Pinning to v0.2.1 is recommended for repeatable agent behavior. Avoid blindly tracking main for shared or long-running workflows.

Install with adapters

Install the tabular-ml-lab skill into a Codex-compatible workspace:

bash adapters/codex/install.sh /path/to/workspace

Install the skill into a Claude Code-compatible workspace:

bash adapters/claude-code/install.sh /path/to/workspace

PowerShell installers are also available:

powershell -ExecutionPolicy Bypass -File adapters\codex\install.ps1 -WorkspaceRoot C:\path\to\workspace
powershell -ExecutionPolicy Bypass -File adapters\claude-code\install.ps1 -WorkspaceRoot C:\path\to\workspace

The adapter scripts copy only skills/tabular-ml-lab/. Python dependencies are managed by this repository's pyproject.toml and uv.lock.

Security Note

This project runs local Python scripts. Review the skill files and scripts before installing or executing them in your workspace.

Do not use sensitive data unless you understand your agent, shell, and tool data-sharing settings. The workflow is local-first, but agents and connected tools may have their own logging, telemetry, or sharing behavior depending on your environment.

For repeatable installs, prefer a pinned release tag such as v0.2.1 instead of tracking main.

Recommended GitHub Topics

  • agent-skills
  • llm-agents
  • machine-learning
  • codex
  • claude-code
  • automl
  • tabular-ml
  • reproducible-ml
  • data-leakage
  • model-card

Non-Goals

  • No Web UI
  • No browser or Jupyter control
  • No full AutoML search
  • No deployment pipeline
  • No monitoring
  • No claim that generated models are production-ready

License

MIT
