Languages: English | 简体中文 | 日本語
Recommended repository description:
Agent Skills that help LLM coding agents run reproducible machine learning workflows.
ML Agent Skills helps LLM coding agents like Codex and Claude Code run reproducible machine learning workflows. This repository packages portable Agent Skills that make coding agents follow fixed workflows instead of improvising one-off scripts.
The flagship skill, tabular-ml-lab, helps agents run CSV binary classification workflows with data profiling, leakage checks, baseline training, metrics, model cards, and reports from a task.yaml file.
ml-agent-skills is not a production AutoML platform. It does not provide a SaaS backend, Web UI, notebook control, model deployment, monitoring, or production-readiness guarantees.
If you are looking for a way to use LLM agents for machine learning, this repository shows how Agent Skills can turn Codex or Claude Code into a workflow agent for reproducible tabular machine learning experiments.
LLM coding agents can write machine learning code quickly, but they can also train misleading models if they skip data profiling or leakage checks.
In the credit_risk dogfood run, the first task included a post-outcome field named defaulted_reason_code. Both baseline models appeared to score 1.0, which looked impressive but could not be trusted.
The workflow flagged defaulted_reason_code as high leakage before the result was trusted. After adding it to ignore_columns, the metrics returned to a more realistic baseline. That is the core idea of this project: let agents move quickly, but force the machine learning workflow through reviewable checks before anyone trusts the result.
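To make the leakage idea concrete, here is a minimal standard-library sketch of one possible screen: it measures how well a single column predicts the target on its own and flags columns that do so almost perfectly. This is an illustration, not the check tabular-ml-lab actually runs; the function names, the 0.98 threshold, and the toy rows are all assumptions.

```python
from collections import Counter, defaultdict


def single_feature_accuracy(values, targets):
    """Accuracy of predicting the target from one column alone,
    using the majority target observed for each distinct value."""
    by_value = defaultdict(Counter)
    for value, target in zip(values, targets):
        by_value[value][target] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(targets)


def flag_leakage(columns, targets, threshold=0.98):
    """Return names of columns whose single-feature accuracy meets the
    (hypothetical) threshold."""
    return [
        name
        for name, values in columns.items()
        if single_feature_accuracy(values, targets) >= threshold
    ]


# Toy data: the reason code is only filled in after a default happens.
targets = [1, 0, 1, 0, 0, 1, 0, 0]
columns = {
    "defaulted_reason_code": ["late", "none", "fraud", "none", "none", "late", "none", "none"],
    "region": ["n", "n", "s", "s", "n", "s", "n", "s"],
}

print(flag_leakage(columns, targets))  # ['defaulted_reason_code']
```

A screen this simple also flags high-cardinality columns such as row identifiers, so its output is a list of candidates for human review rather than a verdict.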

    skills/
      tabular-ml-lab/
        SKILL.md
        references/
        scripts/
        assets/
    src/
      agentic_tabular_ml/
    examples/
    docs/
    adapters/
    tests/

tabular-ml-lab supports:
- CSV input only
- Binary classification only
- Seeded local train/test split
- Baselines: Logistic Regression and RandomForestClassifier
- Leakage screening with leakage_policy: warn | fail
- Class balance warnings when accuracy may be misleading
- Holdout metrics, average precision, threshold analysis, feature importance, model card, and final report
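The seeded, stratified train/test split in the list above can be pictured with a small standard-library sketch. It is not the code the skill runs; the function name stratified_split and the rounding rule are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with roughly equal class proportions.

    The same seed always yields the same split.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# 8 positives and 32 negatives: the holdout keeps the 20% positive rate.
labels = [1] * 8 + [0] * 32
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 30 10
```

Fixing the seed is what makes a rerun comparable: two runs with the same task file evaluate on the same holdout rows.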
LLMs should not directly replace traditional machine learning models for tabular prediction tasks. They are not a reliable substitute for trained estimators, holdout validation, leakage checks, or metric review.
LLMs are more useful as workflow agents. Codex, Claude Code, and similar coding agents can read a task file, run deterministic scripts, inspect generated artifacts, summarize warnings, and help users iterate on a reproducible machine learning workflow.
ml-agent-skills gives those agents a fixed process. Instead of asking an LLM to invent a machine learning experiment from scratch, you can ask it to use tabular-ml-lab to run data profiling, leakage checks, baseline training, evaluation, model cards, and reports in a repeatable way.
Copy one of these prompts into Codex or Claude Code:
Use the tabular-ml-lab skill to run a reproducible baseline ML workflow on examples/churn/task.yaml. Generate artifacts under outputs/churn and summarize final_report.md.
I want to use an LLM agent for machine learning on a CSV dataset. Create a task.yaml, inspect the data, check leakage risks, train baseline models, and generate a model card and final report.
In Codex, use the tabular-ml-lab Agent Skill to run examples/credit_risk/task.yaml, then explain why accuracy may be misleading and which threshold_report.csv trade-offs matter.
In Claude Code, use tabular-ml-lab to train Logistic Regression and RandomForest baselines from my CSV task.yaml. Do not add new model backends; use the existing reproducible workflow and report artifacts.
This project uses uv to manage the Python environment.

    uv sync

Run the test suite:

    uv run pytest

Run the bundled churn example from the repository root:

    uv run atm run examples/churn/task.yaml --output outputs/churn

You can also use the project command name:

    uv run ml-agent-skills run examples/churn/task.yaml --output outputs/churn

Artifacts are written to the exact output directory passed with --output.
Every complete tabular-ml-lab run must generate:
- data_profile.md
- leakage_report.md
- model_comparison.csv
- metrics.json
- threshold_report.csv
- feature_importance.csv
- model_card.md
- final_report.md
- model.pkl
- run_manifest.json
- resolved_task.yaml
See docs/artifact-contract.md for the full artifact contract.
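An agent or reviewer could confirm that a run is complete with a check along these lines. This is a sketch only: the file names come from the list above, while the helper missing_artifacts is hypothetical and not part of the project's CLI.

```python
import tempfile
from pathlib import Path

# File names taken from the artifact list in this README.
REQUIRED_ARTIFACTS = [
    "data_profile.md",
    "leakage_report.md",
    "model_comparison.csv",
    "metrics.json",
    "threshold_report.csv",
    "feature_importance.csv",
    "model_card.md",
    "final_report.md",
    "model.pkl",
    "run_manifest.json",
    "resolved_task.yaml",
]


def missing_artifacts(output_dir):
    """Return the required artifact names that are absent from output_dir."""
    out = Path(output_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (out / name).is_file()]


# Demo against a temporary directory that holds only one artifact.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "metrics.json").write_text("{}")
    missing = missing_artifacts(tmp)

print(len(missing))  # 10
```

For a real run, pass the directory given to --output, for example missing_artifacts("outputs/churn"), and treat a non-empty result as an incomplete run.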
threshold_report.csv reports precision, recall, f1, and predicted positive rate across default probability thresholds from 0.1 to 0.9. It is intended to make default-threshold trade-offs easier to review, especially for imbalanced binary classification tasks.
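The kind of table described here can be sketched in plain Python as follows. The dictionary keys, the >= comparison at the threshold, and the toy scores are assumptions; the actual column names in threshold_report.csv may differ.

```python
def threshold_rows(y_true, y_prob, thresholds=None):
    """Compute precision, recall, f1 and predicted positive rate per threshold."""
    if thresholds is None:
        thresholds = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1 .. 0.9

    rows = []
    for t in thresholds:
        pred = [1 if p >= t else 0 for p in y_prob]
        tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append({
            "threshold": t,
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "predicted_positive_rate": sum(pred) / len(pred),
        })
    return rows


# Toy holdout: true labels and predicted positive-class probabilities.
y_true = [0, 0, 0, 1, 0, 1, 1, 0]
y_prob = [0.05, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.15]

rows = threshold_rows(y_true, y_prob)
for row in rows:
    print(row)
```

Reading down the rows shows the trade-off directly: lower thresholds raise recall and the predicted positive rate, while higher thresholds tend to raise precision at the cost of missed positives.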
Dataset paths are resolved relative to the task file. Output paths are resolved relative to the current working directory unless run.output_dir is absolute. CLI --output is treated as the exact output directory and overrides run.output_dir/run.name.
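These resolution rules can be sketched with pathlib. The function names are hypothetical, and the default of run.output_dir joined with run.name is an inference from the sentence above rather than confirmed behavior; the example paths assume a POSIX filesystem.

```python
from pathlib import Path


def resolve_dataset_path(task_file, data_path):
    """Dataset paths are relative to the directory holding the task file."""
    return Path(task_file).parent / data_path


def resolve_output_dir(run_output_dir, run_name, cwd, cli_output=None):
    """Output paths are relative to cwd unless already absolute.

    A CLI --output value is used as the exact directory; otherwise the
    (assumed) default is run.output_dir / run.name.
    """
    if cli_output is not None:
        return Path(cwd) / cli_output
    # Joining onto an absolute path discards cwd, so absolute
    # run.output_dir values are kept as they are.
    return Path(cwd) / run_output_dir / run_name


task_file = Path("/repo/examples/churn/task.yaml")
print(resolve_dataset_path(task_file, "churn.csv"))
print(resolve_output_dir("outputs", "churn_baseline", "/work"))
print(resolve_output_dir("outputs", "churn_baseline", "/work", cli_output="outputs/churn"))
```

The practical consequence is that the same task file can be run from any working directory without breaking the dataset path, while the output location follows where the command was launched.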

    run:
      name: churn_baseline
      output_dir: outputs
      random_seed: 42
    data:
      path: churn.csv
      format: csv
      target: churn
      ignore_columns:
        - customer_id
    task:
      type: binary_classification
      leakage_policy: warn
    split:
      test_size: 0.25
      stratify: true
    modeling:
      positive_label: 1
      selection_metric: roc_auc

leakage_policy defaults to warn. With warn, high severity leakage candidates are shown in the CLI summary, leakage_report.md, and final_report.md, but the workflow continues. With fail, high severity leakage candidates stop the workflow before model training. Medium and low severity leakage candidates are reported for review but do not trigger a policy failure.
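The warn and fail behavior can be summarized in a few lines of Python. This is a sketch of the rule as described, not the project's implementation; the candidate dictionary shape, the exception class, and the function name are assumptions.

```python
class LeakagePolicyError(RuntimeError):
    """Raised when policy 'fail' meets a high severity leakage candidate."""


def apply_leakage_policy(candidates, policy="warn"):
    """Return the high severity column names, or raise under policy 'fail'.

    Medium and low severity candidates never trigger a failure.
    """
    if policy not in ("warn", "fail"):
        raise ValueError(f"unknown leakage_policy: {policy!r}")
    high = [c["column"] for c in candidates if c["severity"] == "high"]
    if high and policy == "fail":
        raise LeakagePolicyError(f"high severity leakage candidates: {high}")
    return high


candidates = [
    {"column": "defaulted_reason_code", "severity": "high"},
    {"column": "account_age_days", "severity": "medium"},  # hypothetical column
]

# Under 'warn' the workflow continues and the caller reports the columns.
print(apply_leakage_policy(candidates, "warn"))  # ['defaulted_reason_code']
```

With policy "fail", the same call raises before any model is trained, which is the safer setting for unattended agent runs.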
The workflow reports class balance using positive_count, negative_count, positive_rate, and majority_class_baseline_accuracy. When the positive rate is below 0.2 or above 0.8, the CLI summary and reports warn that accuracy may be misleading.
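As a worked example of these fields, the sketch below computes them for a label list. The first four keys mirror the names in the paragraph above; the accuracy_may_mislead key and the function name are hypothetical.

```python
def class_balance(labels, positive_label=1):
    """Summarize class balance and flag when accuracy may be misleading."""
    total = len(labels)
    positive_count = sum(1 for y in labels if y == positive_label)
    negative_count = total - positive_count
    positive_rate = positive_count / total
    return {
        "positive_count": positive_count,
        "negative_count": negative_count,
        "positive_rate": positive_rate,
        # Accuracy of always predicting the more common class.
        "majority_class_baseline_accuracy": max(positive_count, negative_count) / total,
        # Hypothetical flag mirroring the 0.2 / 0.8 warning bounds.
        "accuracy_may_mislead": positive_rate < 0.2 or positive_rate > 0.8,
    }


# 10 positives out of 100 rows: predicting "negative" for everyone scores 0.9.
summary = class_balance([1] * 10 + [0] * 90)
print(summary)
```

In this example a model with 0.9 accuracy has learned nothing beyond the majority class, which is why the warning points reviewers toward average precision and the threshold report instead.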
If your GitHub CLI has the gh skill command available, you can install and pin the public skill release directly. gh skill may still be preview or unavailable in some environments; if it is not available, use the adapter scripts below.

    gh skill install howardxie-dev/ml-agent-skills tabular-ml-lab --agent codex --pin v0.2.1
    gh skill install howardxie-dev/ml-agent-skills tabular-ml-lab --agent claude-code --pin v0.2.1

Pinning to v0.2.1 is recommended for repeatable agent behavior. Avoid blindly tracking main for shared or long-running workflows.
Install the tabular-ml-lab skill into a Codex-compatible workspace:

    bash adapters/codex/install.sh /path/to/workspace

Install the skill into a Claude Code-compatible workspace:

    bash adapters/claude-code/install.sh /path/to/workspace

PowerShell installers are also available:

    powershell -ExecutionPolicy Bypass -File adapters\codex\install.ps1 -WorkspaceRoot C:\path\to\workspace
    powershell -ExecutionPolicy Bypass -File adapters\claude-code\install.ps1 -WorkspaceRoot C:\path\to\workspace

The adapter scripts copy only skills/tabular-ml-lab/. Python dependencies are managed by this repository's pyproject.toml and uv.lock.
This project runs local Python scripts. Review the skill files and scripts before installing or executing them in your workspace.
Do not use sensitive data unless you understand your agent, shell, and tool data-sharing settings. The workflow is local-first, but agents and connected tools may have their own logging, telemetry, or sharing behavior depending on your environment.
For repeatable installs, prefer a pinned release tag such as v0.2.1 instead of tracking main.
agent-skills, llm-agents, machine-learning, codex, claude-code, automl, tabular-ml, reproducible-ml, data-leakage, model-card
- No Web UI
- No browser or Jupyter control
- No full AutoML search
- No deployment pipeline
- No monitoring
- No claim that generated models are production-ready
MIT