English | 中文
AutoSkill4Doc is the standalone offline document-to-skill engine for this repository.
It turns documents into executable skills through explicit offline stages, while
keeping provenance, version history, and incremental re-run support.
Current pipeline:
document
-> ingest_document (DocumentRecord + TextUnit + StrictWindow)
-> extract_skills (SupportRecord + SkillDraft)
-> compile_skills (SkillSpec)
-> register_versions
-> registry + staged snapshots + final SkillBank store + visible domain/family/level skill tree
Core layers:
`DocumentRecord` -> `TextUnit` -> `StrictWindow` -> `SupportRecord` -> `SkillDraft` -> `SkillSpec`
- standalone CLI: `autoskill4doc ...` or `python -m AutoSkill4Doc ...`
- incremental skip by content hash
- section filtering + strict/recommended windowing
- document-level parallel extraction via `--extract-workers`, while each document still extracts windows sequentially and final outputs keep input document order
- in non-`--quiet` CLI runs, the CLI shows one live single-line multi-stage progress view: extraction reports document and window progress, while later stages report compile-group and register-skill progress
- transient extraction failures such as rate limits or short-lived overload retry with exponential backoff instead of dropping the document immediately; tune with `--extract-retries` and `--extract-retry-backoff-s`
- if a document still fails after retries, AutoSkill4Doc records that failure at the document level and continues the rest of the batch instead of aborting the whole extraction run
- extraction uses a two-stage window flow: first plan how many distinct skills the window contains, then expand each planned skill one by one, so multi-skill windows do not force several full skills to share one response budget
- by default, AutoSkill4Doc uses one compact outline-level LLM classification pass per document over rule-recalled heading candidates to assign section/subsection levels; use `--section-outline-mode rule` only when you explicitly want rules only
- dry-run and stage-by-stage execution
- support-backed provenance and change logs
- lifecycle-aware versioning: `candidate -> draft -> evaluating -> active -> watchlist -> deprecated -> retired`
- visible domain/family/level skill tree under the document skill library root
- incremental intermediate snapshots under `.runtime/intermediate_runs/<run_id>/` during non-dry-run builds
- persisted retrieval cache under `.runtime/store/document_skill_retrieval.json` for retrieval texts, vectors, and BM25 token index
- configurable skill taxonomy via built-in or custom YAML files, with `domain_type` supplied by the caller rather than predicted by the model
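The retry behaviour around transient extraction failures can be sketched as a generic helper. This is a hedged sketch: `TransientError` and `extract_with_retries` are illustrative names, not the real implementation, but the backoff shape follows the `--extract-retries` / `--extract-retry-backoff-s` flags.

```python
import time

class TransientError(Exception):
    """Stand-in for rate-limit / short-lived overload errors (illustrative)."""

def extract_with_retries(extract_fn, retries=3, backoff_s=1.0):
    """Retry a flaky extraction call with exponential backoff.

    Sleeps backoff_s, 2*backoff_s, 4*backoff_s, ... between attempts,
    mirroring --extract-retries / --extract-retry-backoff-s. If the final
    attempt still fails, the error propagates so the caller can record a
    document-level failure and continue with the rest of the batch.
    """
    for attempt in range(retries + 1):
        try:
            return extract_fn()
        except TransientError:
            if attempt == retries:
                raise
            time.sleep(backoff_s * (2 ** attempt))
```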
Input notes:
- text / markdown / json / jsonl are read directly
- `.doc` / `.docx` require local conversion tools such as `textutil`, `antiword`, or `catdoc`
- unsupported binary files such as images or PDFs are currently skipped rather than decoded as noisy text
- `generic` LLM / embedding backends require explicit `AUTOSKILL_GENERIC_LLM_URL` / `AUTOSKILL_GENERIC_EMBED_URL`
- generated visible-tree artifacts under `总技能/`, `一级技能/`, `二级技能/`, `微技能/`, `Family技能/`, and `references/` are skipped during ingest so exported skills are not re-extracted as source documents
Default library root:
<repo_root>/SkillBank/DocSkill/
Default registry root:
<store_root>/.runtime/document_registry/
Default intermediate run root:
<store_root>/.runtime/intermediate_runs/
Visible output tree:
<store_root>/
README.md
<domain_root>/
总技能/
SKILL.md
references/
domain_manifest.json
Family技能/
<family_name>/
总技能/
SKILL.md
references/
children_manifest.json
children_map.md
一级技能/
<skill_name>/
SKILL.md
references/
children_manifest.json
children_map.md
evidence.md
evidence_manifest.json
二级技能/
<skill_name>/
SKILL.md
references/
children_manifest.json
children_map.md
evidence.md
evidence_manifest.json
微技能/
<skill_name>/
SKILL.md
references/
evidence.md
evidence_manifest.json
Users/
docskill/
...
.runtime/
document_registry/
intermediate_runs/
store/
document_skill_retrieval.json
staging/
library_manifest.json
Storage layers under the same library root:
- `.runtime/document_registry/` - internal document/support/skill/version records
- `Users/<internal_user>/` - final AutoSkill local-store skills synced directly from reconciled `SkillSpec` records
- `<domain_root>/` - visible domain root, family root, and `一级技能 / 二级技能 / 微技能` projection built for browsing and export
These layers share one root, but they are not the same dataset. During long runs,
non-dry-run builds also write incremental snapshots under
.runtime/intermediate_runs/<run_id>/. If a run crashes after ingest or
extract, rerunning the same input/config automatically reuses the latest
unfinished run and resumes from the persisted stage snapshots instead of
starting extraction from scratch. Per-document extract failures are also
persisted under extract/documents/<doc_id>.json, so a long batch can finish
with partial success plus explicit failed-document records instead of failing
all-or-nothing.
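Under those conventions, the resume-time skip check could look roughly like the sketch below; only the `extract/documents/<doc_id>.json` layout comes from the text above, and the helper name is hypothetical.

```python
from pathlib import Path

def docs_to_extract(run_dir, doc_ids):
    """Return only documents with no persisted extract snapshot yet.

    Assumes each finished (or failed) document leaves one JSON record at
    extract/documents/<doc_id>.json inside the intermediate run directory.
    """
    done_dir = Path(run_dir) / "extract" / "documents"
    return [d for d in doc_ids if not (done_dir / f"{d}.json").exists()]
```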
The current implementation does not try to emit a whole visible skill tree in a single extraction step. It works in two layers:
- The document pipeline first produces canonical `SkillSpec` records:
  - `ingest` builds `StrictWindow`s:
    - markdown headings plus numbered / chapter-style headings such as `3`, `3.1`, `第3章`, `(一)`
    - window planning groups subsections under their top-level chapter, so `4.1 / 4.2 / 5.1`-style blocks are extracted under `4 ...` / `5 ...` root sections rather than treated as separate top-level units
    - hierarchy metadata (`heading_path`, `parent_heading`, `sibling_headings`, `subsection_headings`) is attached to each window
    - rules first recall candidate heading lines; when an LLM is available, AutoSkill4Doc can do one compact outline LLM pass per document to classify those candidates into section/subsection levels
    - long sections are pre-split before final window planning; the default `--max-section-chars` is `10000`
    - bibliography / reference-heavy sections are skipped before extraction
  - `extract` first plans how many skills a window contains and then expands each planned skill individually into `SupportRecord + SkillDraft`
  - `compile` turns drafts into `SkillSpec`
  - `register_versions` retrieves the top-k similar existing skills with hybrid embedding + BM25 scoring over metadata-rich skill text, then decides create / strengthen / revise / merge / split / unchanged before persisting registry state and lifecycle updates
    - retrieval texts, corpus vectors, and BM25 tokenized docs are persisted under `.runtime/store/document_skill_retrieval.json` and reused on the next run
- The visible domain/family tree is then projected for browsing/export:
  - if final store skills are available, the visible family skills are rebuilt from the reconciled `Users/<internal_user>/...` store results
  - registry records are still used to stitch `references/evidence.md` and `references/evidence_manifest.json`
  - one domain-root navigation skill is synthesized per `domain_root`
  - one family-root navigation skill is synthesized per `family_name`
  - level-1 / level-2 parent skills can also receive nested `children_manifest.json` and `children_map.md` when linked child skills exist
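The hybrid embedding + BM25 retrieval used by `register_versions` can be sketched as below. This is a simplified sketch: the tokenizer, score normalization, and blend weight `alpha` are assumptions, not the real scorer.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Plain BM25 over pre-tokenized documents."""
    n = len(docs_tokens)
    avg_len = (sum(len(d) for d in docs_tokens) / max(n, 1)) or 1.0
    df = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_top_k(query_vec, query_tokens, corpus_vecs, corpus_tokens, k=5, alpha=0.5):
    """Blend cosine similarity over embeddings with max-normalized BM25."""
    bm25 = bm25_scores(query_tokens, corpus_tokens)
    top = max(bm25) or 1.0
    scored = [
        (alpha * cosine(query_vec, v) + (1 - alpha) * (s / top), i)
        for i, (v, s) in enumerate(zip(corpus_vecs, bm25))
    ]
    return [i for _, i in sorted(scored, reverse=True)[:k]]
```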
This is intentional:
- the source of truth stays in document/support/skill registry layers
- the domain/family parent skills remain a navigation layer rather than raw truth
- the visible tree can be rebuilt after updates without manual drift
- the visible family tree now prefers final store-reconciled skills so it stays aligned with `Users/<internal_user>/...`
- nested parent skills can carry child routing sections without changing registry truth
To keep the visible layout stable, the most important flag is:
--family-name
It determines the family subtree under <domain_root>/Family技能/.
If `--profile-id` is omitted, AutoSkill4Doc derives one from the selected
taxonomy plus `family_name`. If `--taxonomy-axis` is omitted, the selected
taxonomy may provide a default axis label. If `--family-name` is omitted,
AutoSkill4Doc first tries taxonomy rule matching and then one constrained LLM
classification pass over the configured `family_candidates`.
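Judging from the `retrieve-hierarchy` CLI example elsewhere in this README (`--profile-id psychology::认知行为疗法`), the derived profile id appears to join the taxonomy id and family name with `::`. A sketch of that assumption (the helper name is hypothetical):

```python
def derive_profile_id(taxonomy_id: str, family_name: str) -> str:
    """Assumed derivation matching the psychology::认知行为疗法 form
    seen in the CLI examples; the real rule may differ."""
    return f"{taxonomy_id}::{family_name}"
```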
`--user-id` is treated as an internal store-routing detail and is no longer
part of the normal documented workflow. If omitted, AutoSkill4Doc uses the
neutral internal user id `docskill`.
AutoSkill4Doc keeps a small stable internal asset_type set for compile/versioning:
`macro_protocol`, `session_skill`, `micro_skill`, `safety_rule`, `knowledge_reference`
On top of that, extraction can load a configurable skill taxonomy:
- built-in via `--domain-type psychology` / `--domain-type chemistry`
- custom via `--skill-taxonomy /path/to/taxonomy.yaml`

Important:
- `domain_type` is caller-provided configuration, not model output
- the model still returns the stable internal `asset_type`
- taxonomy files provide:
  - domain-specific labels, aliases, guidance, and hierarchy nodes
  - optional default `family_name`
  - optional default `taxonomy_axis`
  - optional family candidates for future constrained family resolution
  - optional `asset_tree` and `visible_levels` for multi-level family trees
Example:
python3 -m AutoSkill4Doc llm-extract \
--file ./chem_docs \
--domain chemistry \
--domain-type chemistry \
--family-name "分析化学" \
--skill-taxonomy ./custom-taxonomy.yaml \
--store-path ./SkillBank/DocSkill

AutoSkill4Doc currently uses three configuration layers:
- Built-in taxonomy files
- One optional user taxonomy file passed via `--skill-taxonomy`
- Runtime CLI arguments and provider env vars
How taxonomy loading works:
- `default.yaml` is always loaded first
- if `--domain-type psychology/chemistry` is set, the matching built-in file is overlaid on top of `default.yaml`
- if `--skill-taxonomy /path/to/file.yaml` is set, that file is used as the overlay instead of the built-in domain file
- the final resolved values then feed:
  - extraction prompt guidance
  - `asset_type` alias normalization
  - `asset_node_id` hierarchy-node normalization
  - default `family_name`
  - default `taxonomy_axis`
  - auto-derived `profile_id`
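The overlay behaviour can be approximated by a recursive mapping merge, shown below as a sketch; how the real loader combines lists such as `family_candidates` is an assumption (here an overlay list simply replaces the base list).

```python
def overlay_taxonomy(base: dict, overlay: dict) -> dict:
    """Overlay one taxonomy mapping onto default.yaml values.

    Nested mappings merge key-by-key; any other overlay value
    (including lists) replaces the base value outright.
    """
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay_taxonomy(merged[key], value)
        else:
            merged[key] = value
    return merged
```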
Important taxonomy fields:
- `taxonomy_id`: stable taxonomy id used when deriving `profile_id`
- `domain_type`: the externally supplied domain type name
- `display_name`: human-readable label
- `default_base_type`: fallback internal `asset_type`
- `family_axis`: default visible axis label, such as `疗法` or `实验路线`
- `default_family_name`: fallback visible family name
- `family_candidates`: optional constrained candidate set for future family resolution
- `asset_types`: domain labels mapped back to stable internal base types
- `visible_levels`: visible labels such as `总技能 / 一级技能 / 二级技能 / 微技能`
- `asset_tree`: configuration-driven hierarchy nodes; the model may emit `asset_node_id`, and the pipeline uses it to constrain same-level merge and later parent-child linking
asset_types and asset_tree are intentionally different layers:
- `asset_types` define stable coarse types such as `session_skill` or `micro_skill`
- `asset_tree` defines the visible hierarchy and parent/child layout such as `family_root -> 一级技能 -> 二级技能 -> 微技能`
Minimal taxonomy example:
taxonomy_id: psychology
domain_type: psychology
display_name: Psychology
default_base_type: session_skill
family_axis: 疗法
default_family_name: 通用心理咨询
family_candidates:
  - id: cbt
    name: CBT(认知行为疗法)
    aliases: ["CBT", "认知行为疗法", "cognitive behavioral therapy"]
visible_levels:
  root_label: 总技能
  level_labels:
    "1": 一级技能
    "2": 二级技能
    "3": 微技能
asset_types:
  - base_type: session_skill
    label: session_intervention
    description: One counseling workflow or session scaffold.
    aliases: ["session_intervention", "session_skill"]
asset_tree:
  - id: session_framework
    label_zh: 二级技能
    label_en: Second-Level Skill
    base_type: session_skill
    level: 2
    parent: treatment_framework
    visible_role: parent
    default_for_base_type: true

Other configuration sources:
- `AutoSkill4Doc/core/config.py` - code defaults such as default store path, runtime path, extract strategy, section pre-split size, and section-outline fallback mode
- `AutoSkill4Doc/core/provider_config.py` - provider/env resolution for `dashscope`, `glm`, `openai`, `anthropic`, and `generic`
Provider config is environment-variable based rather than file-based. Common examples:
- DashScope: `DASHSCOPE_API_KEY`, optional `DASHSCOPE_MODEL`, `DASHSCOPE_EMBED_MODEL`
- GLM: `ZHIPUAI_API_KEY` or `BIGMODEL_API_KEY`
- Generic backend: `AUTOSKILL_GENERIC_LLM_URL`, `AUTOSKILL_GENERIC_EMBED_URL`
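A minimal sketch of env-based resolution for the GLM case; the precedence when both variables are set is an assumption, and the helper name is hypothetical.

```python
import os

def resolve_glm_key(env=None):
    """Pick up the GLM credential from either supported variable.

    Assumes ZHIPUAI_API_KEY wins when both are set; returns None when
    neither is present so callers can fail fast instead of mocking.
    """
    env = os.environ if env is None else env
    return env.get("ZHIPUAI_API_KEY") or env.get("BIGMODEL_API_KEY")
```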
Important:
- user-facing extraction commands (`build`, `llm-extract`, `extract`, `compile`, `diag`) no longer silently fall back to `mock`
- if no real provider can be resolved, AutoSkill4Doc exits with an error instead of running with test-only mock behavior
- `mock` remains available only for development/testing paths
Resolution priority:
- explicit CLI argument
- custom taxonomy file
- built-in taxonomy file
- code default in `core/config.py`
- provider env vars for backend credentials and endpoint URLs
Parsing / hierarchy knobs:
- `--max-section-chars`
  - pre-splits one oversized detected section before final window construction
  - default: `10000`
- `--section-outline-mode llm|rule`
  - `llm`: default; one compact outline LLM classification pass over candidate headings per document, while rules only recall candidates and provide fallback
  - `rule`: disable outline LLM classification completely and rely on rules only
  - compatibility aliases: `auto -> llm`, `off -> rule`
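The pre-split knob can be sketched as a simple chunker. Splitting at blank-line paragraph boundaries is an assumed heuristic, not necessarily what `--max-section-chars` does internally.

```python
def presplit_section(text: str, max_chars: int = 10000):
    """Split one oversized section into chunks of at most max_chars,
    preferring paragraph (blank-line) boundaries; an assumed heuristic."""
    if len(text) <= max_chars:
        return [text]
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # a single paragraph longer than max_chars gets hard-cut
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks
```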
For the current MVP, yes:
- extraction and visible layout generation are decoupled
- parent/child directories are a projection, not the only truth layer
- rebuilding one family directory from final store output plus registry evidence avoids manual drift
But this is still simpler than the full paper target:
- the domain/family parent skills are currently synthesized from active child skills
- the visible tree already matches the target directory shape
- the full `single-document standardization + canonical merge + parent synthesis` quality pipeline is not fully implemented yet
- `.runtime/document_registry/` may still contain more internal skill records than the final `Users/<internal_user>/...` store and the visible `<domain_root>/Family技能/<family_name>/...` tree
Stored entities:
`documents`, `supports`, `skills`, `lifecycles`, `version_history`, `change_logs`, `provenance_links`
Standalone CLI:
python3 -m AutoSkill4Doc build --file ./paper.md --dry-run
python3 -m AutoSkill4Doc llm-extract --file ./cbt_docs --family-name "认知行为疗法"
python3 -m AutoSkill4Doc ingest --file ./docs/ --json
python3 -m AutoSkill4Doc extract --file ./paper.md --json
autoskill4doc compile --file ./paper.md --json
python3 -m AutoSkill4Doc diag --file ./paper.md --report-path ./diag.jsonl --json
python3 -m AutoSkill4Doc retrieve-hierarchy --store-path ./SkillBank/DocSkill --profile-id psychology::认知行为疗法 --family-name "认知行为疗法" --json
python3 -m AutoSkill4Doc canonical-merge --store-path ./SkillBank/DocSkill --family-name "认知行为疗法" --json
python3 -m AutoSkill4Doc migrate-layout --store-path ./SkillBank/DocSkill --json
python3 -m AutoSkill4Doc build \
--file ./cbt_docs/ \
--domain psychology \
--domain-type psychology \
--family-name "认知行为疗法" \
--store-path ./SkillBank/DocSkill

Notes:
- `dry-run` runs ingest/extract/compile for inspection but does not write final registry/store/visible-tree results. `diag` always runs in non-persisting dry-run mode.
- non-`dry-run` `build` / `llm-extract` writes ingest/extract/compile/register snapshots to `.runtime/intermediate_runs/<run_id>/`.
- the same non-`dry-run` build also persists retrieval texts/vectors/BM25 tokens under `.runtime/store/document_skill_retrieval.json`.
- rerunning the same non-`dry-run` build input/config resumes the latest unfinished intermediate run automatically; already extracted documents are not extracted again.
- `retrieve-hierarchy` now opens the family directly when the library contains only one visible family.
- `canonical-merge` currently inspects staged results. When staging contains one unique bucket, it can infer `profile_id`, `family_name`, and `child_type`; otherwise pass them explicitly.
- when `--family-name` is omitted, family resolution first uses configured aliases/keywords, then one constrained LLM classification pass, and finally falls back to the taxonomy's `default_family_name` instead of the first configured candidate.
- when one batch contains documents that strongly match different configured families, the batch-level fallback becomes the taxonomy default family; document-level family metadata is still preserved on extracted skills so later versioning and visible-tree projection can keep specific family signals.
from AutoSkill4Doc import extract_from_doc
result = extract_from_doc(
sdk=sdk,
file_path="./paper.md",
domain="psychology",
domain_type="psychology",
family_name="认知行为疗法",
dry_run=True,
)

Stage-level orchestration:
from AutoSkill4Doc.pipeline import build_default_document_pipeline
pipeline = build_default_document_pipeline(sdk=sdk)
ingest = pipeline.ingest_document(file_path="./paper.md", dry_run=True)
print(len(ingest.windows))
extracted = pipeline.extract_skills(documents=ingest.documents, windows=ingest.windows)
compiled = pipeline.compile_skills(
skill_drafts=extracted.skill_drafts,
support_records=extracted.support_records,
)

Module layout:
- `extract.py` / `__main__.py`: standalone package CLI + API entrypoint
- `pipeline.py`: staged orchestration
- `ingest.py`: document normalization and incremental checks
- `taxonomy.py`: built-in/custom skill taxonomy loading plus family/domain hierarchy metadata
- `family_resolver.py`: constrained family classification over configured `family_candidates`
- `document/file_loader.py`: directory/file loading, conversion fallback, and generated-artifact skipping
- `document/windowing.py`: section filtering and strict/recommended window construction
- `stages/extractor.py`: `DocumentRecord -> SupportRecord[] + SkillDraft[]`
- `stages/compiler.py`: `SkillDraft[] -> SkillSpec[]`
- `stages/diag.py`: dry-run diagnostic reporting over extraction windows
- `stages/hierarchy.py`: manifest-first visible hierarchy browse/search
- `stages/merge.py`: staging-backed canonical merge inspection
- `stages/migrate.py`: safe runtime layout preparation
- `store/versioning.py`: skill-centric version/lifecycle reconciliation
- `store/registry.py`: filesystem registry persistence
- `store/visible_tree.py`: visible `领域总技能 / Family技能 / 一级技能 / 二级技能 / 微技能 / references` export, rebuilt from final store skills plus registry evidence
- `store/intermediate.py`: incremental per-run ingest/extract/compile/register snapshots
- `store/layout.py`: shared visible/runtime path conventions
- `store/staging.py`: canonical-merge staging payload helpers
- `core/config.py`: standalone AutoSkill4Doc defaults and paths
- `core/provider_config.py`: standalone provider configuration helpers
- `models.py`: core data models
- `prompts.py`: offline prompt builders plus runtime prompt switching
- This module replaced the old `autoskill/offline/document/` implementation.
- Document extraction now runs through `AutoSkill4Doc` directly rather than `autoskill/offline`.
- `python -m AutoSkill4Doc` is the recommended module entry; `extract.py` is kept as the implementation module behind that CLI.