
feat(agent): interaction quality ops & recipe, bad-case HTML report, and robust JSONL / HF meta loading #957

Open
yxdyc wants to merge 39 commits into main from agent_op_dev

Conversation


@yxdyc yxdyc commented Mar 25, 2026

Summary
This branch adds an agent / dialog interaction quality path (mappers, signals, optional LLM insight), a bad-case HTML report and supporting demos/tooling, and hardens JSONL loading and Hugging Face datasets usage for large integers / mixed folders. upstream/main is merged in so the branch is up to date for landing on main.

Agent & bad-case

  • End-to-end recipe and minimal_configs smoke paths for agent-facing pipelines.
  • Bad-case triage: structured signals, tiers, cohort-oriented reporting; HTML report (charts, drill-down, signal attribution).
  • Dialog / trace quality LLM mappers and related tests; multi-turn tool handling, history caps, and prompt truncation where relevant.
  • Report UX: macro distributions (tools, skills, intent/topic/sentiment), optional word clouds; insight sections omit PII-audit / redaction-heavy rows to avoid misleading narrative in excerpts (exports/drill-down unchanged).
  • agent_skill_insight_mapper: prompt tuned for short, concrete capability phrases (≈10 CJK chars / 4–8 English words) instead of vague “read/write / process” tags.

Data loading & config

  • Lenient JSONL: stdlib json, skip bad lines; avoid falling back to the HF loader when a directory mixes .json and JSONL files.
  • HF datasets JSONL: optional stdlib json patch to avoid ujson “Value is too big” and related large integer issues.
  • Config / CLI: align with the upstream build_base_parser() refactor after merging upstream/main.
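The lenient JSONL path in the first bullet can be sketched roughly as below. The function name and warning format are illustrative, not the actual data-juicer implementation; the point is that stdlib json (unlike ujson) handles arbitrarily large integers and that malformed lines are skipped rather than failing the load.

```python
import json
import warnings

def iter_jsonl_lenient(path):
    """Yield records from a JSONL file, skipping lines stdlib json rejects."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)  # stdlib json handles big ints fine
            except json.JSONDecodeError:
                warnings.warn(f"{path}:{lineno}: skipping malformed JSONL line")
```

In the branch, records from such a generator are streamed into `Dataset.from_generator` instead of going through the HF JSON loader.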

Tests & docs

  • Tests for agent ops, bad-case/report smoke, dataset loading, and locale/prompt helpers where applicable.
  • Demos/agent README / maintainer notes, root README quick link, YAML / insights docs updates.

yxdyc and others added 30 commits March 16, 2026 15:44
Made-with: Cursor
…igs conflicts (keep qwen-turbo + full 07 entity config)

Made-with: Cursor
* preliminary test of minimal_configs (01 to 05)

Made-with: Cursor

* test of minimal_configs (06 to 08), optimize some ops

Made-with: Cursor

* conflicts resolved and Gemini's suggestion adopted

Made-with: Cursor
…ested on small-scale samples (#949)

* preliminary test of minimal_configs (01 to 05)

Made-with: Cursor

* test of minimal_configs (06 to 08), optimize some ops

Made-with: Cursor

* conflicts resolved and Gemini's suggestion adopted

Made-with: Cursor

* end-to-end yaml, analysis toolchain, ui developed; tested on small-scale samples

Made-with: Cursor
…truncation

- Normalize LLM recommendation to list[str] in parse_output (fixes HF datasets shard align)
- agent_dialog_normalize_mapper: configurable history caps, head+tail write-back, meta flag
- dialog_* mappers: shared max_*_chars_for_prompt via dialog_llm_input_utils
- Recipe/docs: agent_interaction_quality_analysis, PERFORMANCE_LLM, BAD_CASE_INSIGHTS
- Tests: agent_dialog_normalize_mapper, llm_analysis_filter parse_output
- build_op_doc: exclude dialog_llm_input_utils helper; video_camera_pose droid_args docstring

Made-with: Cursor
…er + evidence, dialog history caps (#950)

* preliminary test of minimal_configs (01 to 05)

Made-with: Cursor

* test of minimal_configs (06 to 08), optimize some ops

Made-with: Cursor

* conflicts resolved and Gemini's suggestion adopted

Made-with: Cursor

* end-to-end yaml, analysis toolchain, ui developed; tested on small-scale samples

Made-with: Cursor

* fix(agent): multi-turn tool dialog, bad-case gating, report zh-tier + evidence

Made-with: Cursor

* fix(agent): llm_quality record schema + dialog history caps & prompt truncation

- Normalize LLM recommendation to list[str] in parse_output (fixes HF datasets shard align)
- agent_dialog_normalize_mapper: configurable history caps, head+tail write-back, meta flag
- dialog_* mappers: shared max_*_chars_for_prompt via dialog_llm_input_utils
- Recipe/docs: agent_interaction_quality_analysis, PERFORMANCE_LLM, BAD_CASE_INSIGHTS
- Tests: agent_dialog_normalize_mapper, llm_analysis_filter parse_output
- build_op_doc: exclude dialog_llm_input_utils helper; video_camera_pose droid_args docstring

Made-with: Cursor
Bad-case report (generate_bad_case_report.py):
- CJK fonts for matplotlib/body; bar labels; section order (charts → insights → cases)
- LLM page-top summary: compact digest, shorter prompt/tokens/timeout; default qwen3.5-plus
- Drilldown: page cap + sidecar *_drilldown_full.jsonl; copy and nav tweaks
- Richer agent_insight_llm cards; rule-based fallback summary

agent_dialog_normalize_mapper:
- Stable HF Arrow meta: always agent_dialog_history_compressed bool; list[str] placeholders
  for empty tool/skill types; filter falsy in tool_type_mapper and skill_insight_mapper

Pipeline: run_bad_case_pipeline report uses argv array safe under set -u; BAD_CASE_REPORT_LLM=1

Tests + recipe yaml aligned.

Made-with: Cursor
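The "argv array safe under set -u" note above refers to a common bash pitfall: expanding an empty array as `"${args[@]}"` is an unbound-variable error on bash older than 4.4. A minimal sketch of the guard, with placeholder flag and file names (the real option names are not shown in this PR):

```shell
#!/usr/bin/env bash
set -u
# Collect optional report flags in an array; with set -u, expanding an empty
# array as "${args[@]}" errors on bash < 4.4, so use the ${args[@]+...} guard.
args=()
if [ "${BAD_CASE_REPORT_LLM:-0}" = "1" ]; then
  args+=(--llm-summary)  # hypothetical flag; the real option name is not shown
fi
echo "optional args: ${#args[@]}"
# stand-in for: python generate_bad_case_report.py merged.jsonl "${args[@]}"
echo generate_bad_case_report.py merged.jsonl ${args[@]+"${args[@]}"}
```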
#951)

* preliminary test of minimal_configs (01 to 05)

Made-with: Cursor

* test of minimal_configs (06 to 08), optimize some ops

Made-with: Cursor

* conflicts resolved and Gemini's suggestion adopted

Made-with: Cursor

* end-to-end yaml, analysis toolchain, ui developed; tested on small-scale samples

Made-with: Cursor

* fix(agent): multi-turn tool dialog, bad-case gating, report zh-tier + evidence

Made-with: Cursor

* fix(agent): llm_quality record schema + dialog history caps & prompt truncation

- Normalize LLM recommendation to list[str] in parse_output (fixes HF datasets shard align)
- agent_dialog_normalize_mapper: configurable history caps, head+tail write-back, meta flag
- dialog_* mappers: shared max_*_chars_for_prompt via dialog_llm_input_utils
- Recipe/docs: agent_interaction_quality_analysis, PERFORMANCE_LLM, BAD_CASE_INSIGHTS
- Tests: agent_dialog_normalize_mapper, llm_analysis_filter parse_output
- build_op_doc: exclude dialog_llm_input_utils helper; video_camera_pose droid_args docstring

Made-with: Cursor

* feat(agent): bad-case HTML report UX + HF meta stability for normalize

Bad-case report (generate_bad_case_report.py):
- CJK fonts for matplotlib/body; bar labels; section order (charts → insights → cases)
- LLM page-top summary: compact digest, shorter prompt/tokens/timeout; default qwen3.5-plus
- Drilldown: page cap + sidecar *_drilldown_full.jsonl; copy and nav tweaks
- Richer agent_insight_llm cards; rule-based fallback summary

agent_dialog_normalize_mapper:
- Stable HF Arrow meta: always agent_dialog_history_compressed bool; list[str] placeholders
  for empty tool/skill types; filter falsy in tool_type_mapper and skill_insight_mapper

Pipeline: run_bad_case_pipeline report uses argv array safe under set -u; BAD_CASE_REPORT_LLM=1

Tests + recipe yaml aligned.

Made-with: Cursor
- Add DATA_JUICER_USE_STDLIB_JSON env patch in init_configs
- Document workaround in config_all.yaml and DatasetCfg guides

Made-with: Cursor
* preliminary test of minimal_configs (01 to 05)

Made-with: Cursor

* test of minimal_configs (06 to 08), optimize some ops

Made-with: Cursor

* conflicts resolved and Gemini's suggestion adopted

Made-with: Cursor

* end-to-end yaml, analysis toolchain, ui developed; tested on small-scale samples

Made-with: Cursor

* fix(agent): multi-turn tool dialog, bad-case gating, report zh-tier + evidence

Made-with: Cursor

* fix(agent): llm_quality record schema + dialog history caps & prompt truncation

- Normalize LLM recommendation to list[str] in parse_output (fixes HF datasets shard align)
- agent_dialog_normalize_mapper: configurable history caps, head+tail write-back, meta flag
- dialog_* mappers: shared max_*_chars_for_prompt via dialog_llm_input_utils
- Recipe/docs: agent_interaction_quality_analysis, PERFORMANCE_LLM, BAD_CASE_INSIGHTS
- Tests: agent_dialog_normalize_mapper, llm_analysis_filter parse_output
- build_op_doc: exclude dialog_llm_input_utils helper; video_camera_pose droid_args docstring

Made-with: Cursor

* feat(agent): bad-case HTML report UX + HF meta stability for normalize

Bad-case report (generate_bad_case_report.py):
- CJK fonts for matplotlib/body; bar labels; section order (charts → insights → cases)
- LLM page-top summary: compact digest, shorter prompt/tokens/timeout; default qwen3.5-plus
- Drilldown: page cap + sidecar *_drilldown_full.jsonl; copy and nav tweaks
- Richer agent_insight_llm cards; rule-based fallback summary

agent_dialog_normalize_mapper:
- Stable HF Arrow meta: always agent_dialog_history_compressed bool; list[str] placeholders
  for empty tool/skill types; filter falsy in tool_type_mapper and skill_insight_mapper

Pipeline: run_bad_case_pipeline report uses argv array safe under set -u; BAD_CASE_REPORT_LLM=1

Tests + recipe yaml aligned.

Made-with: Cursor

* fix: optional stdlib json for HF datasets JSONL (ujson Value too big)

- Add DATA_JUICER_USE_STDLIB_JSON env patch in init_configs
- Document workaround in config_all.yaml and DatasetCfg guides

Made-with: Cursor
- Add load_jsonl_lenient config and DATA_JUICER_JSONL_LENIENT env
- Stream jsonl-only inputs via Dataset.from_generator; document in DatasetCfg
- Add unit tests for jsonl_lenient_loader

Made-with: Cursor
* preliminary test of minimal_configs (01 to 05)

Made-with: Cursor

* test of minimal_configs (06 to 08), optimize some ops

Made-with: Cursor

* conflicts resolved and Gemini's suggestion adopted

Made-with: Cursor

* end-to-end yaml, analysis toolchain, ui developed; tested on small-scale samples

Made-with: Cursor

* fix(agent): multi-turn tool dialog, bad-case gating, report zh-tier + evidence

Made-with: Cursor

* fix(agent): llm_quality record schema + dialog history caps & prompt truncation

- Normalize LLM recommendation to list[str] in parse_output (fixes HF datasets shard align)
- agent_dialog_normalize_mapper: configurable history caps, head+tail write-back, meta flag
- dialog_* mappers: shared max_*_chars_for_prompt via dialog_llm_input_utils
- Recipe/docs: agent_interaction_quality_analysis, PERFORMANCE_LLM, BAD_CASE_INSIGHTS
- Tests: agent_dialog_normalize_mapper, llm_analysis_filter parse_output
- build_op_doc: exclude dialog_llm_input_utils helper; video_camera_pose droid_args docstring

Made-with: Cursor

* feat(agent): bad-case HTML report UX + HF meta stability for normalize

Bad-case report (generate_bad_case_report.py):
- CJK fonts for matplotlib/body; bar labels; section order (charts → insights → cases)
- LLM page-top summary: compact digest, shorter prompt/tokens/timeout; default qwen3.5-plus
- Drilldown: page cap + sidecar *_drilldown_full.jsonl; copy and nav tweaks
- Richer agent_insight_llm cards; rule-based fallback summary

agent_dialog_normalize_mapper:
- Stable HF Arrow meta: always agent_dialog_history_compressed bool; list[str] placeholders
  for empty tool/skill types; filter falsy in tool_type_mapper and skill_insight_mapper

Pipeline: run_bad_case_pipeline report uses argv array safe under set -u; BAD_CASE_REPORT_LLM=1

Tests + recipe yaml aligned.

Made-with: Cursor

* fix: optional stdlib json for HF datasets JSONL (ujson Value too big)

- Add DATA_JUICER_USE_STDLIB_JSON env patch in init_configs
- Document workaround in config_all.yaml and DatasetCfg guides

Made-with: Cursor

* feat: lenient JSONL load (stdlib json, skip bad lines)

- Add load_jsonl_lenient config and DATA_JUICER_JSONL_LENIENT env
- Stream jsonl-only inputs via Dataset.from_generator; document in DatasetCfg
- Add unit tests for jsonl_lenient_loader

Made-with: Cursor
Mixed extensions previously forced HuggingFace JSON loader and ujson
(Value too big). Now only jsonl* shards are read; others are skipped
with warnings. Log line [lenient jsonl] ACTIVE confirms the path.

Made-with: Cursor
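The shard-selection rule described above ("only jsonl* shards are read; others are skipped with warnings") can be sketched as follows. This assumes "jsonl*" means the filename contains a `.jsonl` suffix component (covering compressed variants); the function name and warning text are illustrative, not the actual fix.

```python
import warnings
from pathlib import Path

def select_jsonl_shards(folder):
    """Return only the jsonl* shards in a folder; warn about everything else."""
    kept = []
    for p in sorted(Path(folder).iterdir()):
        if not p.is_file():
            continue
        if ".jsonl" in p.name:  # .jsonl and compressed variants like .jsonl.gz
            kept.append(p)
        else:
            warnings.warn(f"[lenient jsonl] skipping non-JSONL file: {p.name}")
    return kept
```

Filtering before loading is what prevents a stray `.json` file in the folder from forcing everything through the HF JSON loader and its ujson limits.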
)

* preliminary test of minimal_configs (01 to 05)

Made-with: Cursor

* test of minimal_configs (06 to 08), optimize some ops

Made-with: Cursor

* conflicts resolved and Gemini's suggestion adopted

Made-with: Cursor

* end-to-end yaml, analysis toolchain, ui developed; tested on small-scale samples

Made-with: Cursor

* fix(agent): multi-turn tool dialog, bad-case gating, report zh-tier + evidence

Made-with: Cursor

* fix(agent): llm_quality record schema + dialog history caps & prompt truncation

- Normalize LLM recommendation to list[str] in parse_output (fixes HF datasets shard align)
- agent_dialog_normalize_mapper: configurable history caps, head+tail write-back, meta flag
- dialog_* mappers: shared max_*_chars_for_prompt via dialog_llm_input_utils
- Recipe/docs: agent_interaction_quality_analysis, PERFORMANCE_LLM, BAD_CASE_INSIGHTS
- Tests: agent_dialog_normalize_mapper, llm_analysis_filter parse_output
- build_op_doc: exclude dialog_llm_input_utils helper; video_camera_pose droid_args docstring

Made-with: Cursor

* feat(agent): bad-case HTML report UX + HF meta stability for normalize

Bad-case report (generate_bad_case_report.py):
- CJK fonts for matplotlib/body; bar labels; section order (charts → insights → cases)
- LLM page-top summary: compact digest, shorter prompt/tokens/timeout; default qwen3.5-plus
- Drilldown: page cap + sidecar *_drilldown_full.jsonl; copy and nav tweaks
- Richer agent_insight_llm cards; rule-based fallback summary

agent_dialog_normalize_mapper:
- Stable HF Arrow meta: always agent_dialog_history_compressed bool; list[str] placeholders
  for empty tool/skill types; filter falsy in tool_type_mapper and skill_insight_mapper

Pipeline: run_bad_case_pipeline report uses argv array safe under set -u; BAD_CASE_REPORT_LLM=1

Tests + recipe yaml aligned.

Made-with: Cursor

* fix: optional stdlib json for HF datasets JSONL (ujson Value too big)

- Add DATA_JUICER_USE_STDLIB_JSON env patch in init_configs
- Document workaround in config_all.yaml and DatasetCfg guides

Made-with: Cursor

* feat: lenient JSONL load (stdlib json, skip bad lines)

- Add load_jsonl_lenient config and DATA_JUICER_JSONL_LENIENT env
- Stream jsonl-only inputs via Dataset.from_generator; document in DatasetCfg
- Add unit tests for jsonl_lenient_loader

Made-with: Cursor

* fix(lenient jsonl): do not fall back to HF when folder mixes .json

Mixed extensions previously forced HuggingFace JSON loader and ujson
(Value too big). Now only jsonl* shards are read; others are skipped
with warnings. Log line [lenient jsonl] ACTIVE confirms the path.

Made-with: Cursor
* use list type for the arg to avoid ckpt failure
…ests

- Add dialog_* LLM axis mappers, trace coherence, tool relevance, PII suspect
- agent_output_locale; extend bad-case signals, insight & usage/tool mappers
- generate_bad_case_report: TOC/sidebar, insight↔drill links, snapshot operator row
- Recipe/docs/Operators.md/pyproject; mapper & locale tests
- build_op_doc: exclude dialog_quality_llm_utils (helper, not an OP)

Made-with: Cursor
- HTML report: macro distributions (tools, skills, intent/topic/sentiment) with bar charts and optional word clouds; TOC and chart section wiring.

- Omit PII audit / redaction–related samples from high_precision and watchlist insight excerpts (drilldown/export unchanged).

- agent_skill_insight_mapper: prompt asks for concrete ~10-char (zh) / 4–8-word (en) capability phrases; forbid vague read/write–style tags.

- Docs: root README link to demos/agent; maintainer checklist in demos/agent README; YAML/minimal_configs notes.

- Tests: generate_bad_case_report smoke (PII omission); agent_skill_insight prompt assertions.

Made-with: Cursor
upstream: https://github.com/datajuicer/data-juicer.git

Resolve config.py conflict by using build_base_parser(); take upstream uv.lock (validated with uv lock --check).

Made-with: Cursor
- agent_skill_insight_mapper: split labels on CN/EN separators (commas, 、, and semicolons)
- generate_bad_case_report: mirror split in macro stats (no re-run required)
- Optional semantic clustering for insight headlines/audit (scikit-learn)
- Insight model tabs: default full model id, family mode flag; order by batch volume
- Stack request_model chart: Top 5 by requests + merged remainder bar
- Extend model family hints (Kimi, GLM, MiniMax); tab/chart copy updates
- Smoke tests: PII omission, skill-insight macro split; --no-insight-semantic-cluster in PII run

Made-with: Cursor
tier = "high_precision"
elif len(mediums) >= self.min_medium_signals_for_watchlist:
tier = "watchlist"
elif len(signals) == 1 and signals[0].get("weight") == "medium":
When self.min_medium_signals_for_watchlist == 2, a sample with a single medium signal will be added to the watchlist. Not sure if it is intended.
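The quoted branching can be reduced to a small sketch that reproduces the edge case the reviewer flags. The signal partitioning and the "normal" fallback tier are assumptions for illustration, not the actual mapper code:

```python
def assign_tier(signals, min_medium_signals_for_watchlist=2):
    """Sketch of the quoted tier logic; dicts carry a 'weight' key."""
    highs = [s for s in signals if s.get("weight") == "high"]
    mediums = [s for s in signals if s.get("weight") == "medium"]
    if highs:
        tier = "high_precision"
    elif len(mediums) >= min_medium_signals_for_watchlist:
        tier = "watchlist"
    elif len(signals) == 1 and signals[0].get("weight") == "medium":
        # fires even when the medium-signal threshold is 2 -- this is
        # the behavior the review comment questions
        tier = "watchlist"
    else:
        tier = "normal"
    return tier
```

So a lone medium signal bypasses `min_medium_signals_for_watchlist`, while two signals of which only one is medium do not.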

p = sum(u.get("prompt_tokens") or 0 for u in usages)
c = sum(u.get("completion_tokens") or 0 for u in usages)
totals = [u.get("total_tokens") for u in usages if u.get("total_tokens") is not None]
t = totals[0] if totals else None
total_tokens is taken from the first non-null entry. Given the limited information available about the raw data contract, it is unclear whether this always matches the intended aggregation logic.
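Wrapping the quoted lines in a function makes the asymmetry concrete: prompt and completion tokens are summed across all usage entries, but the total is taken from the first non-null entry only, so `t` need not equal `p + c`. The function name is illustrative:

```python
def aggregate_usage(usages):
    """Mirror of the quoted aggregation over a list of usage dicts."""
    p = sum(u.get("prompt_tokens") or 0 for u in usages)
    c = sum(u.get("completion_tokens") or 0 for u in usages)
    totals = [u.get("total_tokens") for u in usages
              if u.get("total_tokens") is not None]
    t = totals[0] if totals else None  # first non-null only, per the review
    return p, c, t
```

Whether this is intended depends on the raw data contract, e.g. whether `total_tokens` is already a per-dialog cumulative figure or a per-call one.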

# load_jsonl_lenient: true
# # or: DATA_JUICER_JSONL_LENIENT=1
load_dataset_kwargs: {} # extra kwargs passed to datasets.load_dataset(). Useful for format-specific options, e.g. chunksize (JSON), columns (Parquet), delimiter (CSV).
load_jsonl_lenient: false # if true, stream jsonl* shards with stdlib json and skip bad lines; other suffixes in the same folder are ignored (not HF fallback). Confirm logs contain "[lenient jsonl] ACTIVE".
Suggest adding a --load_jsonl_lenient parser argument in config.py. This enables passing the option on the command line and gives better external access for tools like dj-agents.
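A minimal sketch of the suggested flag, assuming an argparse-based build_base_parser (which the PR aligns with upstream); the option name follows the config key, everything else here is illustrative:

```python
import argparse

def build_base_parser():
    """Sketch of the base parser with the suggested lenient-JSONL flag."""
    parser = argparse.ArgumentParser(description="data-juicer (sketch)")
    parser.add_argument(
        "--load_jsonl_lenient",
        action="store_true",
        help="stream jsonl* shards with stdlib json and skip bad lines "
             "(same effect as DATA_JUICER_JSONL_LENIENT=1)",
    )
    return parser
```

A store_true flag keeps the CLI default (False) consistent with the `load_jsonl_lenient: false` default in config_all.yaml.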

PII / redaction:
- Expand pii_redaction_mapper (PEM/JWT/URL/IP/MAC ordering, optional extended PII) with tests.
- pii_llm_suspect_mapper: spaCy install/locks, safer logging; prompts mention URL/IP/MAC/JWT/PEM leaks.

Reporting / demo:
- generate_bad_case_report: non-PII vs PII-flagged insight subsections; minimal PII cards;
  headline clusters use non-PII rows only; align redaction placeholders for grouping.
- agent_interaction_quality_analysis.yaml: pii_redaction indentation and default-behavior comment.

Recent branch history (already on upstream before this commit):
- OP doc build skips unregistered base classes; accelerator assignment fix;
  nested-query dict guard; bad-case report + skill insight parsing enrichments.

Made-with: Cursor
- Default writes safe HTML plus *_pii_audit.html; --report-pii-variants safe|audit|both

- Case study: ~half high_precision / half watchlist quota with spillover

- Reuse char TF-IDF + MiniBatchKMeans round-robin for Insight cards and case-study page

- Remove in-page PII minimal-card split; safe variant omits PII rows from insight + drill

- run_bad_case_pipeline.sh echoes audit path; smoke tests updated

Made-with: Cursor
- New section #sec-dialog-metrics: messages length, user turns, agent_turn_count, text chars, choices length, tool-touch message count, tokens (meta then stats), latency.

- Optional matplotlib histograms when --no-charts is off; TOC and charts intro link to section.

- Smoke test asserts sec-dialog-metrics anchor.

Made-with: Cursor
- Repeatable --input in generate_bad_case_report and verify; load_merged_rows reads paths in order.

- run_bad_case_pipeline report: multiple JSONL, optional trailing OUT.html.

- Multi-input: compact page meta and bottom #sec-data-provenance details for audit.

Made-with: Cursor

Labels

agent: related to agent
dj:op: issues/PRs about some specific OPs
dj:post-tuning: issues/PRs about post-tuning scenarios
enhancement: New feature or request


4 participants