This document shows end-to-end examples of the QA Copilot. Each case lists the user query (input) and the assistant answer (output).
You can chat with our Q&A Copilot, Juicer, on the official Data-Juicer documentation site! Feel free to ask Juicer anything related to the Data-Juicer ecosystem.
- Type: `operator_qa`
- Language: `en`
Input (user query)
Introduce alphanumeric_filter
Output (assistant answer)
I will retrieve the operator information for alphanumeric_filter to provide you with its details.
I will search for the alphanumeric_filter in the Data-Juicer codebase to find its implementation and documentation.
I will retrieve the contents of the alphanumeric_filter.md documentation file to provide a detailed introduction.
The alphanumeric_filter is a filter operator in Data-Juicer designed to retain text samples where the ratio of alphanumeric characters (or tokens) falls within a specified range. It is commonly used to clean datasets by removing samples that are either too sparse in alphanumeric content (e.g., mostly symbols or whitespace) or too dense (e.g., code-like or random character strings), depending on the configured thresholds.
- Metric Calculation: It calculates the ratio of alphanumeric characters to total characters. If tokenization is enabled, it calculates the ratio of alphanumeric tokens to total tokens using a Hugging Face tokenizer.
- Filtering Logic: Samples are kept if their ratio falls between `min_ratio` and `max_ratio`; samples outside this range are discarded.
- Stats Caching: The computed ratio (`alnum_ratio` or `alpha_token_ratio`) is cached in the sample's stats for transparency and reuse.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `tokenization` | `bool` | `False` | If True, uses a Hugging Face tokenizer to compute the ratio based on tokens. If False, computes based on characters. |
| `min_ratio` | `float` | `0.25` | The minimum acceptable alphanumeric ratio. Samples below this value are filtered out. |
| `max_ratio` | `float` | `sys.maxsize` | The maximum acceptable alphanumeric ratio. Samples above this value are filtered out. |
In a Data-Juicer recipe (YAML), you might configure it like this:

```yaml
process:
  - alphanumeric_filter:
      min_ratio: 0.2
      max_ratio: 0.9
      tokenization: false
```

- Character-based filtering (`tokenization=False`):
  - Input: `",。、„”“«»1」「《》´∶:?!();–—.~’…━〈〉【】%►"` (mostly symbols)
    - Result: Filtered out because the alphanumeric ratio is very low (below `min_ratio`).
  - Input: `"Do you need a cup of coffee?"`
    - Result: Kept because the ratio of alphanumeric characters to total characters is well within the configured range.
- Token-based filtering (`tokenization=True`):
  - This mode is useful for detecting samples whose tokens are largely non-alphanumeric, or when the tokenization behavior of a specific model is relevant to data quality.
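The character-based behavior above can be sketched in a few lines of plain Python. This is an illustrative approximation, not the actual Data-Juicer implementation; the real operator also caches `alnum_ratio` in the sample's stats.

```python
import sys

def keep_sample(text: str, min_ratio: float = 0.25,
                max_ratio: float = sys.maxsize) -> bool:
    """Character-based approximation of alphanumeric_filter's keep/drop rule."""
    if not text:
        return False
    # Share of alphanumeric characters among all characters.
    alnum_ratio = sum(ch.isalnum() for ch in text) / len(text)
    return min_ratio <= alnum_ratio <= max_ratio

keep_sample("Do you need a cup of coffee?")  # kept: most characters are letters
keep_sample(",。、「」《》%►")                # dropped: almost all symbols
```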
For more details, you can refer to the official documentation or the source code.
- Type: `submodule_qa`
- Language: `en`
Input (user query)
Introduce DJ-Agents
Output (assistant answer)
I'll help you learn about DJ-Agents (Data-Juicer Agents). Let me search for information about this project.
DJ-Agents is a suite of intelligent agents designed for agentic data processing, built on top of Data-Juicer and AgentScope. Its goal is to enable a development-free data processing lifecycle, allowing users to focus on what they want to do with their data rather than how to implement it.
Instead of writing complex data processing pipelines manually, DJ-Agents allows you to interact with your data through natural language conversations or high-level commands. The agents automatically plan, retrieve appropriate operators, and execute data processing tasks.
DJ-Agents is organized into four main layers:
- Surface Adapters: User-facing interfaces like the `djx` CLI and the interactive `dj-agents` session.
- Capability Orchestration: End-to-end workflows such as planning, applying, and development.
- Tools: Atomic, reusable capabilities (e.g., operator retrieval, dataset inspection).
- Runtime Adapters: Integration with underlying frameworks like AgentScope.
The `dj-agents` session is the conversational interface where you can chat with an agent to process your data.
- TUI (Terminal User Interface): A rich interactive experience in your terminal.
- Plain Mode: Simple text-based interaction.
- AgentScope Studio: Integration with a visual studio for monitoring agent behavior.
- Workflow: The agent typically follows the path `inspect_dataset -> retrieve_operators -> build_plan -> validate -> save`.
A command-line tool for explicit, engineer-focused workflows:
- `djx retrieve`: Find candidate Data-Juicer operators for a specific intent (e.g., "remove duplicate text").
- `djx plan`: Generate a complete data processing plan (YAML) based on a natural language description.
- `djx apply`: Execute a saved plan on your dataset.
- `djx tool`: Access atomic tools for automation or integration with other agents.
A specialized assistant deployed on the official Data-Juicer documentation site, DingTalk, and Discord to answer questions about the entire Data-Juicer ecosystem.
To get started, you need Python 3.10+ and an API key (DashScope or OpenAI-compatible).
- Install: `uv pip install -e '.[full]'`
- Configure Model:

  ```shell
  export DASHSCOPE_API_KEY="<your_key>"
  export DJA_SESSION_MODEL="qwen3-max-2026-01-23"
  ```

- Interactive Processing:

  ```shell
  dj-agents --dataset ./data/demo-dataset.jsonl --export ./data/output.jsonl
  ```

- CLI Planning:

  ```shell
  djx plan "deduplicate and clean text for RAG" \
    --dataset ./data/demo-dataset.jsonl \
    --output ./data/demo-plan.yaml
  djx apply --plan ./data/demo-plan.yaml --yes
  ```
The project is continuously evolving with a focus on:
- DJ Skills: Packaging tools into soft-orchestrated skills for general-purpose agents.
- InteRecipe Studio: A frontend for interactive recipe construction via natural language.
- Broader Scenarios: Expanding support for RAG, Embodied Intelligence, and Data Lakehouse architectures.
For more detailed documentation, you can visit the DJ-Agents GitHub Repository.
- Type: `concept_qa`
- Language: `en`
Input (user query)
Tell me about the base class of operators
Output (assistant answer)
I'll help you find information about the base class of operators in Data-Juicer. Let me search for this in the codebase.
Great! I found the base operator file. Let me examine it to understand the base class structure.
Perfect! Now I have a comprehensive understanding of the base class structure for operators in Data-Juicer. Let me provide you with a detailed explanation.
In Data-Juicer, all operators inherit from a common base class `OP` (Operator), which is defined in `data_juicer/ops/base_op.py`.
The OP class serves as the foundation for all operator types and provides:
- `_name`: The operator name (automatically set by the registry)
- `_accelerator`: Hardware accelerator ("cpu" or "cuda")
- `_batched_op`: Whether this operator processes batches
- `_requirements`: Extra package requirements
- Data keys: `text_key`, `image_key`, `audio_key`, `video_key`, `image_bytes_key`
- Chat format keys: `system_key`, `instruction_key`, `query_key`, `response_key`, `history_key`
- Processing config: `batch_size`, `work_dir`, `skip_op_error`, `accelerator`
- Resource allocation: `num_cpus`, `num_gpus`, `memory`, `runtime_env`
- Parallelism: `auto_op_parallelism`, `num_proc`
- `process()`: Abstract method that must be implemented by subclasses
- `run(dataset)`: Executes the operator on a dataset
- `runtime_np()`: Calculates the number of processes to use
- `use_cuda()`: Checks if CUDA acceleration is available
- `is_batched_op()`: Determines if batch processing is enabled
- `get_env_spec()`: Returns the environment specification for dependencies
Data-Juicer defines 6 main operator types, each extending OP:
**Mapper**

- Purpose: Modifies/transforms samples (sample → sample)
- Key methods:
  - `process_single(sample)`: Process one sample
  - `process_batched(samples)`: Process a batch of samples
- Examples: Text cleaning, image resizing, content replacement
**Filter**

- Purpose: Removes samples based on criteria (sample → Boolean)
- Key methods:
  - `compute_stats_single(sample)`: Calculate filtering metrics
  - `process_single(sample)`: Return True/False for keep/filter
- Features: Supports min/max thresholds, closed/open intervals, reversed ranges
- Examples: Length filters, quality filters, toxicity filters
**Deduplicator**

- Purpose: Identifies and removes duplicate samples
- Key methods:
  - `compute_hash(sample)`: Generate a hash for deduplication
  - `process(dataset)`: Perform deduplication on the dataset
- Examples: MinHash dedup, exact-match dedup
**Selector**

- Purpose: Selects subsets of data at the dataset level (dataset → dataset)
- Key method: `process(dataset)`
- Examples: Top-k selection, range-based selection
**Grouper**

- Purpose: Groups samples together (dataset → grouped dataset)
- Key method: `process(dataset)`
- Examples: Key-value grouping, bucket grouping
**Aggregator**

- Purpose: Aggregates grouped samples (batched sample → single sample)
- Key method: `process_single(sample)`
- Note: Must be used after a Grouper
- Examples: Merging grouped texts, combining annotations
- Purpose: Represents a complete data processing pipeline
- Key method: `run(dataset)`
- Examples: Multi-step workflows
- Registry System: All operators are registered in the `OPERATORS` registry for automatic discovery
- Batch Processing Support: Operators can work in single-sample or batch mode via the `_batched_op` flag
- Error Handling: Built-in exception catching with `catch_map_batches_exception` and `catch_map_single_exception`
- Tracing Integration: Wrappers (`wrap_mapper_with_tracer`, `wrap_filter_with_tracer`) enable sample-level change tracking
- Ray Integration: Support for distributed processing with configurable CPU/GPU/memory resources
- CUDA Acceleration: Automatic detection and utilization of GPU when available
- Metaclass: `OPMetaClass` captures initialization arguments for serialization/replay
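The metaclass idea can be illustrated with a small self-contained sketch. This is a simplified imitation of what an `OPMetaClass`-style metaclass does, not the actual implementation:

```python
class CaptureInitMeta(type):
    """Metaclass that records constructor arguments on each new instance."""
    def __call__(cls, *args, **kwargs):
        obj = super().__call__(*args, **kwargs)
        # Stash the init args so the operator can later be serialized/replayed.
        obj._init_args, obj._init_kwargs = args, dict(kwargs)
        return obj

class DemoOp(metaclass=CaptureInitMeta):
    """Hypothetical operator used only to demonstrate the capture mechanism."""
    def __init__(self, min_ratio=0.25, max_ratio=0.9):
        self.min_ratio, self.max_ratio = min_ratio, max_ratio

op = DemoOp(min_ratio=0.1)
# op._init_kwargs now holds {"min_ratio": 0.1}, enough to recreate the op later
```

Captured arguments like these are what make it possible to dump a configured operator back out to a recipe and replay it.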
When creating a custom operator, you typically:
- Inherit from the appropriate type (Mapper, Filter, etc.)
- Implement the required `process_*` methods
- Register it with the `@OPERATORS.register_module()` decorator
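Putting the three steps together, a custom Filter might look like the sketch below. The `Filter` base and `OPERATORS` registry here are small stand-ins that mimic the Data-Juicer pattern so the example is self-contained; in real code you would import them from `data_juicer.ops`, and exact signatures may differ between versions. The operator name `my_length_filter` is hypothetical.

```python
class _Registry:
    """Stand-in for Data-Juicer's OPERATORS registry."""
    def __init__(self):
        self.modules = {}

    def register_module(self, name=None):
        def decorator(cls):
            self.modules[name or cls.__name__] = cls
            return cls
        return decorator

OPERATORS = _Registry()

class Filter:
    """Stand-in for the Data-Juicer Filter base class."""
    def __init__(self, text_key="text", **kwargs):
        self.text_key = text_key

@OPERATORS.register_module("my_length_filter")  # hypothetical operator name
class MyLengthFilter(Filter):
    """Keep samples whose text length lies within [min_len, max_len]."""
    def __init__(self, min_len=10, max_len=10_000, **kwargs):
        super().__init__(**kwargs)
        self.min_len, self.max_len = min_len, max_len

    def compute_stats_single(self, sample):
        # Cache the metric in the sample's stats, as Filters do.
        sample.setdefault("stats", {})["text_len"] = len(sample[self.text_key])
        return sample

    def process_single(self, sample):
        # True -> keep the sample, False -> filter it out.
        return self.min_len <= sample["stats"]["text_len"] <= self.max_len
```

With this pattern, `op = MyLengthFilter(min_len=5)` computes a stat in one pass and makes the keep/drop decision in another, mirroring the two-phase Filter lifecycle described above.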
For more details on implementing operators, see the Operators Documentation.
- Type: `quick_start_qa`
- Language: `en`
Input (user query)
How to install DJ?
Output (assistant answer)
To install Data-Juicer (DJ), you can follow the official installation guide. Here are the common methods:
This is the simplest way to install the core Data-Juicer package.
```shell
pip install data-juicer
```

If you want to use the latest features or contribute to the project, you can clone the repository and install it in editable mode:

```shell
git clone https://github.com/datajuicer/data-juicer.git
cd data-juicer
pip install -e .
```

Data-Juicer supports various data modalities (text, image, audio, video). You may need to install extra dependencies depending on your use case.
- For full functionality (including all modalities): `pip install data-juicer[all]`
- For specific modalities:
  - Text only: `pip install data-juicer`
  - Image support: `pip install data-juicer[image]`
  - Audio support: `pip install data-juicer[audio]`
  - Video support: `pip install data-juicer[video]`
You can also use Docker to run Data-Juicer without setting up the environment locally. Pre-built images are often available in the repository's Docker instructions.

```shell
# Example command (check the latest tag in the repo)
docker pull datajuicer/data-juicer:latest
```

After installation, you can verify it by running:

```shell
dj --help
```

For detailed instructions, including system requirements and troubleshooting, please refer to the official Installation Guide.
- Type: `best_practice_qa`
- Language: `en`
Input (user query)
Could you give me some best practices for using DJ?
Output (assistant answer)
To provide you with the most accurate best practices for using Data-Juicer (DJ), I will search for official documentation, tutorials, and community guidelines within the Data-Juicer repositories.
I will now look for the official Data-Juicer repository to find best practices and documentation.
I will now examine the data-juicer-hub repository, as its description mentions "best practices," and also look at the main data-juicer documentation.
I will examine the DeveloperGuide.md and the RecipeGallery.md in the data-juicer-hub to find best practices.
I will now look at the RecipeGallery.md in the data-juicer-hub repository to find more best practices and community-contributed recipes.
I will also check the awesome_llm_data.md file in the main repository, as it likely contains valuable insights and best practices for data processing.
Based on the official Data-Juicer documentation, Developer Guide, and the Data-Juicer Hub, here are the best practices for using Data-Juicer (DJ):
Instead of building pipelines from scratch, start with existing Recipes from the Data-Juicer Hub.
- Reproduce & Refine: Use community-contributed recipes that reproduce popular datasets like RedPajama, The Pile, or LLaVA.
- Iterative Improvement: DJ provides "refined" versions of these datasets. For example, using a simple 3-σ rule to filter outliers in RedPajama subsets can significantly improve model performance while retaining ~90-95% of the data.
- Modality-Specific Recipes: Utilize specific recipes for Text, Image-Text (e.g., LCS-558k), and Video (e.g., InternVid/Panda-70M) processing.
When creating custom Operators (OPs), follow the tiered contribution model:
- Alpha Version: Focus on basic functionality. Implement
compute_stats_singleandprocess_single. - Beta Version: Add robustness. Include unit tests (
tests/ops/...) and comprehensive docstrings. Updateconfig_all.yamlso your OP appears in the help menus. - Stable Version: Optimize for performance.
- Enable Batched Processing by setting
_batched_op = Trueand implementingprocess_batched. - Enable GPU Acceleration by setting
_accelerator = 'cuda'if using models. - Implement OP Fusion by registering intermediate variables (like
wordsorlines) to avoid redundant calculations across multiple filters.
- Enable Batched Processing by setting
- Hierarchical Configs: Use YAML files with dot notation (e.g., `--text_length_filter.min 10`). DJ automatically registers configs from operator docstrings, keeping your code and config in sync.
- Lazy Loading: DJ uses a lazy loading system, so you don't need to install every dependency upfront; dependencies are installed on demand when an operator is first used.
- Environment Management: Use `uv` for fast virtual environment management and dependency resolution, as recommended in the Developer Guide.
- Bad Data Exhibition: Regularly check the Bad Data Exhibition to understand common data quality issues (e.g., repetition, low information density).
- Statistical Filtering: Use statistical methods (like the 3-σ rule) to set hyperparameters for operators rather than guessing thresholds.
- Model-as-a-Judge: Leverage DJ's ability to use MLLMs as evaluators (e.g., for toxicity, aesthetics, or cross-modal alignment) to create high-quality synthetic data or filter existing datasets.
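The 3-σ rule mentioned above can be sketched as a small helper that derives thresholds from observed stats rather than guessed values. The stat values here are hypothetical, and this helper is not a Data-Juicer API:

```python
import statistics

def three_sigma_bounds(values):
    """Derive min/max thresholds as mean ± 3 * std, instead of guessing them."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std

# Hypothetical text lengths from an analysis pass over a dataset.
lengths = [150, 180, 200, 220, 250] * 4 + [5000]
lo, hi = three_sigma_bounds(lengths)
kept = [v for v in lengths if lo <= v <= hi]  # the 5000-char outlier is dropped
```

The resulting `lo`/`hi` values can then be plugged into an operator's `min_*`/`max_*` parameters in the recipe.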
- Distributed Processing: For large-scale datasets, use DJ's distributed mode (often powered by Ray) to parallelize processing across multiple nodes.
- Caching & Checkpoints: Enable caching and partitioning to handle long-running jobs. This allows you to resume processing from the last checkpoint if a job fails.
- Tracing: Use the built-in tracing features to monitor the flow of data and identify bottlenecks in your processing pipeline.
- Share Your Recipes: If you develop a successful processing pipeline, contribute it to the Data-Juicer Hub. This promotes reuse and helps establish best practices for specific domains (e.g., medical or legal text).
- Operator Fusion: When developing new operators, consider how they might share intermediate results with existing ones to speed up the entire pipeline.
For more detailed technical steps, refer to the Developer Guide and the Operators Documentation.
- Type: `off_topic_qa`
- Language: `en`
Input (user query)
How's the weather today?
Output (assistant answer)
Sorry, this question is unrelated to Data-Juicer.
