configs/claude-code/skills/data-juicer/SKILL.md
Primer for using the data-juicer Python library (also written `datajuicer` or `DJ`) — a YAML-driven, OP-based system for cleaning, filtering, deduplicating, transforming, and synthesizing text and multimodal data for foundation models. Use this skill whenever the user mentions data-juicer, DJ, dj-process, dj-analyze, "DJ format", building data recipes / YAML pipelines for LLM training data, or writing custom Filter / Mapper / Deduplicator / Selector / Aggregator / Grouper operators ("OPs"). Also reach for it when the user is putting together a data preprocessing pipeline for LLM pre-training, post-tuning, or multimodal datasets and DJ would be a natural fit, even if they haven't named the library yet — flagging DJ as an option is often the most helpful move.
Data-Juicer is a YAML-driven system for cleaning, filtering, deduplicating, transforming, and synthesizing data for foundation models. The main mental model: a config (YAML "recipe") describes a dataset and a list of operators ("OPs"); an executor streams the dataset through the OPs and writes the result. OPs are the unit of work — there are 200+ built-in ones, and writing your own is the common case for anything domain-specific.
Almost everything a user will ask about reduces to one of four questions:
Three reference files sit alongside this one for deeper material:
- references/op-zoo.md: the categorical OP listing with descriptions, tags, and per-OP doc links.
- references/config-reference.md: the full top-level config schema (every field, not just the common ones).
- references/custom-ops.md: deeper patterns such as batched processing, GPU acceleration, OP fusion, and lazy dependencies.

Load these only when the task actually needs them. For most "how do I get started / what does this look like" questions, this file is enough.
Every OP inherits from one of these base classes. The class determines its execution contract (what it takes in, what it puts out), so picking the right base class is the first decision when writing a custom OP.
| Type | Count | What it does |
|------------------|------:|---------------------------------------------------------------------------------------------------------------------------------------|
| Mapper | 105 | Edits / transforms a sample. Input = sample, output = sample. The workhorse for cleaning, normalization, augmentation, synthesis. |
| Filter | 56 | Keeps or drops a sample. Input = sample, output = bool. Two-phase: compute_stats_single writes a stat onto the sample, then process_single returns the keep/drop decision based on the stat. |
| Deduplicator | 10 | Removes duplicates across the dataset (exact, MinHash, SimHash, image/video hashes). Cross-sample, not per-sample. |
| Selector | 5 | Picks a subset based on ranking / frequency / range over a field (e.g. top-k, random sample). |
| Grouper | 3 | Groups N samples into one batched sample (input to Aggregators). |
| Aggregator | 4 | Reduces a batch of samples into one summary sample (e.g. summarization across docs, entity attribute aggregation). |
| Pipeline | 3 | Dataset-level processing — both input and output are full datasets (e.g. Ray + vLLM inference pipelines). |
| Formatter | 8 | Discovers, loads, canonicalizes source data into DJ format (csv, json, parquet, tsv, text, HF hub, etc.). Usually invoked implicitly by the dataset config — users rarely instantiate one directly. |
The Filter / Mapper distinction is the one most worth being precise about: Filters never modify samples, they only annotate stats and return a boolean. Mappers freely modify the sample dict (including adding fields). If a custom OP both transforms and drops samples, it's two OPs.
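The two contracts can be sketched in plain Python. This is a toy illustration only: the function names below are hypothetical stand-ins, not data_juicer classes, and the hand-written list comprehensions merely mimic what an executor would do.

```python
# Toy stand-ins for the Mapper and Filter contracts (NOT data_juicer classes).

def whitespace_mapper(sample: dict) -> dict:
    """Mapper contract: sample in, (modified) sample out."""
    out = dict(sample)
    out["text"] = " ".join(sample["text"].split())
    return out


def length_compute_stats(sample: dict) -> dict:
    """Filter phase 1: annotate a stat, leave the payload untouched."""
    stats = sample.setdefault("stats", {})
    if "text_len" not in stats:  # skip the work if the stat already exists
        stats["text_len"] = len(sample["text"])
    return sample


def length_keep(sample: dict, min_len: int = 10) -> bool:
    """Filter phase 2: keep/drop decision based only on the stat."""
    return sample["stats"]["text_len"] >= min_len


samples = [{"text": "  too   short "}, {"text": "long   enough   to   keep"}]
mapped = [whitespace_mapper(s) for s in samples]
annotated = [length_compute_stats(s) for s in mapped]
kept = [s for s in annotated if length_keep(s)]
```

The mapper changes `text`; the filter only adds a stat and then answers keep or drop. A custom OP that needs to do both is split along exactly this line.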
For the full categorical listing of OPs, see references/op-zoo.md. For the canonical, always-current list, point the user to:
- config_all.yaml: https://github.com/datajuicer/data-juicer/blob/main/data_juicer/config/config_all.yaml (every OP with every parameter, formatted as YAML; the most concrete reference for what an OP block looks like in a recipe)

DJ operates on a non-recursive dict-per-sample schema, stored on disk as JSONL (preferred), JSON, Parquet, CSV, TSV, TXT, or jsonl.gz. Local files have their format auto-detected; remote HuggingFace and arXiv sources are also first-class.
A sample has three logical parts:
{
  // 1. Core payload: directly used by training/fine-tuning downstream.
  "text": "The quick brown fox...",
  "query": "...",            // for post-tuning / dialog datasets
  "response": "...",

  // 2. Extra data: paths to multimodal assets stored on disk as path lists.
  "images": ["path/to/img1.jpg", "path/to/img2.jpg"],
  "audios": ["path/to/audio.wav"],
  "videos": ["path/to/video.mp4"],

  // 3. Meta + stats: either intrinsic to the source, or produced by OPs.
  "meta": {"src": "customized", "version": "0.1", "author": "..."},
  "stats": {"lang": "en", "text_len": 1234, "image_widths": [224, 336]}
}
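As a minimal sketch, samples with this layout can be written to JSONL using only the standard library. The file name and sample values are made up for illustration. The count check assumes one `<__dj__image>` token (DJ's default image token) per entry in `images`; it is a sanity check written for this example, not a DJ utility.

```python
import json
import os
import tempfile

IMAGE_TOKEN = "<__dj__image>"  # DJ's default image token

# Hypothetical samples: one multimodal, one text-only.
samples = [
    {
        "text": f"{IMAGE_TOKEN} a fox. {IMAGE_TOKEN} a dog.",
        "images": ["path/to/img1.jpg", "path/to/img2.jpg"],
        "meta": {"src": "customized"},
    },
    {"text": "Plain text sample.", "meta": {"src": "customized"}},
]


def image_tokens_match(sample: dict) -> bool:
    """True if the number of image tokens in text equals len(images)."""
    return sample["text"].count(IMAGE_TOKEN) == len(sample.get("images", []))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.jsonl")
    # JSONL: one JSON object per line.
    with open(path, "w", encoding="utf-8") as f:
        for s in samples:
            f.write(json.dumps(s, ensure_ascii=False) + "\n")
    with open(path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f]
```

Running a check like `image_tokens_match` before processing catches the most common multimodal formatting mistake, a token count that disagrees with the asset list.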
A few things worth internalizing:
- text is the default payload key. Most text OPs read from sample[self.text_key], where text_key defaults to "text". If your data lives under a different field (e.g. content, body), set text_keys at the top level of the config rather than per-OP.
- stats and meta are reserved. Filters write into sample[Fields.stats][StatsKeys.<name>]. Don't put your own data there; use top-level fields instead.
- Multimodal assets are referenced by special tokens inside text. Default tokens: <__dj__image>, <__dj__audio>, <__dj__video>. Tokens correspond positionally to entries in the images / audios / videos arrays. Chunks within a sample are separated by <|__dj__eoc|> (end-of-chunk).
- Scripts in tools/fmt_conversion/ convert popular formats (LLaVA, MMC4, WavCaps, Alpaca-CoT, ShareGPT, etc.) to and from DJ format. If a user has data in one of those formats, point them there before suggesting custom conversion: https://github.com/datajuicer/data-juicer/blob/main/tools/fmt_conversion/README.md

Configs are YAML, parsed by jsonargparse. A minimal recipe has three things: where the data is, what to do, and where to put the result.
# minimal_recipe.yaml
project_name: 'my-cleanup-pass'
dataset_path: './raw_corpus.jsonl'        # or use a `dataset:` block (see below)
export_path: './cleaned/result.jsonl'     # supports .jsonl / .json / .parquet
np: 8                                     # number of subprocesses

process:
  - whitespace_normalization_mapper:      # mappers transform
  - clean_html_mapper:
  - text_length_filter:                   # filters drop based on stats
      min_len: 50
      max_len: 100000
  - language_id_score_filter:
      lang: 'en'
      min_score: 0.8
  - document_deduplicator:                # dedup is its own category
      lowercase: true
      ignore_non_character: true
Run it:
dj-process --config minimal_recipe.yaml
# or, from source:
python tools/process_data.py --config minimal_recipe.yaml
The process list

- Each entry is op_name: followed by its parameters as a nested dict. An OP with no params still needs the trailing colon (whitespace_normalization_mapper:).
- OP names are the snake_case form of the class name: TextLengthFilter → text_length_filter. The registered name comes from @OPERATORS.register_module('name') in the OP source.
- To find an OP's parameters, search config_all.yaml for the OP name; every parameter is shown there with its default and a one-line comment.

The dataset: block

dataset_path: is the legacy shorthand. The modern form is a dataset: block, which supports remote sources, mixtures, and validation:
dataset:
  max_sample_num: 100000
  configs:
    - type: 'local'
      weight: 0.7
      path: 'path/to/file1.jsonl'
    - type: 'local'
      weight: 0.3
      path: 'path/to/file2.parquet'
    - type: 'remote'
      source: 'huggingface'
      path: 'HuggingFaceFW/fineweb'
      name: 'CC-MAIN-2024-10'
      split: 'train'
      limit: 1000

# Optional: validate before processing
validators:
  - type: 'required_fields'
    required_fields: ['text', 'meta']
    field_types: {text: 'str', meta: 'dict'}
For full dataset config syntax (mixture weights, validators, remote sources), see https://datajuicer.github.io/data-juicer/en/main/docs/DatasetCfg.html
These are the ones that come up regularly. The full list is in references/config-reference.md.
- text_keys: which sample fields hold the text payload. Default 'text'.
- np: subprocess count.
- executor_type: 'default' (single-machine HF Datasets) or 'ray' (distributed).
- use_cache: cache OP outputs between runs (default true). Cleared with dj-process --cleanup_cache.
- op_fusion: enable fusion of OPs that share intermediate variables (e.g. tokenization). Off by default; turn on for a noticeable speedup on text-heavy recipes.
- custom_operator_paths: list of file paths or directories to register external OPs from. This is how you use a custom OP without modifying the DJ source tree.
- open_tracer + op_list_to_trace: emit before/after samples for specified OPs, useful for debugging recipe behavior.
- mem_required: '<size>GB' so DJ can throttle parallelism and avoid OOM.

Anything in the config can be overridden via CLI in dot-notation:
dj-process --config recipe.yaml --language_id_score_filter.min_score=0.9 --np=16
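When overrides are driven from a script (for example, sweeping a threshold), the same command line can be assembled programmatically. The helper below is a hypothetical convenience, not part of DJ; it only builds the argument list in the `--key=value` form shown above and does not run anything.

```python
import shlex


def build_dj_command(config_path: str, overrides: dict) -> list[str]:
    """Build a dj-process argument list with dot-notation overrides."""
    cmd = ["dj-process", "--config", config_path]
    cmd += [f"--{key}={value}" for key, value in overrides.items()]
    return cmd


cmd = build_dj_command(
    "recipe.yaml",
    {"language_id_score_filter.min_score": 0.9, "np": 16},
)
command_line = shlex.join(cmd)  # shell-safe string, handy for logging
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it, assuming `dj-process` is installed and on the PATH.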
--help and the analyzer

- dj-process --help prints the entire hierarchical config schema (every OP, every parameter, every type). It's verbose but authoritative.
- dj-analyze --auto --dataset_path my.jsonl [--auto_num 1000] runs all stat-producing Filters on a sample and emits a report, which is a good way to size up an unfamiliar dataset before writing a recipe.

Two paths exist for adding a custom OP; pick by where the OP needs to live.

Path 1: custom_operator_paths (recommended for your own projects)

Write the OP in your own repo, register it via the config. No fork, no PR, no editing DJ's source.
# /my_project/ops/my_filter.py
import sys

from jsonargparse.typing import PositiveInt

from data_juicer.utils.constant import Fields, StatsKeys
from data_juicer.ops.base_op import OPERATORS, Filter


@OPERATORS.register_module('my_text_length_filter')
class MyTextLengthFilter(Filter):
    """Keep samples whose text length is in [min_len, max_len]."""

    def __init__(self, min_len: PositiveInt = 10,
                 max_len: PositiveInt = sys.maxsize,
                 *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.min_len = min_len
        self.max_len = max_len

    def compute_stats_single(self, sample):
        # Filters compute a stat first…
        if 'my_text_len' in sample[Fields.stats]:
            return sample
        sample[Fields.stats]['my_text_len'] = len(sample[self.text_key])
        return sample

    def process_single(self, sample):
        # …then return True (keep) or False (drop) based on the stat.
        return self.min_len <= sample[Fields.stats]['my_text_len'] <= self.max_len
Wire it in:
custom_operator_paths:
  - '/my_project/ops/my_filter.py'    # single file
  # - '/my_project/ops/'              # or a directory of files

process:
  - my_text_length_filter:
      min_len: 100
      max_len: 50000
Path 2: add the OP to the DJ source tree

Same code, but it lives at data_juicer/ops/<category>/<name>.py and gets exported via the category's __init__.py. Required if contributing to upstream; otherwise prefer Path 1 to keep your project decoupled from a DJ fork.
Every OP inherits from one of: Filter, Mapper, Deduplicator, Selector, Grouper, Aggregator. The base class dictates the contract:
| Base class | Methods to implement | Return contract |
|------------------|---------------------------------------------------------------|------------------------------------------|
| Mapper | process_single(sample) | Modified sample dict |
| Filter | compute_stats_single(sample) + process_single(sample) | Modified sample then bool (keep) |
| Deduplicator | compute_hash(sample) + dataset-level process(dataset) | Hash, then deduped dataset |
| Selector | process(dataset) | Selected subset |
| Grouper | process(dataset) | Dataset of batched samples |
| Aggregator | process_single(batched_sample) | Aggregated output |
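The Deduplicator row is the least obvious contract in the table, because it mixes a per-sample step with a dataset-level step. A toy sketch in plain Python, not using DJ's classes: the function names, the MD5 choice, and the `lowercase` flag are assumptions made for illustration.

```python
import hashlib


def compute_hash(sample: dict, lowercase: bool = True) -> str:
    """Per-sample step: reduce the text to a comparable fingerprint."""
    text = sample["text"].lower() if lowercase else sample["text"]
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def dedup(dataset: list[dict]) -> list[dict]:
    """Dataset-level step: keep the first sample seen for each hash."""
    seen = set()
    kept = []
    for sample in dataset:
        digest = compute_hash(sample)
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept


dataset = [
    {"text": "Hello world"},
    {"text": "hello world"},     # duplicate after lowercasing
    {"text": "Something else"},
]
deduped = dedup(dataset)
```

The second step needs to see every hash, which is why deduplication is cross-sample and cannot be expressed as a per-sample Filter.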
Always:
- Register with @OPERATORS.register_module('snake_case_name'); this is what process: in the YAML looks for.
- Call super().__init__(*args, **kwargs) first.
- Read text via sample[self.text_key] (don't hardcode 'text').
- For Filters, write stats into sample[Fields.stats][...] and check that key first to avoid recomputing if a previous fused OP already produced it.

For batched processing, GPU acceleration, OP fusion (sharing intermediate computations between OPs), and lazy dependency loading, see references/custom-ops.md. These matter for performance once an OP is in real use, but are optional for a first pass.
For interactive work, tests, or embedding in a larger Python pipeline, skip the YAML and call OPs directly:
from data_juicer.core.data import NestedDataset
from data_juicer.ops.filter import TextLengthFilter
from data_juicer.ops.mapper import WhitespaceNormalizationMapper

ds = NestedDataset.from_dict({
    "text": ["Short", "This passes the filter.", "Text  with   spaces"]
})

result = ds.process([
    WhitespaceNormalizationMapper(),
    TextLengthFilter(min_len=10),
])

for s in result:
    print(s)
Useful when iterating on a recipe, writing unit tests for a custom OP, or running DJ as a step inside a larger Python program.
Bookmark these — they're the canonical sources of truth and are kept in sync with the code:
| Need | Link |
|------|------|
| Browse all OPs with descriptions | https://datajuicer.github.io/data-juicer/en/main/docs/Operators.html |
| Every OP with every parameter as YAML | https://github.com/datajuicer/data-juicer/blob/main/data_juicer/config/config_all.yaml |
| API reference (Python classes) | https://datajuicer.github.io/data-juicer/en/main/api.html |
| Developer guide (writing OPs, contributing) | https://datajuicer.github.io/data-juicer/en/main/docs/DeveloperGuide.html |
| Dataset configuration (sources, mixture, validation) | https://datajuicer.github.io/data-juicer/en/main/docs/DatasetCfg.html |
| DJ format spec (incl. multimodal) | https://github.com/datajuicer/data-juicer/blob/main/tools/fmt_conversion/README.md |
| Cookbook (recipes, demos, competitions) | https://github.com/datajuicer/data-juicer/blob/main/docs/tutorial/DJ-Cookbook.md |
| Recipe gallery (real recipes to fork) | https://datajuicer.github.io/data-juicer-hub/en/main/docs/RecipeGallery.html |
| Distributed processing on Ray | https://datajuicer.github.io/data-juicer/en/main/docs/Distributed.html |
When the user asks for a recipe or custom OP, walk this loop before answering:
1. Look for an existing OP first, in references/op-zoo.md or the OperatorZoo URL above. The library has 200+ OPs and the right one often already exists.
2. For data in a supported third-party format, use tools/fmt_conversion/ before custom code.
3. Set mem_required so DJ can manage parallelism, and consider executor_type: ray if the dataset is large.
4. Order the process: list deliberately. Cheap CPU filters first, expensive GPU/API mappers next, dedup near the end. Wrong order is the most common cause of slow recipes.
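The ordering advice can be made concrete with a toy cost count. Everything here is hypothetical: the "expensive" mapper is a stand-in for a GPU or API call, and the loop is a hand-written miniature pipeline, not DJ's executor.

```python
def cheap_filter(sample: dict, min_len: int = 20) -> bool:
    """Stand-in for a cheap CPU filter."""
    return len(sample["text"]) >= min_len


def expensive_mapper(sample: dict) -> dict:
    """Stand-in for a costly GPU/API mapper (length-preserving here)."""
    return {**sample, "text": sample["text"].upper()}


def run(samples: list[dict], filter_first: bool) -> tuple[list[dict], int]:
    """Run both OPs in the given order; return output and expensive-call count."""
    expensive_calls = 0
    out = []
    for sample in samples:
        if filter_first and not cheap_filter(sample):
            continue  # dropped before paying for the mapper
        sample = expensive_mapper(sample)
        expensive_calls += 1
        if not filter_first and not cheap_filter(sample):
            continue  # dropped only after the cost was already paid
        out.append(sample)
    return out, expensive_calls


samples = [
    {"text": "short"},
    {"text": "tiny"},
    {"text": "this one is long enough to keep"},
    {"text": "x"},
    {"text": "another sufficiently long sample"},
]
good_out, good_calls = run(samples, filter_first=True)
bad_out, bad_calls = run(samples, filter_first=False)
```

Both orders produce the same output here, but the filter-first order invokes the expensive step only for samples that survive. The equivalence holds because this mapper does not change the quantity the filter measures; when a mapper does change it, the order affects results as well as cost.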