Interaction
Data-Juicer is a one-stop system to process text and multimodal data for and with foundation models (typically LLMs). We provide a playground with a managed JupyterLab.
See more details on our homepage: https://github.com/modelscope/data-juicer
News
- [2025-01-11] We release our 2.0 paper, Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models. It can now process 70B data samples within 2.1h, using 6400 CPU cores on 50 Ray nodes of an Alibaba Cloud cluster, and deduplicate 5TB of data within 2.8h using 1280 CPU cores on 8 Ray nodes.
- [2025-01-03] We support post-tuning scenarios better, via 20+ related new OPs and a unified dataset format compatible with LLaMA-Factory and ModelScope-Swift.
- [2024-12-17] We propose HumanVBench, which comprises 17 human-centric tasks with synthetic data, benchmarking video-MLLMs' capabilities from the perspectives of inner emotion and outer manifestations. See more details in our paper, and try evaluating your models with it.
- [2024-11-22] We release DJ v1.0.0, in which we refactored Data-Juicer's Operator, Dataset, Sandbox and many other modules for better usability, adding support for fault tolerance, FastAPI, and adaptive resource management.
- [2024-08-25] We give a tutorial on data processing for multimodal LLMs at KDD'2024.