Data-Juicer

community

https://github.com/modelscope/data-juicer

Recent Activity

yxdyc updated a dataset 11 days ago

datajuicer/HumanVBench

yxdyc updated a Space 13 days ago

datajuicer/README

SarahZhout updated a dataset about 1 month ago

datajuicer/HumanVBench

Organization Card

Interaction

Data-Juicer is a one-stop system for processing text and multimodal data for and with foundation models (typically LLMs). We provide a playground with a managed JupyterLab. For more details, see our homepage: https://github.com/modelscope/data-juicer

News

[2025-01-11] We release our 2.0 paper, Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models. Data-Juicer can now process 70B data samples within 2.1h using 6400 CPU cores on 50 Ray nodes of an Alibaba Cloud cluster, and deduplicate 5TB of data within 2.8h using 1280 CPU cores on 8 Ray nodes.
[2025-01-03] We support post-tuning scenarios better, via 20+ related new OPs and a unified dataset format compatible with LLaMA-Factory and ModelScope-Swift.
[2024-12-17] We propose HumanVBench, which comprises 17 human-centric tasks with synthetic data, benchmarking video-MLLMs' capabilities from the perspectives of inner emotion and outer manifestations. See more details in our paper, and try evaluating your models with it.
[2024-11-22] We release DJ v1.0.0, in which we refactored Data-Juicer's Operator, Dataset, Sandbox and many other modules for better usability, adding support for fault tolerance, FastAPI and adaptive resource management.
[2024-08-25] We give a tutorial on data processing for multimodal LLMs at KDD'2024.
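
To put the throughput figures from the 2025-01-11 item in perspective, the short sketch below converts them into average per-core rates. It uses only the numbers quoted above. It assumes 1 TB = 10^12 bytes and treats the work as evenly spread across cores and time, so the results are rough averages and not measured per-core performance. The function and variable names are illustrative only.

```python
# Figures quoted in the 2025-01-11 news item.
SAMPLES = 70e9          # 70B data samples
PROCESS_HOURS = 2.1
PROCESS_CORES = 6400

DEDUP_BYTES = 5e12      # assumption: 1 TB = 10**12 bytes
DEDUP_HOURS = 2.8
DEDUP_CORES = 1280


def per_core_rate(total, hours, cores):
    """Average units handled per core per second, assuming an even spread."""
    return total / (hours * 3600) / cores


samples_per_core_s = per_core_rate(SAMPLES, PROCESS_HOURS, PROCESS_CORES)
dedup_mb_per_core_s = per_core_rate(DEDUP_BYTES, DEDUP_HOURS, DEDUP_CORES) / 1e6

print(f"processing: ~{samples_per_core_s:.0f} samples per core per second")
print(f"deduplication: ~{dedup_mb_per_core_s:.2f} MB per core per second")
```

Under these assumptions the quoted runs average roughly 1.4k samples and about 0.39 MB of deduplicated input per core per second.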

spaces 14

Data Visualization Op Insight

Auto Evaluation Helm

Data Process Loop

Data Mixture

Data Visualization Diversity

Data Visualization Op Effect

models 8

datajuicer/Data-Juicer-T2V-v2

Text-to-Video • Updated Sep 23, 2024 • 1

datajuicer/Data-Juicer-T2V

Text-to-Video • Updated Jul 17, 2024 • 3

datajuicer/LLaMA2-7B-ZH-Chat-52k

Text Generation • Updated Nov 10, 2023 • 5 • 1

datajuicer/LLaMA-7B-EN-Chat-40k

Text Generation • Updated Nov 10, 2023 • 8 • 1

datajuicer/LLaMA-1B-dj-refine-150B-instruct-4.7B

Text Generation • Updated Nov 10, 2023 • 10 • 1

datajuicer/LLaMA-1B-dj-refine-50B

Text Generation • Updated Nov 10, 2023 • 6

datajuicer/LLaMA-1B-dj-refine-100B

Text Generation • Updated Nov 10, 2023 • 8

datajuicer/LLaMA-1B-dj-refine-150B

Text Generation • Updated Nov 10, 2023 • 18.5k • 1

datasets 26

datajuicer/HumanVBench

Viewer • Updated 11 days ago • 2.27k • 972 • 2

datajuicer/Img-Diff

Updated Dec 20, 2024 • 100 • 2

datajuicer/data-juicer-t2v-evolution-data-pool

Updated Sep 23, 2024 • 2

datajuicer/data-juicer-t2v-optimal-data-pool

Viewer • Updated Jul 23, 2024 • 10 • 69

datajuicer/llava-pretrain-refined-by-data-juicer

Viewer • Updated Mar 7, 2024 • 10 • 68 • 2

datajuicer/alpaca-cot-en-refined-by-data-juicer

Viewer • Updated Nov 10, 2023 • 5 • 56

datajuicer/alpaca-cot-zh-refined-by-data-juicer

Viewer • Updated Nov 10, 2023 • 5 • 51 • 4

datajuicer/the-pile-nih-refined-by-data-juicer

Viewer • Updated Oct 23, 2023 • 100 • 40

datajuicer/the-pile-pubmed-abstracts-refined-by-data-juicer

Viewer • Updated Oct 23, 2023 • 100 • 48 • 2

datajuicer/the-pile-pubmed-central-refined-by-data-juicer

Viewer • Updated Oct 23, 2023 • 100 • 56 • 1