Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

AI & ML interests

Machine Learning Librarian

Recent Activity

updated a dataset about 2 hours ago

librarian-bots/model_cards_with_metadata

updated a dataset about 3 hours ago

librarian-bots/dataset_cards_with_metadata

updated a dataset about 3 hours ago

librarian-bots/dataset-columns

Articles

Open-R1: Update #1

Explore, Curate and Vector Search Any Hugging Face Dataset with Nomic Atlas

FineWeb2-C: Help Build Better Language Models in Your Language

Open Preference Dataset for Text-to-Image Generation by the 🤗 Community

Let’s make a generation of amazing image generation models

Share your open ML datasets on Hugging Face Hub!

Scaling AI-based Data Processing with Hugging Face + Dask

Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation

Data Is Better Together: A Look Back and Forward

Synthetic dataset generation techniques: generating custom sentence similarity data

Synthetic dataset generation techniques: Self-Instruct

Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Data is better together

Extracting Insights from Model Cards Using Open Large Language Models

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Introducing BERTopic Integration with Hugging Face Hub

Jupyter X Hugging Face

Image search with 🤗 datasets

davanstrien's activity

New activity in Qwen/Qwen2-VL-72B-Instruct 4 days ago

add new version metadata

#10 opened 4 days ago by

New activity in Qwen/Qwen2-VL-7B-Instruct 4 days ago

Add new version metadata

#75 opened 4 days ago by

commented a paper 4 days ago

WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training

Paper • 2501.18511 • Published 4 days ago • 15

New activity in LabHC/histoires_morales 6 days ago

add citation information

#5 opened 6 days ago by

New activity in open-thoughts/OpenThoughts-114k 6 days ago

add synthetic tag metadata

#5 opened 6 days ago by

New activity in davanstrien/ColPali-Query-Generator 7 days ago

regex parsing

#6 opened 7 days ago by

Update app.py

#5 opened 7 days ago by

Update app.py

#4 opened 7 days ago by

use smaller model

#3 opened 7 days ago by

switch to qwen2.5 vl

#2 opened 7 days ago by

Update requirements.txt

#1 opened 7 days ago by

New activity in Qwen/Qwen2.5-VL-3B-Instruct 7 days ago

Remove recursive base_model

#2 opened 7 days ago by

New activity in data-is-better-together/fineweb-c 8 days ago

fix config names

#11 opened 8 days ago by

New activity in bespokelabs/Bespoke-Stratos-17k 12 days ago

fix broken github link

#2 opened 12 days ago by

Add synthetic tag

#1 opened 12 days ago by

New activity in yangbh217/MMSci_Table 12 days ago

Add citation info and minimal metadata

#1 opened 12 days ago by

commented a paper 18 days ago

The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models

Paper • 2501.09653 • Published 19 days ago • 12

New activity in TurkuNLP/finerweb-10bt 19 days ago

Release classifier and training data?

#3 opened 19 days ago by

New activity in NationalLibraryOfScotland/encyclopaedia_britannica_illustrated 20 days ago

Update README.md

#2 opened 20 days ago by

New activity in RZ412/PokerBench 20 days ago

add minimal metadata

#2 opened 20 days ago by