Stefan Schweter's picture

Stefan Schweter PRO

stefan-it

AI & ML interests

Flair Library πŸ’•, NER & PoS Tagging, LM Pretraining (mostly encoder-only & encoder-decoder), Historical Language Models

Recent Activity

Organizations

Bayerische Staatsbibliothek's profile picture flair's profile picture Flax Community's profile picture dumitrescustefan-org's profile picture GermanT5's profile picture BigScience: LMs for Historical Texts's profile picture BigLAM: BigScience Libraries, Archives and Museums's profile picture Universal NER's profile picture Libre Euro Lingua-Alliance's profile picture Lang UK's profile picture BabyLM Challenge's profile picture hmByT5's profile picture hmByT5 Preliminary's profile picture Blog-explorers's profile picture German Wikipedia LMs's profile picture hmBERT's profile picture hmTEAMS's profile picture HIPE's profile picture hmBERT Tiny's profile picture hmBERT 64k's profile picture LSV @ Saarland University's profile picture GERMATRON's profile picture PleIAs's profile picture German LLM Tokenizers's profile picture Social Post Explorers's profile picture Occiglot's profile picture GERTuraX's profile picture Stefmal's profile picture ScaDS.AI German LLM's profile picture ENGEBA's profile picture Nerdy Face's profile picture TensorFlow Model Garden LMs's profile picture

stefan-it's activity

posted an update 1 day ago
view post
Post
2755
She arrived 😍

[Expect more models soon...]
  • 1 reply
Β·
reacted to nicolay-r's post with πŸš€ 16 days ago
view post
Post
2331
πŸ“’ If you wish to empower LLM with IR and named entity recognition module, then I got relevant findings.
Just tested Flair below is how you can start for adapting for processing your CSV / JSONL data via bulk-ner
πŸ‘©β€πŸ’» code: https://github.com/nicolay-r/nlp-thirdgate/blob/master/tutorials/ner_flair_0151.sh
πŸ€– models: https://huggingface.co/flair

Provider: https://raw.githubusercontent.com/nicolay-r/nlp-thirdgate/refs/heads/master/ner/flair_0151.py
Framework: https://github.com/nicolay-r/bulk-ner

πŸš€ Performance: the default ner model (Thinkpad X1 Nano)
Batch-size 1 6it/sec
Batch-size 10+ 12it/sec

🌌 other wrappers for bulk-ner nlp-thirdgate: https://github.com/nicolay-r/nlp-thirdgate
reacted to davanstrien's post with πŸ”₯ 29 days ago
view post
Post
2026
🌍 Big step for multilingual AI data!

The Hugging Face community has rated educational content in languages spoken by 1.6 billion people! New additions:
β€’ Japanese
β€’ Italian
β€’ Old High German

Learn more and contribute: https://huggingface.co/blog/davanstrien/fineweb2-community

These ratings can help enhance training data for major world languages.
  • 1 reply
Β·
reacted to davanstrien's post with πŸš€ about 2 months ago
view post
Post
2255
The data-is-better-together/fineweb-c dataset is growing!

This week a few more languages have got 1,000 annotations for the educational quality of data from HuggingFaceFW/fineweb-2.

Why should you care?

The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data ( HuggingFaceFW/blogpost-fineweb-v1).

Being able to filter by educational quality is on way of improving the quality of the data you use for training an LLM. Very importantly this approach can also reduce the amount of data needed for pertaining.

Why not use an LLM?

LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder only model to label the full dataset. However, this may not work well for languages outside of english. This is where fineweb-c (community) comes in.

The community is annotating the educational quality of fineweb2 data. Currently 114 languages have some annotations. These annotations will enable a number of things:

- Evaluate whether an LLM can label the educational quality for texts in that language well
- Directly be used for training quality classifiers
- Help discover other rules and huerisitcs for refining fineweb2 further for different languages.

This week the following languages where done:

Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod

Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate

Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap

Want to learn more: https://huggingface.co/blog/davanstrien/fineweb2-community

Contribute yourself here: data-is-better-together/fineweb-c
  • 1 reply
Β·
reacted to nroggendorff's post with πŸ˜” about 2 months ago
view post
Post
3707
im so tired
  • 3 replies
Β·
reacted to nroggendorff's post with βž• about 2 months ago
view post
Post
6333
hey nvidia, can you send me a gpu?
comment or react if you want ~~me~~ to get one too. πŸ‘‰πŸ‘ˆ
Β·
reacted to davanstrien's post with πŸ”₯ 2 months ago
view post
Post
1802
Introducing FineWeb-C πŸŒπŸŽ“, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu the community is labelling the educational quality of texts for many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c
replied to Kseniase's post 3 months ago
reacted to julien-c's post with πŸ‘ 3 months ago
view post
Post
9956
After some heated discussion πŸ”₯, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in Machine learning, to the benefit of the community πŸ”₯

cc: @reach-vb @pierric @victor and the HF team
Β·
reacted to thomwolf's post with πŸš€ 3 months ago
view post
Post
5763
We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of πŸ—£οΈlanguages.

We applied the same data-driven approach that led to SOTA English performance in🍷 FineWeb to thousands of languages.

πŸ₯‚ FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive πŸ“œ ODC-By 1.0 license, and the πŸ’» code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a πŸ“ blogpost walking you through the entire dataset creation process. Stay tuned!

In the mean time come ask us question on our chat place: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi
  • 2 replies
Β·
posted an update 3 months ago
view post
Post
1539
My latest project is the outcome of the last 2+ years working with TPUs from the amazing TPU Research Cloud (TRC) program and training Encoder-only LMs with the TensorFlow Model Garden library.

πŸ‘‰ Link: https://github.com/stefan-it/model-garden-lms

An overview of some features:

- Cheatsheet for setting-up a TPU VM Pod (with all necessary dependencies) to pretrain LMs with TF Model Garden
- Conversion scripts that convert TF Model Garden weights to Hugging Face Transformers-compatible models
- Supported architectures include BERT, BERT with Token Dropping and TEAMS

I also released BERT-based models pretrained on the great Hugging Face FineWeb and FineWeb-Edu datasets (10BT subset). With more to come!

πŸ‘‰ Model Hub Link: https://huggingface.co/model-garden-lms

If you find these resources useful, please give them a like!

Made from Bavarian Oberland with ❀️ and πŸ₯¨.
reacted to davanstrien's post with πŸ”₯ 3 months ago
view post
Post
2504
First dataset for the new Hugging Face Bluesky community organisation: https://huggingface.co/datasets/bluesky-community/one-million-bluesky-posts πŸ¦‹

πŸ“Š 1M public posts from Bluesky's firehose API
πŸ” Includes text, metadata, and language predictions
πŸ”¬ Perfect to experiment with using ML for Bluesky πŸ€—

Excited to see people build more open tools for a more open social media platform!
reacted to nataliaElv's post with πŸ‘€ 3 months ago
view post
Post
1645
Would you like to get a high-quality dataset to pre-train LLMs in your language? 🌏

At Hugging Face we're preparing a collaborative annotation effort to build an open-source multilingual dataset as part of the Data is Better Together initiative.

Follow the link below, check if your language is listed and sign up to be a Language Lead!

https://forms.gle/s9nGajBh6Pb9G72J6
reacted to takeraparterer's post with ❀️ 7 months ago
reacted to lamhieu's post with 🀯 7 months ago
view post
Post
2109
πŸŽ‰ Ghost 8B Beta Released: Game-Changing Language Model
--
Ghost 8B Beta is a groundbreaking language model developed with a clear vision: to deliver exceptional multilingual support, superior knowledge capabilities, and all while remaining cost-effective. This model comes in two context length variations, 8k and 128k, ensuring flexibility for various tasks. Moreover, it boasts built-in multilingual functionality, making it a powerful tool for global communication and understanding.
--
* See detailed article: https://huggingface.co/blog/lamhieu/ghost-8b-beta-released-game-changing-language-mode
* Model card: ghost-x/ghost-8b-beta
* Official website: https://ghost-x.org/docs/models/ghost-8b-beta
reacted to BramVanroy's post with πŸ”₯ 10 months ago
view post
Post
2294
πŸ₯³ New license for datasets: Apache 2.0!

I have been struggling mentally for many months now with the OpenAI terms of use that indicate that their model outputs cannot be used to build "competing models". This leads to many questions:

- what is the definition of competing? Is it the same as "commercial"?
- since this is part of the terms of use between OpenAI and the API user, can a third party still use the generated dataset to build competing models?
- are such restrictions even legal in the first place?

Trying to "follow the rules" as much as possible despite wanting to be as open as possible, I kept releasing my datasets under non-commercial licenses (which are too restrictive anyhow - nothing should prevent you from using the data in non-LM commercial settings), just like models trained on these datasets. This has put me at a competitive disadvantage compared to creators who do not follow the same approach and release their data/models on apache 2.0 despite the OpenAI "restrictions". Moreover, I fear (https://twitter.com/BramVanroy/status/1780220420316164246) that my approach blocks adaptation of my data/models for (commercial) applications/integrations.

Thankfully @Rijgersberg noted that these OpenAI terms of use are NOT explicit in the Azure OpenAI API (https://twitter.com/E_Rijgersberg/status/1780308971762450725). Since my latest datasets were created via Azure, this comes as a relief. As far as I can tell after digging through Azure docs, this allows me to change all recent GPT4-generated datasets to apache 2.0! πŸ₯³

- BramVanroy/ultrachat_200k_dutch
- BramVanroy/orca_dpo_pairs_dutch
- BramVanroy/ultra_feedback_dutch
- BramVanroy/ultra_feedback_dutch_cleaned
- BramVanroy/no_robots_dutch

I will have to mull over what I'll do for the older GPT3.5 datasets. What do you think that I should do?
Β·
replied to BramVanroy's post 11 months ago
view reply

Hey @BramVanroy ,

am I right that no timestamp is included in their released dataset?

E.g. the CulturaX dataset would include this information - which is very useful I think.

reacted to BramVanroy's post with ❀️ 12 months ago
view post
Post
πŸ—„οΈ Massive data release on the HF Hub for 75 languages!

https://huggingface.co/datasets/BramVanroy/hplt_monolingual_v1_2

In December of last year, HPLT (https://hplt-project.org/) released version 1.2 of their dataset. It covers web-crawled data of 75 languages!, in the raw format as well as deduplicated and cleaned sections. In total, we're talking about over 40TB of data! This data was already accessible via their website but I figured the accessibility could be improved by an integration with Hugging Face tooling. πŸ€— So I added the dataset here to the Hugging Face hub, enabing direct use in your conventional training pipelines for LLMs or other language technologies. The data will automatically be downloaded and optimised with just one line of code:

load_dataset("BramVanroy/hplt_mono_v1_2", "nl_cleaned")

Let's use this big blob of data to build something awesome in our languages! πŸ₯³
  • 1 reply
Β·
reacted to mrm8488's post with πŸ€— 12 months ago
view post
Post
Hello world! πŸ”₯
reacted to BramVanroy's post with ❀️ about 1 year ago
view post
Post
πŸ’‘We recently launched a Discord server for #Dutch #NLP and #LLMs. We have more than 50 users of _very_ varrying backgrounds! πŸ§™πŸ‘©β€πŸ”¬πŸ§‘β€πŸŽ¨πŸ§‘β€πŸ«πŸ§‘β€πŸ’Ό We've already had discussions on eval, tokenizers, RAG, data... A bit of everything. Everyone is welcome to work together on Dutch NLP and LLMs! https://discord.gg/YUUXVZkZJ9
Β·