German LLM Tokenizers
AI & ML interests
None defined yet.
german-llm-tokenizers's activity
cschroeder posted an update about 1 month ago
Post
526
Final Call and Deadline Extension: Survey on Data Annotation and Active Learning
Short summary: We need your support for a web survey in which we investigate how recent advancements in natural language processing, particularly LLMs, have influenced the need for labeled data in supervised machine learning – with a focus on, but not limited to, active learning. See the original post for details.
Extended Deadline: January 26th, 2025.
Please consider participating in or sharing our survey! (Anyone with experience in supervised learning for natural language processing is eligible to participate.)
Survey: https://bildungsportal.sachsen.de/umfragen/limesurvey/index.php/538271
cschroeder posted an update about 2 months ago
Post
413
Here's just one of the many exciting questions from our survey. If these topics resonate with you and you have experience working on supervised learning with text (i.e., supervised learning in Natural Language Processing), we warmly invite you to participate!
Survey: https://bildungsportal.sachsen.de/umfragen/limesurvey/index.php/538271
Estimated time required: 5–15 minutes
Deadline for participation: January 12, 2025
We're seeking responses from across the globe! If you know 1–3 people who might qualify for this survey, particularly those in different regions, please share it with them. We'd really appreciate it!
#NLProc #ActiveLearning #ML
cschroeder posted an update 2 months ago
Post
364
Looking for support: Have you ever had to overcome a lack of labeled data to deal with an NLP task?
Are you working on Natural Language Processing tasks, and have you faced the challenge of a lack of labeled data before? We are currently conducting a survey to explore the strategies used to address this bottleneck, especially in the context of recent advancements, including but not limited to large language models.
The survey is non-commercial and conducted solely for academic research purposes. The results will contribute to an open-access publication that also benefits the community.
With only 5–15 minutes of your time, you would greatly help us investigate which strategies the #NLP community uses to overcome a lack of labeled data.
How you can help even more: If you know others working on supervised learning and NLP, please share this survey with them; we'd really appreciate it!
Survey: https://bildungsportal.sachsen.de/umfragen/limesurvey/index.php/538271
Estimated time required: 5–15 minutes
Deadline for participation: January 12, 2025
#NLP #ML
Post
1539
My latest project is the outcome of the last 2+ years working with TPUs from the amazing TPU Research Cloud (TRC) program and training Encoder-only LMs with the TensorFlow Model Garden library.
Link: https://github.com/stefan-it/model-garden-lms
An overview of some features:
- Cheatsheet for setting up a TPU VM Pod (with all necessary dependencies) to pretrain LMs with TF Model Garden
- Conversion scripts that convert TF Model Garden weights to Hugging Face Transformers-compatible models
- Supported architectures include BERT, BERT with Token Dropping and TEAMS
I also released BERT-based models pretrained on the great Hugging Face FineWeb and FineWeb-Edu datasets (10BT subset). With more to come!
Model Hub Link: https://huggingface.co/model-garden-lms
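If you want to try one of these checkpoints, a minimal sketch with Hugging Face Transformers could look like the following; the model ID below is only an assumption, so check the organization page for the actual names.
```python
from transformers import pipeline

# Hypothetical model ID -- check https://huggingface.co/model-garden-lms for the
# names of the actually released FineWeb/FineWeb-Edu BERT checkpoints.
model_id = "model-garden-lms/bert-base-finewebs-1m"

# The released models are encoder-only LMs, so a fill-mask prediction is a quick sanity check.
fill_mask = pipeline("fill-mask", model=model_id)
print(fill_mask("Paris is the capital of [MASK]."))
```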
If you find these resources useful, please give them a like!
Made from Bavarian Oberland with ❤️ and 🥨.
cschroeder posted an update 3 months ago
Post
1089
New release: small-text v2.0.0.dev1
With small language models on the rise, the new version of small-text has been long overdue! Despite the generative AI hype, many real-world tasks still rely on supervised learning, which in turn depends on labeled data.
Highlights:
- Four new query strategies: Try even more combinations than before.
- Vector indices integration: HNSW and KNN indices are now available via a unified interface and can easily be used within your code.
- Simplified installation: We dropped the torchtext dependency and cleaned up a lot of interfaces.
GitHub: https://github.com/webis-de/small-text
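For readers new to the library, here is a rough sketch of a pool-based active learning loop. The class and function names follow the 1.x documentation and may have shifted slightly in v2.0.0.dev1, so treat this as an assumption and check the repo for the current API.
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from small_text import (
    ConfidenceEnhancedLinearSVC,
    PoolBasedActiveLearner,
    PredictionEntropy,
    SklearnClassifierFactory,
    SklearnDataset,
    random_initialization_balanced,
)

# Toy binary sentiment data, vectorized with TF-IDF.
texts = ["good movie", "bad movie", "great film", "terrible film"] * 25
labels = np.array([1, 0, 1, 0] * 25)
dataset = SklearnDataset(TfidfVectorizer().fit_transform(texts), labels)

# A classifier factory and a query strategy drive the pool-based active learner.
clf_factory = SklearnClassifierFactory(ConfidenceEnhancedLinearSVC(), 2)
active_learner = PoolBasedActiveLearner(clf_factory, PredictionEntropy(), dataset)

# Warm start with a small balanced labeled set, then query the next batch to annotate.
indices_initial = random_initialization_balanced(labels, n_samples=10)
active_learner.initialize_data(indices_initial, labels[indices_initial])
indices_queried = active_learner.query(num_samples=10)
print(indices_queried)
```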
Try it out for yourself! We are eager to hear your feedback.
Share your small-text applications and experiments in the newly added showcase section.
Support the project by leaving a star on the repo!
#activelearning #nlproc #machinelearning
cschroeder posted an update 4 months ago
Post
697
#EMNLP2024 is happening soon! Unfortunately, I will not be on site, but I will present our poster virtually on Wednesday, Nov 13 (7:45 EST / 13:45 CET) in Virtual Poster Session 2.
In this work, we leverage self-training in an active learning loop in order to train small language models with even less data. Hope to see you there!
cschroeder posted an update 6 months ago
Post
401
AI Training is Copyright Infringement
This bold claim is not my opinion; it was made in a recent "report" by a group whose stance is recognizable in its name, which roughly translates to "Authors' Rights Initiative". According to the LinkedIn post below, the report was also presented before the EU Parliament.
I am not really interested in politics, but as an EU citizen I am of course interested in a reasonable and practical version of the EU AI Act. I am not saying there should be no rules around data and AI, but this report is obviously very biased towards one side.
While I think the report itself does not deserve much attention, I am sharing it in the hope that you will find more examples where the issue is not addressed adequately. Feel free to add them to my LinkedIn post (where the original authors will see it) or here.
[en] Executive summary: https://urheber.info/media/pages/diskurs/ai-training-is-copyright-infringement/3b900058e6-1725460935/executive-summary_engl_final_29-08-2024.pdf
[de] Full report: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4946214
LinkedIn: https://www.linkedin.com/posts/activity-7238912869268959232-6cFx
cschroeder posted an update 6 months ago
Post
723
Liger Kernel: Efficient Triton Kernels for LLM Training
LIGER "is a [Hugging Face-compatible] collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduces memory usage by 60%."
GitHub: https://github.com/linkedin/Liger-Kernel
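As a rough illustration of how these kernels get used, the sketch below patches a Llama model before loading it. The `apply_liger_kernel_to_llama` entry point is taken from the project's README as far as I recall, so verify against the repo before relying on it.
```python
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

# Patch the Llama modules (RoPE, RMSNorm, SwiGLU, cross entropy, ...) with the
# Triton kernels; this must happen before the model is instantiated.
apply_liger_kernel_to_llama()

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
# ...then train as usual, e.g. with the Hugging Face Trainer.
```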
cschroeder posted an update 6 months ago
Post
350
ACL 2024: The Missing Papers
Apparently, some papers from ACL 2024 are still not listed in the ACL Anthology. While this issue will hopefully be fixed soon, we should give those papers some additional spotlight.
Some of my favorites:
1. Dolma is an English corpus that encompasses 3 trillion tokens. Additionally, it is accompanied by an exceptional software package that considerably advances the state of the art in preparing data for LLM pretraining. (Source: I am currently using Dolma.)
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research (2402.00159)
2. In the paper "Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models", the authors show how extending the context length impacts an LLM's reasoning performance. I asked myself a similar question a few months ago, and therefore this paper is highly interesting to me.
Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models (2402.14848)
This was brought to my attention through a LinkedIn post by @ShayeghB, who is also affected:
Ensemble-Based Unsupervised Discontinuous Constituency Parsing by Tree Averaging (2403.00143)
View all the missing papers here:
https://theshayegh.github.io/ACL2024MissingPapers/
cschroeder posted an update 6 months ago
Post
1882
Release: small-text v1.4.1
The new release contains some smaller bugfixes. Check it out!
GitHub: https://github.com/webis-de/small-text
Paper: Small-Text: Active Learning for Text Classification in Python (2107.10314)
cschroeder updated 3 models 8 months ago
cschroeder posted an update 9 months ago
Post
1471
Release: small-text v1.4.0
The new version provides a small-text-compatible implementation of the recent AnchorAL strategy by @pietrolesci.
GitHub: https://github.com/webis-de/small-text
Paper: https://aclanthology.org/2023.eacl-demo.11/
AnchorAL: AnchorAL: Computationally Efficient Active Learning for Large and Imbalanced Datasets (2404.05623)
cschroeder authored 2 papers 10 months ago

stefan-it authored 2 papers over 1 year ago

stefan-it authored a paper almost 2 years ago