Vikas Kumar's picture

Vikas Kumar

vikas

AI & ML interests

None yet

Recent Activity

Organizations

Hugging Face Discord Community's profile picture

vikas's activity

upvoted an article 9 days ago
view article
Article

Train 400x faster Static Embedding Models with Sentence Transformers

โ€ข 120
upvoted an article 5 months ago
view article
Article

The 5 Most Under-Rated Tools on Hugging Face

โ€ข 86
upvoted an article 6 months ago
upvoted 2 articles 6 months ago
view article
Article

Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth

By mlabonne โ€ข
โ€ข 265
view article
Article

Llama 3.1 - 405B, 70B & 8B with multilinguality and long context

โ€ข 226
upvoted 2 articles 6 months ago
view article
Article

SmolLM - blazingly fast and remarkably powerful

โ€ข 299
upvoted an article 7 months ago
view article
Article

ColPali: Efficient Document Retrieval with Vision Language Models ๐Ÿ‘€

By manu โ€ข
โ€ข 192
reacted to fdaudens's post with ๐Ÿ‘ 7 months ago
view post
Post
3354
๐Ÿง  How to create more diverse, realistic synthetic AI training data?

@TencentAIGC-Lab AI Lab created @proj-persona , a vast collection of 1 billion diverse personas, to help create synthetic data with LLMs that encapsulate a wide array of perspectives, knowledge, experiences, interests, and professions.

These personas were created with automatically curated data, representing approximately 13% of the worldโ€™s total population.

๐Ÿ’ก The authors argue that integrating a persona into data synthesis prompts effectively steers LLMs to adopt specific perspectives, creating unique and relevant synthetic data with minimal effort.

They showcased various practical applications of Persona Hub to demonstrate its effectiveness and versatility in various synthetic data creation scenarios: mathematical and logical reasoning problems, simulating diverse user requests and prompts for LLMs, generating informative and detailed text content across various topics, and more.

๐Ÿš€ It's one of the trending datasets on Hugging Face. Digging into it is quite fun! I found one that reminds me of several people I know: "A journalist who covers technology and innovation in the print and digital media industries." It helped generate the prompt attached to this post (about which I'd be curious to know your answers ๐Ÿ˜‰).

Synthetic data is a hot topic in AI. It will be interesting to see if this research could help make LLMs more robust, versatile, and capable of handling a wide array of real-world scenarios.

๐Ÿ‘‰Explore the dataset: proj-persona/PersonaHub
๐Ÿ‘‰ Read the paper: https://arxiv.org/pdf/2406.20094
reacted to mrm8488's post with โค๏ธ 7 months ago
view post
Post
4963
๐ŸšจExciting news for the Multilingual Synthetic Data Community!๐Ÿšจ

Iโ€™ve taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Hereโ€™s whatโ€™s new!

๐Ÿ—ž The MAGPIE paper showcased that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on this dataset, you can improve even the it-tuned version

๐Ÿค” While reading a script by Sebastian Raschka, PhD, I wondered: Could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?

๐ŸŽ‰ And the answer is YES! At least for Spanish. I've successfully adapted the techniques for Spanish, proving the model's flexibility and multilingual capabilities.

๐Ÿ‘ฉโ€๐Ÿ’ป To make this accessible, I created a basic script (heavily inspired by the Sebastian Raschka one) that allows you to generate similar datasets using ollama models (initially phi and llama3) automatically and upload it to the Hugging Face Hub!
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)


๐Ÿ” Explore the datasets ๐Ÿ“š generated using our new script!

- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)


Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.

Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/
ยท
upvoted an article 7 months ago
view article
Article

Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models

โ€ข 184