Nishith Jain

KingNish

AI & ML interests

AI is fun actually. Busy till June 2025.

Recent Activity

liked a Space 3 days ago
modelscope/modelscope-studio
liked a Space 3 days ago
Remsky/Kokoro-TTS-Zero
liked a Space 4 days ago
lllyasviel/iclight-v2

Organizations

Wikimedia, OpenGVLab, AMD, Blog-explorers, Multi🤖Transformers, The Collectionists, HelpingAI, ZeroGPU Explorers, Project Fluently, Poscye, INNOVA AI, Narra, Social Post Explorers, Cognitive Computations, Dev Mode Explorers, Refine AI, Stable Diffusion Community (Unofficial, Non-profit), ONNX Community, Hugging Face Discord Community, qpump

KingNish's activity

reacted to hexgrad's post with 🔥 13 days ago
🇬🇧 Four British voices have joined hexgrad/Kokoro-82M (Apache TTS model): bf_emma, bf_isabella, bm_george, bm_lewis
reacted to julien-c's post with 🤗🔥 about 1 month ago
After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team
reacted to clem's post with ❤️ about 1 month ago
Hugging Face is becoming the best place to share the most viral AI apps with spaces.

Kolors Virtual Try-on just crossed 6,000,000 unique visitors & is now the #5 most popular space. Congrats to the Kwai Kolors team!

Kwai-Kolors/Kolors-Virtual-Try-On
reacted to hexgrad's post with 🔥 about 1 month ago
self.brag(): Kokoro finally got 300 votes in Pendrokar/TTS-Spaces-Arena after @Pendrokar was kind enough to add it 3 weeks ago.
Discounting the small sample size of votes, I think it is safe to say that hexgrad/Kokoro-TTS is currently a top 3 model among the contenders in that Arena. This is notable because:
- At 82M params, Kokoro is one of the smaller models in the Arena
- MeloTTS has 52M params
- F5 TTS has 330M params
- XTTSv2 has 467M params
reacted to prithivMLmods's post with 🔥 about 2 months ago
HF Posts Receipts 🏆🚀

[ HF POSTS RECEIPT ] : prithivMLmods/HF-POSTS-RECEIPT

The one thing that needs to be remembered is the 'username'.

And yeah, thank you, @maxiw, for creating the awesome dataset and sharing it here! 🙌

[ Dataset ] : maxiw/hf-posts

.
.
.
@prithivMLmods
reacted to merve's post with 🔥 about 2 months ago
Small yet mighty! 💫

We are releasing SmolVLM: a new 2B small vision language model made for on-device use, fine-tunable on a consumer GPU, and immensely memory efficient 🤠

We release three checkpoints under Apache 2.0: SmolVLM-Instruct, SmolVLM-Synthetic and SmolVLM-Base HuggingFaceTB/smolvlm-6740bd584b2dcbf51ecb1f39

Learn more from our blog here: huggingface.co/blog/smolvlm
This release comes with a demo, fine-tuning code, MLX integration and TRL integration for DPO
Try the demo: HuggingFaceTB/SmolVLM
Fine-tuning Recipe: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
Also TRL integration for DPO 💗
reacted to merve's post with 🔥 2 months ago
Another great week in open ML!
Here's a small recap 🫰🏻

Model releases
⏯️ Video Language Models
AI at Meta released Vision-CAIR/LongVU_Qwen2_7B, a new state-of-the-art long-video LM based on DINOv2, SigLIP, Qwen2 and Llama 3.2

💬 Small language models
Hugging Face released HuggingFaceTB/SmolLM2-1.7B, a family of new smol language models with Apache 2.0 license that come in sizes 135M, 360M and 1.7B, along with datasets.
Meta released facebook/MobileLLM-1B, a new family of on-device LLMs of sizes 125M, 350M and 600M

🖼️ Image Generation
Stability AI released stabilityai/stable-diffusion-3.5-medium, a 2B model with a commercially permissive license

🖼️💬 Any-to-Any
gpt-omni/mini-omni2, the closest reproduction of GPT-4o, has been released: a new LLM that takes image-text-audio input and outputs speech!

Dataset releases
🖼️ Spawning/PD12M, a new captioning dataset of 12.4 million examples generated using Florence-2
reacted to prithivMLmods's post with 👍 2 months ago
New Droppings 🥳

😶‍🌫️ Collection: prithivMLmods/flux-lora-collections-66dd5908be2206cfaa8519be

🥳 Demo Here: prithivMLmods/FLUX-LoRA-DLC with 100+ Flux LoRAs

🪨 Fluid Dramatic Neon: prithivMLmods/Castor-Dramatic-Neon-Flux-LoRA
🪨 Past & Present Blend: prithivMLmods/Past-Present-Deep-Mix-Flux-LoRA
🪨 Tarot Cards Refreshed Themes: prithivMLmods/Ton618-Tarot-Cards-Flux-LoRA
🪨 Amxtoon Character Mix Real-Anime: prithivMLmods/Ton618-Amxtoon-Flux-LoRA
🪨 Epic Realism Flux v1: prithivMLmods/Ton618-Epic-Realism-Flux-LoRA
🪨 Mock-up Textures: prithivMLmods/Mockup-Texture-Flux-LoRA
.
.
.
@prithivMLmods 🤗
reacted to thomwolf's post with 🚀 3 months ago
Is it time for the open-source AI robots revolution 🚀?

With @haixuantao and @Leyo we've been playing with a low-cost DJI robot controlled by three local open-source AI models (Whisper, Idefics2, Parler-TTS - all Apache2) and orchestrated by dora-rs.

Links to find all the hardware/software we used in the demo:
- robot control framework – dora-rs: https://github.com/dora-rs/dora
- speech-to-text model – whisper: openai/whisper-base
- vision-text model – Idefics2: HuggingFaceM4/idefics2-8b-AWQ
- text-to-speech model – ParlerTTS mini: parler-tts/parler_tts_mini_v0.1
- robot: https://dji.com/robomaster-s1
- code gist: https://gist.github.com/haixuanTao/860e1740245dc2c8dd85b496150a9320
- Larger codebase: dora-rs/dora-idefics2
- laptop/pc: any with a recent GPU card (ours has an RTX 4090)

Enjoy!
reacted to singhsidhukuldeep's post with 👀 3 months ago
Good folks at @Apple have developed a novel method called KV Prediction that significantly reduces the "time to first token" (TTFT) for on-device LLM inference.

Some highlights of the paper:

• Uses a small auxiliary transformer model to efficiently predict the KV cache of a larger base model
• Reduces TTFT by up to 4x while retaining 60-80% accuracy on benchmarks
• Achieves Pareto-optimal efficiency-accuracy trade-off compared to baselines
• Demonstrates 15-50% relative accuracy improvements on TriviaQA at equal TTFT FLOP budgets
• Shows up to 30% accuracy gains on HumanEval code completion at fixed TTFT FLOP counts
• Validated on Apple M2 Pro CPU, proving FLOP gains translate to real-world speedups


So, how's it done?

Based on the KV Prediction method described in the paper, here are the key steps:

1. Choose a base model and an auxiliary model:
- The base model is a larger, pretrained transformer model that will be used for final generation.
- The auxiliary model is a smaller transformer model used to efficiently process the input prompt.

2. Design the KV predictor:
- Create a set of learned linear projections to map from the auxiliary model's KV cache to the base model's KV cache.
- Define a mapping from auxiliary cache layers to base cache layers.

3. Training process:
- Pass input tokens through the auxiliary model to get its KV cache.
- Use the KV predictor to generate a predicted KV cache for the base model.
- Run the base model using the predicted KV cache and compute losses.
- Backpropagate errors through the frozen base model to update the auxiliary model and KV predictor.

4. Inference process:
- Process the input prompt with the auxiliary model to get its KV cache.
- Use the KV predictor to generate the predicted base model KV cache.
- Run a single token generation step with the base model using the predicted KV cache.
- Continue autoregressive generation with the base model as normal.
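
The KV predictor from step 2 and its use in step 4 can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions, not the paper's code: the names `KVPredictor`, `layer_map`, `toy_aux_cache`, and all dimensions are hypothetical, the projection matrices are randomly initialised stand-ins for learned weights, and the training loop from step 3 is left out.

```python
import random


def random_matrix(rows, cols, rng):
    """Stand-in for a learned projection matrix (randomly initialised here)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(matrix, vector):
    """Apply a linear projection: (rows x cols) matrix times a cols-long vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class KVPredictor:
    """Hypothetical predictor: one (W_k, W_v) pair of projections per base layer."""

    def __init__(self, aux_dim, base_dim, layer_map, seed=0):
        rng = random.Random(seed)
        # layer_map[i] is the auxiliary layer whose cache feeds base layer i.
        self.layer_map = layer_map
        self.projections = [
            (random_matrix(base_dim, aux_dim, rng),
             random_matrix(base_dim, aux_dim, rng))
            for _ in layer_map
        ]

    def predict(self, aux_cache):
        """Map the auxiliary KV cache to a predicted base-model KV cache."""
        base_cache = []
        for (w_k, w_v), aux_layer in zip(self.projections, self.layer_map):
            keys, values = aux_cache[aux_layer]
            base_cache.append((
                [project(w_k, k) for k in keys],
                [project(w_v, v) for v in values],
            ))
        return base_cache


def toy_aux_cache(num_layers, num_tokens, dim, seed=1):
    """Fake auxiliary-model cache: per layer, (keys, values) with one vector per token."""
    rng = random.Random(seed)

    def vectors():
        return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]

    return [(vectors(), vectors()) for _ in range(num_layers)]


# Assumed toy sizes: a 2-layer auxiliary model feeding a 4-layer base model.
aux_cache = toy_aux_cache(num_layers=2, num_tokens=3, dim=4)
predictor = KVPredictor(aux_dim=4, base_dim=8, layer_map=[0, 0, 1, 1])
predicted = predictor.predict(aux_cache)

# Prints: number of base layers, prompt tokens, and base key dimension.
print(len(predicted), len(predicted[0][0]), len(predicted[0][0][0]))
```

In a real setup the base model would consume `predicted` for its first generation step instead of running its own prefill over the prompt, which is where the TTFT saving would come from.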

Excited to hear your thoughts!
reacted to Pendrokar's post with 🔥 3 months ago
Made a notable change to the TTS Arena fork. I do not think anyone is interested in which bottom-feeder TTS model beats the one next to it, so one of the top 5 TTS models is now always chosen in a challenge for more scrutiny. These top 5 are taken from preliminary results.
Pendrokar/TTS-Spaces-Arena
reacted to victor's post with 🔥 3 months ago
reacted to reach-vb's post with 🔥 3 months ago
Multimodal Ichigo Llama 3.1 - Real Time Voice AI 🔥

> WhisperSpeech X Llama 3.1 8B
> Trained on 50K hours of speech (7 languages)
> Continually trained for 45 hrs on 10x A1000s
> MLS -> WhisperVQ tokens -> Llama 3.1
> Instruction tuned on 1.89M samples
> 70% speech, 20% transcription, 10% text
> Apache 2.0 licensed ⚡

Architecture:
> WhisperSpeech/ VQ for Semantic Tokens
> Llama 3.1 8B Instruct for Text backbone
> Early fusion (Chameleon)

I'm super bullish on HomeBrew/ Jan and early fusion, audio and text, multimodal models!

(P.S. Play with the demo on Hugging Face: jan-hq/Ichigo-llama3.1-s-instruct)
reacted to mervenoyan's post with 🔥 3 months ago
reacted to DeFactOfficial's post with 🚀 3 months ago
Ladies and Gents, please try my new Assistant, Image Gen - Uncensored Edition, on HuggingChat.

https://hf.co/chat/assistant/66fccce0c0fafc94ab557ef2

This is a multimodal assistant: Qwen 2.5 72B + SOTA diffusion models for image generation. Same architecture as Image Gen+ but with some MAJOR improvements ! These are as follows:

- Switched the LLM to Qwen 2.5 72B, the most powerful model currently available on HuggingChat. This results in higher quality prompts for the txt2img model and much better adherence to the prompt-url format that the upstream provider requires (image gen models are hosted by pollinations as with most other assistants on huggingchat that offer image generation).

- Cleaned up the system prompt including the examples of the prompt-in-url format, and adjusted the logic that determines how many images to generate based on the quality of user prompt... these changes further improve

- Assistant has access to multiple image generation models and will by default choose whatever model is most appropriate for the task. This includes NSFW generations, which it makes using an uncensored SD3 turbo. For other workloads, the Assistant preferentially uses one of the flux variants or any-dark (an artistic SDXL finetune), based on the nature of the task. Available models include turbo, flux, flux-realism, flux-anime, flux-3d, any-dark

- Added verbiage to system prompt which greatly reduces censorship / refusals by the LLM (the txt2img models are uncensored to start off)

Here are the user-entered prompts used to create the images you see here... feel free to try them yourself!

"Ayatollah Khameini and Kamala Harris having a secret romantic rendezvous. Use flux-realism model"
"A self portrait of your consciousness"
"The chien of andalous, in a psychedelic style"
"Make me 4 paintings in the style of Frida Kahlo that I can sell to tourists in a mexican hippie town"
"Paint me a van gogh and greg rutkowski style scene involving elephants and gerbils"
reacted to victor's post with 🤗 3 months ago
NEW - Inference Playground

Maybe, like me, you have always wanted a super easy way to compare llama3.2-1B vs. llama3.2-3B, or the same model with different temperatures?

Trying and comparing warm Inference API models has never been easier!
Just go to https://hf.co/playground, set your token and you're ready to go.
We'll keep improving, feedback welcome 😊
reacted to reach-vb's post with 🔥 3 months ago
NEW: Open Source Text/ Image to video model is out - MIT licensed - Rivals Gen-3, Pika & Kling 🔥

> Pyramid Flow: Training-efficient Autoregressive Video Generation method
> Utilizes Flow Matching
> Trains on open-source datasets
> Generates high-quality 10-second videos
> Video resolution: 768p
> Frame rate: 24 FPS
> Supports image-to-video generation

> Model checkpoints available on the hub 🤗: rain1011/pyramid-flow-sd3
reacted to m-ric's post with 🔥 3 months ago
Rhymes AI drops Aria: small Multimodal MoE that beats GPT-4o and Gemini-1.5-Flash ⚡️

A new player has entered the game! Rhymes AI has just been announced and has unveiled Aria, a multimodal powerhouse that's punching above its weight.

Key insights:

🧠 Mixture-of-Experts architecture: 25.3B total params, but only 3.9B active.

🌈 Multimodal: text/image/video → text.

📚 Novel training approach: “multimodal-native” where multimodal training starts directly during pre-training, not just tacked on later

Long 64K token context window

🔓 Apache 2.0 license, with weights, code, and demos all open

⚡️ On the benchmark side, Aria leaves some big names in the dust.

- It beats Pixtral 12B and Llama-3.2-11B on several vision benchmarks like MMMU or MathVista.
- It even outperforms the much bigger GPT-4o on long video tasks and outshines Gemini 1.5 Flash when it comes to parsing lengthy documents.

But Rhymes AI isn't just showing off benchmarks. They've already got Aria powering a real-world augmented search app called “BeaGo”. It's handling even recent events with great accuracy!

And they partnered with AMD to make it much faster than competitors like Perplexity or Gemini search.

Read their paper for Aria 👉 Aria: An Open Multimodal Native Mixture-of-Experts Model (2410.05993)

Try BeaGo 🐶 👉 https://rhymes.ai/blog-details/introducing-beago-your-smarter-faster-ai-search
reacted to merve's post with 🔥 3 months ago
Meta AI vision has been cooking @facebook
They shipped multiple models and demos for their papers at @ECCV 🤗

Here's a compilation of my top picks:
- Sapiens is a family of foundation models for human-centric depth estimation, segmentation and more; all models have open weights and demos

All models have their demos and even torchscript checkpoints!
A collection of models and demos: facebook/sapiens-66d22047daa6402d565cb2fc
- VFusion3D is a state-of-the-art consistent 3D generation model from images

Model: facebook/vfusion3d
Demo: facebook/VFusion3D

- CoTracker is the state-of-the-art point (pixel) tracking model

Demo: facebook/cotracker
Model: facebook/cotracker