Transformers & MoE - a RichardForests Collection

Transformers & MoE

updated May 21, 2024

SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

Paper • 2312.07987 • Published Dec 13, 2023 • 41

Interfacing Foundation Models' Embeddings

Paper • 2312.07532 • Published Dec 12, 2023 • 11

Point Transformer V3: Simpler, Faster, Stronger

Paper • 2312.10035 • Published Dec 15, 2023 • 18

TheBloke/quantum-v0.01-GPTQ

Text Generation • Updated Dec 18, 2023 • 20 • 2

TheBloke/PiVoT-MoE-GPTQ

Text Generation • Updated Dec 17, 2023 • 29 • 1

mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ

Text Generation • Updated 4 days ago • 29 • 38

Denoising Vision Transformers

Paper • 2401.02957 • Published Jan 5, 2024 • 29

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Paper • 2401.06066 • Published Jan 11, 2024 • 47

Buffer Overflow in Mixture of Experts

Paper • 2402.05526 • Published Feb 8, 2024 • 8

Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory

Paper • 2405.08707 • Published May 14, 2024 • 30