Cognition
Perception and abstraction. Each modality is tokenized and embedded into vectors for the model to comprehend.
Paper • 2407.17453 • Published • 39
Note: A general model is not great at specialized tasks. A narrow-domain fine-tuned checkpoint does better on specific tasks, and such local improvements can be fed back into the full training dataset, yielding improvement through self-augmentation. This is an interesting idea.
Octopus v4: Graph of language models
Paper • 2404.19296 • Published • 116
Note: Uses a small language model to search the graph and route to the domain expert.
Octo-planner: On-device Language Model for Planner-Action Agents
Paper • 2406.18082 • Published • 47
Note: Automatic flow engineering done by a fine-tuned 3B LLM, grounded on a select set of API-based functions. The planning model performs task decomposition but does not make the specific calls, so it is effectively doing flow (prompt) engineering. Topology in the plans is lacking, and the static plan-ahead approach is less robust (although it does well on their curated 1k test dataset).
Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models
Paper • 2408.15518 • Published • 42
Iterative Graph Alignment
Paper • 2408.16667 • Published • 2
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Paper • 2408.16725 • Published • 52
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • 2403.09611 • Published • 125
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper • 2403.05525 • Published • 39
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Paper • 2403.10517 • Published • 32
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Paper • 2409.02889 • Published • 55
Law of Vision Representation in MLLMs
Paper • 2408.16357 • Published • 92
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper • 2408.05211 • Published • 47
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper • 2408.01800 • Published • 79
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 72
WaveletGPT: Wavelets Meet Large Language Models
Paper • 2409.12924 • Published • 1
Note: Treats the intermediate embedding sequences as a set of signals and applies 1D convolution along the temporal axis, similar in some sense to ConvMixer's manipulation. Experiments are conducted on transformer pre-training, and the paper reports interesting results. Unfortunately, no 'wave' is actually applied and no 'periodic' information is captured.
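To make the note's reading concrete, here is a minimal sketch of a causal 1D convolution along the temporal axis of an embedding sequence. The uniform (moving-average) kernel, the function name, and the toy data are assumptions chosen for illustration; this is not the paper's exact operator.

```python
def causal_moving_average(embeddings, kernel_size):
    """Smooth each embedding channel along time with a causal uniform kernel.

    embeddings: list of T vectors (lists of floats), each of length D.
    Position t only sees positions <= t, so the autoregressive
    property of the sequence is preserved.
    """
    smoothed = []
    for t in range(len(embeddings)):
        start = max(0, t - kernel_size + 1)
        window = embeddings[start:t + 1]
        dim = len(embeddings[t])
        smoothed.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return smoothed


# Toy sequence: T = 3 time steps, D = 2 channels (hypothetical values).
seq = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
smoothed = causal_moving_average(seq, kernel_size=2)
print(smoothed)  # [[1.0, 0.0], [2.0, 1.0], [4.0, 3.0]]
```

Each channel is filtered independently over time, which is what makes this a 1D convolution on the temporal axis rather than any mixing across embedding dimensions.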
ClaimVer: Explainable Claim-Level Verification and Evidence Attribution of Text Through Knowledge Graphs
Paper • 2403.09724 • Published • 1
Learning Iterative Reasoning through Energy Diffusion
Paper • 2406.11179 • Published • 1
Note: Newton's introduction of gravity illustrates how understanding derivatives (knowing how things move rather than just where they are) enhances reasoning about the world. Large language models (LLMs), while excelling at compressing data distributions, struggle with reasoning. Reasoning involves grasping the 'abstract structure' of data. Therefore, by modeling derivatives of data distributions, could we improve LLMs' reasoning capabilities?
Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding
Paper • 2106.02795 • Published • 1
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 104
Can LLMs Reason in the Wild with Programs?
Paper • 2406.13764 • Published • 1
MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models
Paper • 2409.17481 • Published • 46
Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
Paper • 2409.17115 • Published • 60
Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization
Paper • 2403.03419 • Published • 1
Emu3: Next-Token Prediction is All You Need
Paper • 2409.18869 • Published • 93
Note: Tokenization unifies perception and generation; end-to-end training on discrete multimodal signals enables both.
Can Models Learn Skill Composition from Examples?
Paper • 2409.19808 • Published • 8
Not All LLM Reasoners Are Created Equal
Paper • 2410.01748 • Published • 28
RATIONALYST: Pre-training Process-Supervision for Improving Reasoning
Paper • 2410.01044 • Published • 34
Intelligence at the Edge of Chaos
Paper • 2410.02536 • Published • 6
Note: Intelligence is very likely the ability to model higher-order derivatives given lower-order observations.
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
Paper • 2410.02155 • Published • 2
Note: MLLMs usually project a continuous image embedding onto the hidden space of the LLM. Vector quantization (VQ) converts an image into discrete codes representing each of its patches; these tokens can be fed into the LLM in much the same way as text tokens, as new embedding vectors. A natural extension is therefore to reuse the BPE approach on these image tokens, which is precisely what happens in this work. However, I
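As a rough illustration of reusing BPE on VQ codes, the sketch below performs a single BPE merge step over integer code sequences. It assumes each image has already been quantized and flattened into a 1D sequence of codebook indices (the paper may handle the 2D layout differently); the function names, the toy codes, and the new token id 1000 are all hypothetical.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most common adjacent pair of codes across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


# Toy VQ code sequences for two images (hypothetical codebook indices).
codes = [[7, 3, 7, 3, 5], [7, 3, 9]]
pair = most_frequent_pair(codes)
merged = [merge_pair(seq, pair, new_id=1000) for seq in codes]
print(pair)    # (7, 3)
print(merged)  # [[1000, 1000, 5], [1000, 9]]
```

Repeating the merge step grows the vocabulary with frequently co-occurring patch patterns, and each new id would get its own embedding vector, just as a merged text token does.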
Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation
Paper • 2410.02725 • Published • 1
Selective Attention Improves Transformer
Paper • 2410.02703 • Published • 23
Note: "If two computer programs perform the same task, the shorter one is generally better." This principle, known as Occam's Razor, is a critical guideline for scientific discovery. Our best program today is the Transformer. Can we make it more efficient? Selective attention improves the Transformer by allowing each token to decide whether previous context is still relevant for future tokens.
FAN: Fourier Analysis Networks
Paper • 2410.02675 • Published • 24
EmbedLLM: Learning Compact Representations of Large Language Models
Paper • 2410.02223 • Published • 3
Model Comparisons: XNet Outperforms KAN
Paper • 2410.02033 • Published • 1
Don't flatten, tokenize! Unlocking the key to SoftMoE's efficacy in deep RL
Paper • 2410.01930 • Published • 1
Addition is All You Need for Energy-efficient Language Models
Paper • 2410.00907 • Published • 144
ε-VAE: Denoising as Visual Decoding
Paper • 2410.04081 • Published • 7
Note: I find it strange to view an encoder that produces an embedding vector as a type of tokenization. The transformer then effectively has two tokenization processes: a discrete one and then a continuous one?
Emergent properties with repeated examples
Paper • 2410.07041 • Published • 8
Note: Compression requires redundancy; otherwise it's just memorization.
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Paper • 2410.06981 • Published • 2
Executing Arithmetic: Fine-Tuning Large Language Models as Turing Machines
Paper • 2410.07896 • Published • 2
Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-Based Decoding
Paper • 2408.08252 • Published • 1
From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions
Paper • 2410.08197 • Published • 1
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Paper • 2410.06940 • Published • 6
LeanAgent: Lifelong Learning for Formal Theorem Proving
Paper • 2410.06209 • Published • 1
SimpleStrat: Diversifying Language Model Generation with Stratification
Paper • 2410.09038 • Published • 4
Retriever-and-Memory: Towards Adaptive Note-Enhanced Retrieval-Augmented Generation
Paper • 2410.08821 • Published • 1
Discrete Flow Matching
Paper • 2407.15595 • Published • 12
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
Paper • 2410.11081 • Published • 19
EVOLvE: Evaluating and Optimizing LLMs For Exploration
Paper • 2410.06238 • Published • 1
Neural Metamorphosis
Paper • 2410.11878 • Published • 8
Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming
Paper • 2410.12112 • Published • 1
Steering Large Language Models between Code Execution and Textual Reasoning
Paper • 2410.03524 • Published • 1
A Scalable Communication Protocol for Networks of Large Language Models
Paper • 2410.11905 • Published • 1
Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL
Paper • 2410.12491 • Published • 4
Revealing the Barriers of Language Agents in Planning
Paper • 2410.12409 • Published • 24
Learning to Compress: Local Rank and Information Compression in Deep Neural Networks
Paper • 2410.07687 • Published • 1
Grandmaster-Level Chess Without Search
Paper • 2402.04494 • Published • 67
Instruction-Driven Game Engine: A Poker Case Study
Paper • 2410.13441 • Published • 1
Transformer Guided Coevolution: Improved Team Formation in Multiagent Adversarial Games
Paper • 2410.13769 • Published • 1
Learning Graph Quantized Tokenizers for Transformers
Paper • 2410.13798 • Published • 1
Fine-Tuning Discrete Diffusion Models via Reward Optimization with Applications to DNA and Protein Design
Paper • 2410.13643 • Published
Learning to Route with Confidence Tokens
Paper • 2410.13284 • Published • 1
An Evolved Universal Transformer Memory
Paper • 2410.13166 • Published • 3
Artificial Kuramoto Oscillatory Neurons
Paper • 2410.13821 • Published • 1
TopoLM: brain-like spatio-functional organization in a topographic language model
Paper • 2410.11516 • Published • 1
Autoregressive Image Generation without Vector Quantization
Paper • 2406.11838 • Published • 3
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Paper • 2404.16710 • Published • 75
DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing
Paper • 2410.12189 • Published • 1
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
Paper • 2410.13276 • Published • 25
Do LLMs "know" internally when they follow instructions?
Paper • 2410.14516 • Published • 1
Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models
Paper • 2410.10846 • Published • 2
One-Step Diffusion Distillation through Score Implicit Matching
Paper • 2410.16794 • Published • 2
Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass
Paper • 2405.18400 • Published • 1
Lightweight Neural App Control
Paper • 2410.17883 • Published • 9
Literature Meets Data: A Synergistic Approach to Hypothesis Generation
Paper • 2410.17309 • Published • 1
Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration
Paper • 2410.18076 • Published • 4
Note: Encodes interaction trajectories into "skill vectors" that act like abstract concepts: a skill decoder (low-level policy) translates them into specific actions based on the current state, similar to how our concepts become concrete actions in different situations. By relabeling experiences with these skills, they train a high-level policy to select optimal skills that maximize rewards. This hierarchical approach hints at the possibility for AI systems to formulate and think in their own-curated a
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting
Paper • 2410.17856 • Published • 49
Non-myopic Generation of Language Model for Reasoning and Planning
Paper • 2410.17195 • Published • 1
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Paper • 2410.17434 • Published • 25
Unbounded: A Generative Infinite Game of Character Life Simulation
Paper • 2410.18975 • Published • 35
ToolGen: Unified Tool Retrieval and Calling via Generation
Paper • 2410.03439 • Published • 1
Accelerating Exploration with Unlabeled Prior Data
Paper • 2311.05067 • Published • 1
Note: Random network distillation provides an extra reward that encourages exploration in RL.
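For readers unfamiliar with random network distillation, a minimal sketch of the exploration bonus follows. A fixed, randomly initialized target maps states to features, a predictor is trained to match it on visited states, and the prediction error serves as the intrinsic reward. Both networks are reduced to single linear maps here, and the dimensions, learning rate, and states are hypothetical; real implementations use neural networks, and this is not the paper's full method.

```python
import random

rng = random.Random(0)
DIM_IN, DIM_OUT = 3, 2  # hypothetical state and feature sizes


def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


target_w = random_matrix(DIM_OUT, DIM_IN)     # fixed, never trained
predictor_w = random_matrix(DIM_OUT, DIM_IN)  # trained to imitate the target


def linear(weights, state):
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def intrinsic_reward(state):
    """Squared prediction error: large for rarely visited states."""
    target = linear(target_w, state)
    pred = linear(predictor_w, state)
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def train_predictor(state, lr=0.1):
    """One gradient step on the squared error for a visited state."""
    target = linear(target_w, state)
    pred = linear(predictor_w, state)
    for j in range(DIM_OUT):
        err = pred[j] - target[j]
        for i in range(DIM_IN):
            predictor_w[j][i] -= lr * 2.0 * err * state[i]


visited, novel = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
visited_before = intrinsic_reward(visited)
novel_before = intrinsic_reward(novel)
for _ in range(20):  # the agent keeps revisiting the same state
    train_predictor(visited)
visited_after = intrinsic_reward(visited)
novel_after = intrinsic_reward(novel)
print(visited_before, visited_after)  # bonus shrinks for the visited state
print(novel_before, novel_after)      # bonus is unchanged for the novel state
```

In an RL loop, this bonus would be added to the environment reward, so states the predictor has not yet learned remain attractive to the agent.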
Efficient Online Reinforcement Learning with Offline Data
Paper • 2302.02948 • Published • 2
Note: Reuses previous experience to increase RL learning efficiency.
Scaling Diffusion Language Models via Adaptation from Autoregressive Models
Paper • 2410.17891 • Published • 15
Diffusion for World Modeling: Visual Details Matter in Atari
Paper • 2405.12399 • Published • 28
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
Paper • 2410.13835 • Published • 1
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Paper • 2410.17247 • Published • 45
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
Paper • 2410.10812 • Published • 15
MCSD: An Efficient Language Model with Diverse Fusion
Paper • 2406.12230 • Published • 1
The Scene Language: Representing Scenes with Programs, Words, and Embeddings
Paper • 2410.16770 • Published • 1
Pyramidal Flow Matching for Efficient Video Generative Modeling
Paper • 2410.05954 • Published • 38
Energy-Based Diffusion Language Models for Text Generation
Paper • 2410.21357 • Published • 1
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 12
nGPT: Normalized Transformer with Representation Learning on the Hypersphere
Paper • 2410.01131 • Published • 9
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper • 2410.23218 • Published • 46
Inference Optimal VLMs Need Only One Visual Token but Larger Models
Paper • 2411.03312 • Published • 6
DroidSpeak: Enhancing Cross-LLM Communication
Paper • 2411.02820 • Published • 1
Note: Efficient cross-LLM communication through KV & E cache passing.
Wave Network: An Ultra-Small Language Model
Paper • 2411.02674 • Published • 3
Thinking Forward and Backward: Effective Backward Planning with Large Language Models
Paper • 2411.01790 • Published • 1
Adaptive Length Image Tokenization via Recurrent Allocation
Paper • 2411.02393 • Published • 12
Note: Encodes the image with a fixed set of tokens, then adds new tokens recursively until a satisfactory compression level is reached.
Improving Steering Vectors by Targeting Sparse Autoencoder Features
Paper • 2411.02193 • Published • 1
How Far is Video Generation from World Model: A Physical Law Perspective
Paper • 2411.02385 • Published • 33
Tool Learning with Foundation Models
Paper • 2304.08354 • Published • 3
Spontaneous Emergence of Agent Individuality through Social Interactions in LLM-Based Communities
Paper • 2411.03252 • Published • 1
Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code Generation
Paper • 2405.20092 • Published • 1
The Road Less Scheduled
Paper • 2405.15682 • Published • 21
Squeezed Attention: Accelerating Long Context Length LLM Inference
Paper • 2411.09688 • Published • 1
On the Surprising Effectiveness of Attention Transfer for Vision Transformers
Paper • 2411.09702 • Published • 1
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Paper • 2411.13543 • Published • 18
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
Paper • 2411.15100 • Published • 5
DynaSaur: Large Language Agents Beyond Predefined Actions
Paper • 2411.01747 • Published • 17
Diffusion Self-Distillation for Zero-Shot Customized Image Generation
Paper • 2411.18616 • Published • 15
SketchAgent: Language-Driven Sequential Sketch Generation
Paper • 2411.17673 • Published • 18
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
Paper • 2411.02337 • Published • 35
Paper Copilot: A Self-Evolving and Efficient LLM System for Personalized Academic Assistance
Paper • 2409.04593 • Published • 23
CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases
Paper • 2408.03910 • Published • 15
An Empirical Study on LLM-based Agents for Automated Bug Fixing
Paper • 2411.10213 • Published • 1
Cognitive Map for Language Models: Optimal Planning via Verbally Representing the World Model
Paper • 2406.15275 • Published • 11
General-Purpose In-Context Learning by Meta-Learning Transformers
Paper • 2212.04458 • Published • 1
Scattered Forest Search: Smarter Code Space Exploration with LLMs
Paper • 2411.05010 • Published • 1
What's New in My Data? Novelty Exploration via Contrastive Generation
Paper • 2410.14765 • Published • 1
Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
Paper • 2406.20086 • Published • 5
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
Paper • 2410.21272 • Published • 1
Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS
Paper • 2411.18478 • Published • 32
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper • 2411.17465 • Published • 76
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper • 2411.14402 • Published • 42
Retrofitting (Large) Language Models with Dynamic Tokenization
Paper • 2411.18553 • Published • 1
Zero-Shot Tokenizer Transfer
Paper • 2405.07883 • Published • 5
ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings
Paper • 2305.11554 • Published • 2
Pandora: Towards General World Model with Natural Language Actions and Video States
Paper • 2406.09455 • Published • 15
Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning
Paper • 2309.08708 • Published • 3
From Words to Music: A Study of Subword Tokenization Techniques in Symbolic Music Generation
Paper • 2304.08953 • Published • 2
Parameter-Efficient Tuning with Special Token Adaptation
Paper • 2210.04382 • Published • 1
From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding
Paper • 2305.14571 • Published • 1
Multi-Word Tokenization for Sequence Compression
Paper • 2402.09949 • Published
OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining
Paper • 2311.08849 • Published • 5
Note: Think of your tool as a "new language": the embedding of the [CLS] token at the end of the description text can then be used to initialize the embedding of the new token. This could also stabilize the training process, I suppose.
Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models
Paper • 2403.00417 • Published • 2
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
Paper • 2410.23168 • Published • 24
Note: Unfortunately not impressive. Replacing linear layers with linear layers and calling the result "attention" between input and token is only a paraphrase. The 'scaling' idea is then just an extension of old ideas from Net2Net.
Configurable Foundation Models: Building LLMs from a Modular Perspective
Paper • 2409.02877 • Published • 27
Continuous Speech Synthesis using per-token Latent Diffusion
Paper • 2410.16048 • Published • 29
Puzzle: Distillation-Based NAS for Inference-Optimized LLMs
Paper • 2411.19146 • Published • 13
DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
Paper • 2411.19527 • Published • 10
GiT: Towards Generalist Vision Transformer through Universal Language Interface
Paper • 2403.09394 • Published • 25
Cut Your Losses in Large-Vocabulary Language Models
Paper • 2411.09009 • Published • 43
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Paper • 2411.14257 • Published • 9
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
Paper • 2411.14199 • Published • 28
Top-nσ: Not All Logits Are You Need
Paper • 2411.07641 • Published • 18
Analyzing The Language of Visual Tokens
Paper • 2411.05001 • Published • 22
VisualLens: Personalization through Visual History
Paper • 2411.16034 • Published • 16
Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's Reasoning Capability
Paper • 2411.19943 • Published • 55
Semantics and Spatiality of Emergent Communication
Paper • 2411.10173 • Published • 1
Searching Latent Program Spaces
Paper • 2411.08706 • Published • 1
Combining Induction and Transduction for Abstract Reasoning
Paper • 2411.02272 • Published • 1
Trace is the New AutoDiff -- Unlocking Efficient Optimization of Computational Workflows
Paper • 2406.16218 • Published • 2
Efficient Long Video Tokenization via Coordinated-based Patch Reconstruction
Paper • 2411.14762 • Published • 11
Training Large Language Models to Reason in a Continuous Latent Space
Paper • 2412.06769 • Published • 61
Moto: Latent Motion Token as the Bridging Language for Robot Manipulation
Paper • 2412.04445 • Published • 21
APOLLO: SGD-like Memory, AdamW-level Performance
Paper • 2412.05270 • Published • 38
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Paper • 2412.04454 • Published • 48
Video Token Merging for Long-form Video Understanding
Paper • 2410.23782 • Published • 2
CMT: A Memory Compression Method for Continual Knowledge Learning of Large Language Models
Paper • 2412.07393 • Published • 2
MaestroMotif: Skill Design from Artificial Intelligence Feedback
Paper • 2412.08542 • Published • 1
AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment
Paper • 2411.10606 • Published • 1
Human Expertise in Algorithmic Prediction
Paper • 2402.00793 • Published • 1
How to Synthesize Text Data without Model Collapse?
Paper • 2412.14689 • Published • 45
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
Paper • 2412.15213 • Published • 25
Compressed Chain of Thought: Efficient Reasoning Through Dense Representations
Paper • 2412.13171 • Published • 30
Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
Paper • 2412.12276 • Published • 14
Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
Paper • 2412.13194 • Published • 12
GenEx: Generating an Explorable World
Paper • 2412.09624 • Published • 84
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Paper • 2412.04432 • Published • 14