---
license: gpl-3.0
---

<div align="center">

<h1>Mamba-YOLO-World</h1>

<h3>Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection</h3>

Haoxuan Wang, Qingdong He, Jinlong Peng, Hao Yang, Mingmin Chi, Yabiao Wang

<br>
<br>

[![arxiv paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2409.08513)

</div>

## Abstract

Open-vocabulary detection (OVD) aims to detect objects beyond a predefined set of categories. As a pioneering model incorporating the YOLO series into OVD, YOLO-World is well-suited for scenarios prioritizing speed and efficiency. However, its performance is hindered by its neck feature-fusion mechanism, which incurs quadratic complexity and limits the guided receptive fields. To address these limitations, we present Mamba-YOLO-World, a novel YOLO-based OVD model employing the proposed MambaFusion Path Aggregation Network (MambaFusion-PAN) as its neck architecture. Specifically, we introduce an innovative State Space Model-based feature-fusion mechanism consisting of a Parallel-Guided Selective Scan algorithm and a Serial-Guided Selective Scan algorithm, with linear complexity and globally guided receptive fields. It leverages multi-modal input sequences and Mamba hidden states to guide the selective scanning process. Experiments demonstrate that our model outperforms the original YOLO-World on the COCO and LVIS benchmarks in both zero-shot and fine-tuning settings while maintaining comparable parameters and FLOPs. Additionally, it surpasses existing state-of-the-art OVD methods with fewer parameters and FLOPs.
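
To make the linear-complexity claim concrete, below is a minimal NumPy sketch of a selective scan and of text tokens guiding it by being scanned serially ahead of the image tokens. This is a toy illustration only: the function names, shapes, and guidance scheme are our assumptions for exposition, not the paper's actual MambaFusion-PAN or Guided Selective Scan implementations.

```python
import numpy as np

def selective_scan(x, A, B, C):
    """Toy 1-D selective scan: h_t = A_t * h_{t-1} + B_t * x_t, y_t = C_t * h_t.

    A single pass over a length-L sequence runs in O(L) time, in contrast
    to the O(L^2) cost of attention-style cross-modal fusion.
    x: (L, D) input sequence; A, B, C: (L, D) input-dependent parameters.
    """
    L, D = x.shape
    h = np.zeros(D)                # hidden state carried along the scan
    y = np.empty_like(x)
    for t in range(L):             # one linear pass, no pairwise interactions
        h = A[t] * h + B[t] * x[t]
        y[t] = C[t] * h
    return y

def serial_guided_scan(img_seq, txt_seq, A, B, C):
    """Hypothetical serial guidance: scan the text tokens first so the
    hidden state carries text information into every image position,
    giving each image token a globally text-guided receptive field."""
    seq = np.concatenate([txt_seq, img_seq], axis=0)
    y = selective_scan(seq, A, B, C)
    return y[len(txt_seq):]        # keep only the image positions
```

The key point the sketch conveys is that guidance rides along in the recurrent hidden state, so fusing a text sequence with an image sequence costs one linear pass rather than a quadratic cross-attention map.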

For our code and more information, please visit https://github.com/Xuan-World/Mamba-YOLO-World

<img src="visualization.png">