Mixture of Attention Schemes (MoAS): Learning to Route Between MHA, GQA, and MQA

arXiv:2512.20650v1 (cs.AI)

Abstract: The choice of attention mechanism in Transformer models involves a critical trade-off between modeling quality and inference efficiency. Multi-Head Attention (MHA) offers the best quality but suffers from large Key-Value (KV) cache memory requirements during inference. Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce memory usage, but often at the cost of model performance. In this work, we propose Mixture of Attention Schemes (MoAS), a novel architecture that dynamically selects the optimal attention scheme (MHA, GQA, or MQA) for each token via a learned router. We demonstrate that dynamic routing performs better than static averaging of schemes and achieves performance competitive with the MHA baseline while offering potential for conditional compute efficiency. Experimental results on WikiText-2 show that dynamic routing (validation loss 2.3074) outperforms a static mixture (2.3093), validating the effectiveness of the proposed method. Our code is available at https://github.com/Esmail-ibraheem/Mixture-of-Attention-Schemes-MoAS.

Source: cs.AI updates on arXiv.org
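To make the routing idea concrete, below is a minimal, self-contained PyTorch sketch of what the abstract describes; it is not taken from the linked repository. The module names (SimpleGQA, MoASBlock), head counts, and the soft-mixture routing are illustrative assumptions: the three schemes are modeled as one attention module whose number of key/value heads varies (all heads for MHA, a few for GQA, one for MQA), and a learned linear router mixes their outputs per token.

```python
# Minimal sketch of the MoAS idea from the abstract (not the authors' code).
# All names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGQA(nn.Module):
    """Causal attention with n_kv_heads key/value heads:
    n_kv_heads == n_heads -> MHA, 1 < n_kv_heads < n_heads -> GQA, 1 -> MQA."""
    def __init__(self, d_model, n_heads, n_kv_heads):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.d_head)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head)
        self.o_proj = nn.Linear(n_heads * self.d_head, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.d_head).transpose(1, 2)
        # Each group of query heads shares one KV head; repeat KV to match.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))


class MoASBlock(nn.Module):
    """Mixes MHA, GQA, and MQA outputs with per-token learned routing weights."""
    def __init__(self, d_model=256, n_heads=8, gqa_kv_heads=2):
        super().__init__()
        self.mha = SimpleGQA(d_model, n_heads, n_kv_heads=n_heads)       # full KV cache
        self.gqa = SimpleGQA(d_model, n_heads, n_kv_heads=gqa_kv_heads)  # grouped KV heads
        self.mqa = SimpleGQA(d_model, n_heads, n_kv_heads=1)             # single KV head
        self.router = nn.Linear(d_model, 3)  # per-token logits over the 3 schemes

    def forward(self, x):
        # Soft, token-dependent mixture (dynamic routing), as opposed to a
        # static average with fixed weights for every token.
        w = F.softmax(self.router(x), dim=-1)                            # (B, T, 3)
        outs = torch.stack([self.mha(x), self.gqa(x), self.mqa(x)], dim=-1)
        return (outs * w.unsqueeze(2)).sum(dim=-1)


x = torch.randn(2, 16, 256)
print(MoASBlock()(x).shape)  # torch.Size([2, 16, 256])
```

Under these assumptions, replacing the soft mixture with a hard top-1 choice per token is what would unlock the conditional-compute efficiency the abstract mentions, since only the selected scheme's projections and KV entries would need to be computed for that token.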
