Meta’s New Multi-Token Attention!
Title: Multi-Token Attention
Paper Link: https://arxiv.org/pdf/2504.00927

Innovations

This paper presents a new attention method, Multi-Token Attention (MTA), which allows the model to adjust attention weights based on multiple query and key vectors simultaneously. By applying convolution operations on queries, keys, and attention heads, MTA enables nearby queries and keys to influence each other, resulting in more …
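
To make the idea of convolving over queries and keys concrete, here is a minimal single-head NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function name `key_query_conv_attention`, the kernel shape, and the choice to apply the convolution to the logits before the softmax are all assumptions made here, and the convolution across attention heads is left out.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf end up with weight 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def key_query_conv_attention(q, k, v, kernel):
    """Hypothetical sketch: causal attention with a 2D convolution on the logits.

    q, k, v : arrays of shape (T, d)
    kernel  : array of shape (cq, ck), ck odd (assumption of this sketch).
              Row a looks a steps back along the query axis,
              column b looks (b - ck // 2) steps along the key axis.
    """
    T, d = q.shape
    cq, ck = kernel.shape
    half = ck // 2

    logits = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((T, T), dtype=bool))

    # Zero out future positions so they cannot leak in through the convolution.
    masked = np.where(causal, logits, 0.0)

    # Pad past-only on the query axis and symmetrically on the key axis.
    padded = np.pad(masked, ((cq - 1, 0), (half, half)))

    # mixed[i, j] = sum_{a, b} kernel[a, b] * masked[i - a, j + b - half]
    mixed = np.zeros_like(logits)
    for a in range(cq):
        for b in range(ck):
            rows = slice(cq - 1 - a, cq - 1 - a + T)
            cols = slice(b, b + T)
            mixed += kernel[a, b] * padded[rows, cols]

    # Re-apply the causal mask before normalising.
    mixed = np.where(causal, mixed, -np.inf)
    weights = softmax(mixed, axis=-1)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(6, 8)) for _ in range(3))
    kernel = rng.normal(size=(3, 5)) * 0.1
    kernel[0, 2] += 1.0  # start close to ordinary attention
    out, weights = key_query_conv_attention(q, k, v, kernel)
    print(out.shape, weights.shape)
```

With a kernel that is 1 at position `[0, ck // 2]` and 0 elsewhere, the sketch reduces to ordinary causal softmax attention, which is a convenient sanity check. Any other kernel lets the score for a query-key pair draw on the scores of neighbouring queries and keys.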