
What Is Grouped-Query Attention?

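Grouped-query attention (GQA) is a variant of multi-head attention in which the query heads are divided into groups, and all query heads in a group share a single key head and a single value head. Because only keys and values are cached during autoregressive decoding, this shrinks the KV cache by the ratio of query heads to key-value heads and improves inference efficiency. GQA sits between full multi-head attention, where every query head has its own key and value head, and multi-query attention, where all query heads share one, balancing the quality of the former against the efficiency of the latter.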
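To make the sharing concrete, here is a minimal NumPy sketch of the idea. It is an illustration under stated assumptions, not a reference implementation: the function names `grouped_query_attention` and `kv_cache_elements`, the tensor layout (heads, sequence, head dimension), and the example sizes are all hypothetical, and causal masking, batching, and the learned projections are left out for brevity.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Attention where groups of query heads share one key/value head.

    q:    (n_q_heads, seq_len, head_dim)
    k, v: (n_kv_heads, seq_len, head_dim), with n_q_heads % n_kv_heads == 0
    No causal mask is applied in this sketch.
    """
    n_q_heads, _, head_dim = q.shape
    n_kv_heads = k.shape[0]
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads

    # Repeat each KV head so that query head i reads KV head i // group_size.
    k_shared = np.repeat(k, group_size, axis=0)
    v_shared = np.repeat(v, group_size, axis=0)

    # Scaled dot-product attention, computed independently per query head.
    scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(head_dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_shared

def kv_cache_elements(n_kv_heads, seq_len, head_dim):
    """Number of cached values per layer: one K and one V tensor."""
    return 2 * n_kv_heads * seq_len * head_dim

# Example sizes (assumed): 8 query heads sharing 2 key-value heads.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))
k = rng.standard_normal((2, 4, 16))
v = rng.standard_normal((2, 4, 16))
out = grouped_query_attention(q, k, v)
```

With these assumed sizes the output keeps one slice per query head, while the cache stores only two key-value heads instead of eight. Setting the number of key-value heads equal to the number of query heads recovers ordinary multi-head attention, and setting it to one gives multi-query attention.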

Further reading

Read more about grouped-query attention — articles and blogs from around the web: