What Is Multi-Head Attention?

Multi-head attention runs several attention operations, called heads, in parallel. Each head has its own learned query, key, and value projections, so it can learn to focus on a different aspect of the input. The outputs of all heads are concatenated and passed through a final linear projection to form a richer representation. This lets transformers capture multiple types of relationships between tokens simultaneously.
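
To make the split-and-combine idea concrete, here is a minimal NumPy sketch of multi-head self-attention. It is an illustration under stated assumptions, not a reference implementation: the function name `multi_head_attention`, the weight names `w_q`, `w_k`, `w_v`, `w_o`, and the toy sizes are all hypothetical, the weights are random rather than learned, and masking, biases, and batching are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (seq_len, d_model).

    Hypothetical helper for illustration; all weights are (d_model, d_model).
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Project the input, then give each head its own slice of the projection.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                  # (heads, seq, d_head)

    # Concatenate the heads back to (seq_len, d_model), then mix them with w_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights


# Toy example with random (untrained) weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (4, 8)
print(weights.shape)  # (2, 4, 4)
```

Each head produces its own `(seq_len, seq_len)` attention pattern in `weights`, which is where the different "aspects" come from. The output has the same shape as the input, so the layer can be stacked.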
