Implement Attention Mechanism

medium

Given three tensors two-dimensional: `query`, `key`, and `value`, you need to implement the attention mechanism used in transformer architectures. The attention mechanism calculates scores based on the dot product of the query and key, scales the scores by dividing by the square root of the depth (last dimension size) of the query, applies softmax to the scores, and then uses these scores to take a weighted sum of the value tensor.

First, compute the dot product of `query` and the transpose of `key`. Then, scale the scores by dividing by the square root of the depth of the `query`. Apply softmax on the scaled scores to get the weights. Finally, compute the dot product of the weights and the `value` tensor to get the output.

\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V

Where:

$Q$ : Query matrix. Represents the set of queries to which we want to pay attention over the keys.
$K$ : Key matrix. Represents the set of keys to which the queries need to attend.
$V$ : Value matrix. Contains the actual information/content from the input sequence that we want to focus on based on the attention scores.
$d_k$ : Depth or dimensionality of the queries and keys. Used to scale down the dot product in the attention mechanism to ensure stability.

Examples:

1.0

2.0

↓

2.1070

2.0070

1.0

↓

2.5

3.5

4.5

2.5

3.5

4.5

Implement Attention Mechanism

Examples:

Hint

torch.matmul

F.softmax