Two-part network with an encoder to generate representations of the embedded input data, and a decoder to generate the output sequence
Combines:
Does not process inputs sequentially, so all positions of a sequence can be processed in parallel during training (unlike recurrent models)
This allows much longer sequences to be processed without degradation, since attention connects any two positions directly rather than passing information through many recurrent steps
\[ \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
\[ \mathrm{MultiHeadAttn}(Q,K,V) = \mathrm{Concat}(head_1,\dots,head_H)W^O \]
\[ \text{where } head_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i) \]
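To make the formulas concrete, here is a minimal NumPy sketch of scaled dot-product and multi-head attention (single sequence, no batching or masking; all variable names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_q, n_k) similarity scores
    return softmax(scores, axis=-1) @ V   # weighted sum of value vectors

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    """Project Q/K/V per head, attend independently, concatenate, project with W_O."""
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

# Toy example: 4 positions, model dim 8, 2 heads of dim 4 (self-attention).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_Q = [rng.normal(size=(8, 4)) for _ in range(2)]
W_K = [rng.normal(size=(8, 4)) for _ in range(2)]
W_V = [rng.normal(size=(8, 4)) for _ in range(2)]
W_O = rng.normal(size=(8, 8))
print(multi_head_attention(x, x, x, W_Q, W_K, W_V, W_O).shape)  # (4, 8)
```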
\[ PE(pos, 2i) = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right) \]
\[ PE(pos, 2i + 1) = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right) \]
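A small sketch of how these sinusoidal encodings can be computed (assuming an even `d_model`; names are illustrative):

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    pos = np.arange(num_positions)[:, None]        # (num_positions, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                    # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                    # PE(pos, 2i + 1)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)  # (50, 16); added to the token embeddings before the first layer
```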
Are there any interesting tasks people have heard of that use this architecture?
What tasks allow for encoders and decoders to be used separately?
Paper proposes a Mixture-of-Embedding-Experts (MEE) model that learns a joint text-video embedding.
Handles heterogeneous data and missing input modalities during both training and testing.
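A rough sketch of the core idea of mixing per-modality "expert" embeddings with weights renormalised over the modalities that are actually present (a simplification of the paper's gating; all names are illustrative):

```python
import numpy as np

def mix_experts(expert_embeddings, weights, present):
    """Combine per-modality video embeddings, ignoring missing modalities.

    expert_embeddings: (num_experts, dim) one embedding per modality expert
    weights:           (num_experts,)     mixture weights, e.g. from the text query
    present:           (num_experts,)     boolean mask of available modalities
    """
    w = np.where(present, weights, 0.0)
    w = w / (w.sum() + 1e-8)          # renormalise over the available experts
    return w @ expert_embeddings      # weighted sum -> joint video embedding

experts = np.random.randn(3, 128)     # e.g. appearance, motion, audio experts
weights = np.array([0.5, 0.3, 0.2])
video_emb = mix_experts(experts, weights, present=np.array([True, False, True]))
print(video_emb.shape)  # (128,)
```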
Vision Transformers (ViTs) adapt the Transformer architecture to computer vision tasks.
Key components include: splitting the image into fixed-size patches, a linear patch embedding, added positional embeddings, a stack of Transformer encoder blocks, and a classification head (typically attached to a [CLS] token). A sketch of the patch-embedding step follows this list.
At large scale, ViTs can be more compute-efficient to train than comparable CNNs, especially for larger images. Self-attention captures global information and contextual relationships between patches, and ViTs have higher model capacity than CNNs.
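A minimal NumPy sketch of the patch-embedding front end described above (patchify, linear projection, [CLS] token, positional embeddings); the encoder itself is omitted and all names are illustrative:

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)             # (num_patches, patch_dim)

def embed_patches(image, patch_size, W_embed, cls_token, pos_embed):
    """Linearly project patches, prepend the [CLS] token, add positional embeddings."""
    tokens = patchify(image, patch_size) @ W_embed    # (num_patches, d_model)
    tokens = np.vstack([cls_token, tokens])           # prepend [CLS]
    return tokens + pos_embed                         # ready for the Transformer encoder

# Toy example: 32x32 RGB image, 8x8 patches, model dim 64.
d_model, p = 64, 8
img = np.random.rand(32, 32, 3)
W_embed = np.random.randn(p * p * 3, d_model) * 0.02
cls = np.zeros((1, d_model))
pos = np.random.randn((32 // p) ** 2 + 1, d_model) * 0.02
print(embed_patches(img, p, W_embed, cls, pos).shape)  # (17, 64): 16 patches + [CLS]
```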
Paper looks at improving throughput of ViT models without retraining, by gradually combining similar tokens using a fast and lightweight matching algorithm. Evaluated on ImageNet-1K and Kinetics-400 datasets.
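To illustrate the token-combining idea (not the paper's actual bipartite soft matching, just a slow greedy stand-in that repeatedly averages the most similar pair of tokens):

```python
import numpy as np

def merge_similar_tokens(tokens, r):
    """Greedily merge the r most similar token pairs by averaging.

    Simplified illustration only: the paper uses a fast bipartite soft matching
    rather than this O(r * n^2) greedy loop.
    """
    tokens = list(tokens)
    for _ in range(r):
        X = np.stack(tokens)
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        sim = Xn @ Xn.T                                # cosine similarity
        np.fill_diagonal(sim, -np.inf)                 # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (tokens[i] + tokens[j]) / 2           # average the closest pair
        tokens = [t for k, t in enumerate(tokens) if k not in (i, j)] + [merged]
    return np.stack(tokens)

toks = np.random.randn(197, 64)                        # e.g. a ViT-B/16 token count
print(merge_similar_tokens(toks, r=16).shape)          # (181, 64): 16 fewer tokens
```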