Sparsifying Transformer Models with Differentiable Representation Pooling
Michał Pietruszka, Łukasz Borchmann, Filip Graliński
2020 · DBLP: journals/corr/abs-2009-05169
arXiv.org · 3 Citations
TLDR
A novel method to sparsify attention in the Transformer model by learning to select the most-informative token representations, thus leveraging the model's information bottleneck with twofold strength.
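To make the idea of learning to select the most-informative token representations concrete, below is a minimal, hypothetical PyTorch sketch of a score-based soft token pooling layer. It is not the paper's actual operator: the class name `SoftTokenPooler`, the linear scorer, and the sigmoid-based soft top-k relaxation with a `temperature` parameter are all illustrative assumptions, shown only to indicate how a token-selection step can remain differentiable.

```python
# Hypothetical sketch (not the paper's exact method): a learned scorer assigns an
# importance score to each token representation, and a soft top-k weighting keeps
# approximately the k highest-scoring tokens while staying differentiable.
import torch
import torch.nn as nn


class SoftTokenPooler(nn.Module):
    def __init__(self, d_model: int, k: int, temperature: float = 0.1):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)  # learned per-token importance score
        self.k = k
        self.temperature = temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.scorer(x).squeeze(-1)                   # (batch, seq_len)
        # Soft relaxation of top-k: weight each token by how far its score is
        # above/below the k-th largest score; a hard top-k would block gradients.
        kth = scores.topk(self.k, dim=-1).values[..., -1:]    # (batch, 1)
        weights = torch.sigmoid((scores - kth) / self.temperature)
        # Gather the k highest-weighted tokens and rescale them by their soft
        # weights so gradients flow back into the scorer.
        idx = weights.topk(self.k, dim=-1).indices            # (batch, k)
        gathered = x.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        w = weights.gather(1, idx).unsqueeze(-1)
        return gathered * w                                   # (batch, k, d_model)


# Usage: shorten a 512-token sequence to 64 pooled representations.
pool = SoftTokenPooler(d_model=256, k=64)
out = pool(torch.randn(2, 512, 256))
print(out.shape)  # torch.Size([2, 64, 256])
```

Shortening the sequence this way reduces the cost of subsequent attention layers, which is the general motivation for sparsifying token representations; the specific pooling operator proposed in the paper may differ from this sketch.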
