Fastformer: Additive Attention Can Be All You Need (Machine Learning Research Paper Explained)
Tags: attention, transformer, fastformer

Transformers have become the dominant model class for large-scale data in recent years, but their quadratic complexity in sequence length has plagued them until now. Fastformer claims to be the fastest and most performant linear attention variant, able to consume long contexts at once. This is achieved by a combination of additive attention and elementwise products. While initial results look promising, I have my doubts about the architecture.

OUTLINE:
0:00 - Intro & Outline
2:15 - Fastformer description
5:20 - Baseline: Classic Attention
10:00 - Fastformer architecture
12:50 - Additive Attention
18:05 - Query-Key elementwise multiplication
21:35 - Redundant modules in Fastformer
25:00 - Problems with the architecture
27:30 - Is this even attention?
32:20 - Experimental Results
34:50 - Conclusion & Comments

Paper:

Abstract:
Transformer is a powerful model for text understanding…
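To make the "additive attention plus elementwise products" idea concrete, here is a minimal single-head sketch in NumPy. The scoring vectors `w_q` and `w_k` and the function name are my own placeholders; the actual model also splits into multiple heads, applies a final linear transform, and adds a residual connection to the queries, all omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fastformer_attention(Q, K, V, w_q, w_k):
    """Additive-attention sketch: O(N*d) instead of O(N^2*d).

    Q, K, V: (N, d) query/key/value matrices for one head.
    w_q, w_k: (d,) learned scoring vectors (hypothetical names).
    """
    d = Q.shape[1]
    # 1. Additive attention: collapse all queries into ONE global query.
    alpha = softmax(Q @ w_q / np.sqrt(d))   # (N,) per-token weights
    q_global = alpha @ Q                    # (d,) weighted sum of queries
    # 2. Elementwise product of the global query with every key.
    P = K * q_global                        # (N, d)
    # 3. Additive attention again: collapse modulated keys into one global key.
    beta = softmax(P @ w_k / np.sqrt(d))    # (N,)
    k_global = beta @ P                     # (d,)
    # 4. Elementwise product of the global key with every value.
    return V * k_global                     # (N, d)
```

Note that each token only interacts with two global summary vectors, never with other tokens directly, which is exactly the "is this even attention?" question raised in the outline.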