DeBERTa: Decoding-enhanced BERT with Disentangled Attention (Machine Learning Paper Explained)
#deberta #bert #huggingface

DeBERTa by Microsoft is the next iteration of BERT-style self-attention transformer models, surpassing RoBERTa and setting a new state of the art on multiple NLP tasks. DeBERTa brings two key improvements: First, it treats content and position information separately in a new form of disentangled attention mechanism. Second, it relies on relative positional encodings throughout the base of the transformer and provides absolute positional encodings only at the very end. The resulting model is both more accurate on downstream tasks and needs fewer pretraining steps to reach good accuracy. Models are also available on Hugging Face and on GitHub.

OUTLINE:
0:00 Intro & Overview
2:15 Position Encodings in Transformer's Attention Mechanism
9:55 Disentangling Content & Position Information in Attention
21:35 Disentangled Query & Key Construction in the Attention Formula
25:50 Efficient Relative Position Encodings
28:40 Enhanced Mask Decoder using Absolute Positions
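To make the disentangled attention described above concrete, here is a minimal, illustrative PyTorch sketch of how a single head's attention score can combine content-to-content, content-to-position, and position-to-content terms with clamped relative distances. The names (disentangled_scores, rel_embed, w_qc, etc.) are made up for illustration; this is a sketch of the idea, not the paper's or Hugging Face's actual implementation.

```python
# Sketch of DeBERTa-style disentangled attention scores (one head, no batching).
import torch

def delta(i, j, k):
    """Clamp the relative distance i - j into the index range [0, 2k)."""
    d = i - j
    if d <= -k:
        return 0
    if d >= k:
        return 2 * k - 1
    return d + k

def disentangled_scores(h, rel_embed, w_qc, w_kc, w_qr, w_kr, k):
    """h: (seq_len, d) content states; rel_embed: (2k, d) relative position table."""
    n, d = h.shape
    q_c, k_c = h @ w_qc, h @ w_kc                   # content queries / keys
    q_r, k_r = rel_embed @ w_qr, rel_embed @ w_kr   # position queries / keys
    scores = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            scores[i, j] = (
                q_c[i] @ k_c[j]                     # content-to-content
                + q_c[i] @ k_r[delta(i, j, k)]      # content-to-position
                + k_c[j] @ q_r[delta(j, i, k)]      # position-to-content
            )
    return scores / (3 * d) ** 0.5                  # scale by sqrt(3d)

# Toy usage with random weights.
d, n, k = 8, 5, 4
h = torch.randn(n, d)
rel = torch.randn(2 * k, d)
ws = [torch.randn(d, d) for _ in range(4)]
attn = torch.softmax(disentangled_scores(h, rel, *ws, k), dim=-1)
```

Since the pretrained models are on the Hugging Face hub, a quick way to try one (assuming the transformers library is installed; "microsoft/deberta-base" is the base checkpoint) might look like this:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModel.from_pretrained("microsoft/deberta-base")

inputs = tokenizer("DeBERTa disentangles content and position.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```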