CoAtNet: Marrying Convolution and Attention for All Data Sizes
Paper abstract: Transformers have attracted increasing interest in computer vision, but they still fall behind state-of-the-art convolutional networks. In this work, we show that while Transformers tend to have larger model capacity, their generalization can be worse than convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths from both architectures, we present CoAtNets (pronounced "coat" nets), a family of hybrid models built from two key insights: (1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention; (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity and efficiency. Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints across various datasets: Without extra data, CoAtNet achieves 86.0% ImageNet top-1 accuracy.
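
To make insight (1) concrete, below is a minimal sketch (not the authors' code) of relative attention in one dimension with a single head: the attention logits combine the input-adaptive content term x_i . x_j with an input-independent, translation-equivariant bias w[i-j], which plays the role of a depthwise-convolution kernel, and a single softmax mixes the two. The function name, sequence length, and channel dimension are illustrative assumptions.

# Minimal NumPy sketch of relative attention: attention logits augmented with a
# learned bias indexed only by the relative offset i-j (depthwise-conv-like term).
import numpy as np

def relative_self_attention(x, w_rel):
    """x: (L, d) tokens; w_rel: (2L-1,) learned bias indexed by relative offset i-j."""
    L, d = x.shape
    logits = x @ x.T / np.sqrt(d)                              # content term x_i . x_j
    offsets = np.arange(L)[:, None] - np.arange(L)[None, :]    # i - j in [-(L-1), L-1]
    logits = logits + w_rel[offsets + (L - 1)]                 # add static relative bias w_{i-j}
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = probs / probs.sum(axis=-1, keepdims=True)          # softmax over j
    return probs @ x                                           # y_i = sum_j A_ij x_j

# Toy usage: 6 tokens, 4 channels, randomly initialized relative biases.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
w_rel = rng.standard_normal(2 * 6 - 1)
y = relative_self_attention(x, w_rel)
print(y.shape)  # (6, 4)

Because the bias depends only on i-j, this term is shared across positions like a convolution kernel, while the content term remains input-adaptive like ordinary self-attention.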