Teach Computers to Connect Videos and Text without Labeled Data: VideoCLIP
A groundbreaking way to do self-supervision on videos and text. It's like the BERT moment for video-text understanding.

Tags: videoclip, contrastive-learning, video-transformer

0:00 Intro
3:31 Retrieval-augmented training
5:07 Video and text encoding
8:48 Contrastive loss
12:09 Zero-shot transfer to end tasks
14:05 Experiment results
18:09 What did we learn

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Abstract: We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest-neighbor retrieval. Our experiments on a diverse series of downstream tasks,
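The symmetric contrastive objective mentioned in the abstract and the 8:48 chapter can be sketched as follows. This is a minimal NumPy sketch, not VideoCLIP's actual implementation: it uses only in-batch negatives, whereas the paper additionally mines hard negatives via nearest-neighbor retrieval, and the function name and temperature value are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of
    video/text embeddings; row i of each matrix is a positive pair.
    Hypothetical sketch -- in-batch negatives only."""
    # L2-normalize so the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # pairwise similarity matrix

    def nll_of_diagonal(m):
        # Log-softmax over each row; diagonal entries are the positives.
        m = m - m.max(axis=1, keepdims=True)  # numerical stability
        log_p = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (nll_of_diagonal(logits) + nll_of_diagonal(logits.T))
```

Minimizing this loss pulls temporally overlapping video-text pairs together while pushing apart mismatched pairs in the batch: the loss is near zero when each video embedding is closest to its own caption, and around log(batch size) when embeddings are unrelated.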