Introduction
Cordelia Schmid is a computer vision researcher and research director at France's National Institute for Research in Digital Science and Technology (Inria). She holds a joint appointment at Google Research.
This is a cross-disciplinary, technical talk, suitable for researchers in the field.
About the event
In this talk, Cordelia Schmid will present recent progress on large-scale learning of multimodal video representations, starting with VideoBERT, Google's joint model for video and language, which repurposes the Bidirectional Encoder Representations from Transformers (BERT) model for multimodal data.
This model achieves state-of-the-art results on zero-shot prediction and video captioning. Schmid will go on to show how learning from instructional videos can be extended to general movies using cross-modal supervision.
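To make the idea concrete, here is a minimal sketch of a VideoBERT-style setup (illustrative only, not the published model): video-clip features are quantized offline into discrete "visual words" that share a vocabulary with text tokens, and a BERT-style transformer is trained to predict masked tokens over the joint sequence. All class names, vocabulary sizes and dimensions below are assumptions made for the sketch.

```python
# A minimal sketch (not the authors' code) of the VideoBERT idea: quantize
# video-clip features into discrete "visual words" and train a BERT-style
# masked-token model over the concatenated text + visual token sequence.
# All module names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn


class ToyVideoBERT(nn.Module):
    def __init__(self, text_vocab=30522, visual_vocab=1024, dim=256,
                 heads=4, layers=2, max_len=128):
        super().__init__()
        # Shared vocabulary: text tokens first, then visual "words"
        # produced offline (e.g. by k-means over clip features).
        self.vocab_size = text_vocab + visual_vocab
        self.embed = nn.Embedding(self.vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.lm_head = nn.Linear(dim, self.vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq) indices into the joint text+visual vocabulary
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        x = self.encoder(x)
        return self.lm_head(x)  # per-position logits for masked prediction


# Usage: mask a few positions and train with cross-entropy on the masked ones.
model = ToyVideoBERT()
tokens = torch.randint(0, model.vocab_size, (2, 32))
logits = model(tokens)
print(logits.shape)  # torch.Size([2, 32, 31546])
```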
Dr Schmid and her team use movie screenplays to train speech-to-action classifiers, and use these classifiers to mine video clips from thousands of hours of movies. She will demonstrate performance comparable to, or better than, that of fully supervised approaches to action classification. Next, she will present an approach to video question answering that relies on training from instructional videos and cross-modal supervision with a textual question-answer module.
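As a rough illustration of the mining step (an assumption-laden sketch, not the speakers' pipeline), a text classifier can be trained on screenplay sentences labelled with actions and then applied to transcribed speech; confidently scored speech segments yield weakly labelled video clips. The screenplay sentences, labels and threshold below are invented for illustration.

```python
# A minimal sketch (assumptions throughout, not the speakers' pipeline):
# train a text classifier on screenplay sentences paired with action labels,
# then apply it to subtitle/speech segments of movies and keep only the
# confidently-scored segments as weak action labels for the video clips.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical screenplay supervision: (sentence, action label)
screenplay = [
    ("He opens the door and steps inside.", "open_door"),
    ("She picks up the phone and dials.", "phone"),
    ("They shake hands and sit down.", "shake_hands"),
    ("He slams the door shut behind him.", "open_door"),
    ("She answers the phone on the first ring.", "phone"),
    ("The two men shake hands warmly.", "shake_hands"),
]
texts, labels = zip(*screenplay)

vectorizer = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(texts), labels)

# Apply the classifier to speech segments; each segment carries a timestamp
# that locates the corresponding clip in the movie.
subtitles = [
    (12.3, "Come in, the door is open."),
    (80.1, "Hold on, I'll get the phone."),
]
probs = clf.predict_proba(vectorizer.transform([t for _, t in subtitles]))
for (timestamp, text), p in zip(subtitles, probs):
    label = clf.classes_[p.argmax()]
    if p.max() > 0.4:  # keep only confidently mined clips
        print(f"{timestamp:6.1f}s  {label:12s}  {text}")
```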
The talk will conclude by presenting a recent video representation that is fully transformer-based. Schmid's group's Video Vision Transformer (ViViT) is shown to outperform the state of the art on video classification. Furthermore, it is flexible, allowing a trade-off between computational cost and accuracy across several different architectures.
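The tubelet-embedding idea at the core of such a model can be sketched as follows (sizes, names and the omission of positional embeddings are simplifications for illustration, not the published ViViT configuration): a video is cut into non-overlapping spatio-temporal patches, each patch becomes a token, and a standard transformer encoder with a classification head operates on the token sequence.

```python
# A minimal sketch of the tubelet-embedding idea behind a ViViT-style model
# (all sizes and names are illustrative assumptions; positional embeddings
# are omitted for brevity).
import torch
import torch.nn as nn


class TinyViViT(nn.Module):
    def __init__(self, num_classes=400, dim=192, heads=3, layers=4,
                 tubelet=(2, 16, 16)):
        super().__init__()
        # 3D conv with stride == kernel size: non-overlapping tubelets.
        self.to_tokens = nn.Conv3d(3, dim, kernel_size=tubelet, stride=tubelet)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):
        # video: (batch, channels, frames, height, width)
        tokens = self.to_tokens(video).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls.expand(tokens.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(x[:, 0])  # classify from the [CLS] token


clip = torch.randn(1, 3, 8, 64, 64)   # a tiny 8-frame clip
print(TinyViViT()(clip).shape)        # torch.Size([1, 400])
```

Varying the tubelet size and the encoder depth in a sketch like this is one simple way to see the kind of compute/accuracy trade-off the talk refers to: larger tubelets mean fewer tokens and cheaper attention, at the cost of coarser spatio-temporal detail.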