2019 MultimodalTransformerforUnalign

(Tsai et al., 2019) ⇒ Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. (2019). “Multimodal Transformer for Unaligned Multimodal Language Sequences.” In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). doi:10.18653/v1/p19-1656

Notes

Cited By

http://scholar.google.com/scholar?q=%222019%22+Multimodal+Transformer+for+Unaligned+Multimodal+Language+Sequences

Quotes

Abstract

Human language is often multimodal, which comprehends a mixture of natural language, facial gestures, and acoustic behaviors. However, two major challenges in modeling such multimodal human language time-series data exist: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapt streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that correlated crossmodal signals are able to be captured by the proposed crossmodal attention mechanism in MulT.
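
The directional crossmodal attention named in the abstract can be illustrated with a minimal sketch. It assumes the standard scaled dot-product form, with queries taken from the target modality and keys and values taken from the source modality, so the two sequences may have different lengths and need no prior alignment. The function names, the toy dimensions, and the random projection matrices are hypothetical choices for illustration; the sketch uses a single head and leaves out positional embeddings, residual connections, and layer normalization, so it is not the authors' implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def crossmodal_attention(x_target, x_source, w_q, w_k, w_v):
    """Single-head attention from a target sequence onto a source sequence.

    Queries come from the target modality; keys and values come from the
    source modality, so the two sequence lengths are free to differ.
    """
    q = matmul(x_target, w_q)            # (T_target x d)
    k = matmul(x_source, w_k)            # (T_source x d)
    v = matmul(x_source, w_v)            # (T_source x d)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)              # (T_target x T_source)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rng, rows, cols):
    """Hypothetical stand-in for learned parameters or extracted features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
text = random_matrix(rng, 4, 3)    # 4 text steps, 3 features each (toy sizes)
audio = random_matrix(rng, 9, 5)   # 9 audio steps, 5 features each (toy sizes)
w_q = random_matrix(rng, 3, 2)
w_k = random_matrix(rng, 5, 2)
w_v = random_matrix(rng, 5, 2)

# Audio -> text direction: each text step attends over all audio steps.
adapted, weights = crossmodal_attention(text, audio, w_q, w_k, w_v)
print(len(adapted), len(adapted[0]))   # one adapted vector per text step
print(len(weights), len(weights[0]))   # one weight per (text step, audio step) pair
```

Each row of `weights` sums to one, so every text step receives a convex combination of the projected audio steps regardless of how many audio frames there are.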

Introduction

...

Figure 1: Example video clip from movie reviews. [Top]: Illustration of word-level alignment where video and audio features are averaged across the time interval of each spoken word. [Bottom]: Illustration of crossmodal attention weights between text (“spectacle”) and vision/audio.
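
The word-level alignment described in the caption can be sketched as follows: every audio or video frame whose timestamp falls inside a word's time interval is averaged into one vector for that word. The function name `align_to_words`, the half-open interval convention, and all timestamps and feature values are made up for illustration and are not taken from the paper.

```python
def align_to_words(frames, word_spans):
    """Average frame features over each spoken word's time interval.

    frames:     list of (timestamp_seconds, feature_vector)
    word_spans: list of (word, start_seconds, end_seconds)
    Returns a list of (word, mean_vector); mean_vector is None when no
    frame falls inside the word's interval.
    """
    aligned = []
    for word, start, end in word_spans:
        # Assumed convention: a frame belongs to a word if start <= t < end.
        inside = [vec for t, vec in frames if start <= t < end]
        if inside:
            mean = [sum(col) / len(inside) for col in zip(*inside)]
        else:
            mean = None
        aligned.append((word, mean))
    return aligned


# Toy data: four audio frames with two features each, and two words.
frames = [
    (0.0, [1.0, 2.0]),
    (0.25, [3.0, 4.0]),
    (0.5, [5.0, 6.0]),
    (0.75, [7.0, 10.0]),
]
word_spans = [("great", 0.0, 0.5), ("spectacle", 0.5, 1.0)]

aligned = align_to_words(frames, word_spans)
for word, mean in aligned:
    print(word, mean)
```

After this step every modality has one vector per word, which is the aligned setting; the crossmodal attention in the bottom panel instead works directly on the original frame rates.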

References

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2019 MultimodalTransformerforUnalign	Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov			Multimodal Transformer for Unaligned Multimodal Language Sequences				10.18653/v1/p19-1656		2019