2020 JointCommonsenseandRelationReas


Subject Headings:

Notes

Cited By

Quotes

Abstract

Exploiting relationships among objects has achieved remarkable progress in interpreting images or videos by natural language. Most existing methods resort to first detecting objects and their relationships, and then generating textual descriptions, which heavily depends on pre-trained detectors and leads to performance drop when facing problems of heavy occlusion, tiny-size objects and long-tail in object detection. In addition, the separate procedure of detecting and captioning results in semantic inconsistency between the pre-defined object / relation categories and the target lexical words. We exploit prior human commonsense knowledge for reasoning relationships between objects without any pre-trained detectors and reaching semantic coherency within one image or video in captioning. The prior knowledge (e.g., in the form of knowledge graph) provides commonsense semantic correlation and constraint between objects that are not explicit in the image and video, serving as useful guidance to build semantic graph for sentence generation. Particularly, we present a joint reasoning method that incorporates 1) commonsense reasoning for embedding image or video regions into semantic space to build semantic graph and 2) relational reasoning for encoding semantic graph to generate sentences. Extensive experiments on the MS-COCO image captioning benchmark and the MSVD video captioning benchmark validate the superiority of our method on leveraging prior commonsense knowledge to enhance relational reasoning for visual captioning.
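The abstract above outlines a two-stage pipeline: commonsense reasoning maps image or video regions into a semantic space to build a semantic graph, and relational reasoning encodes that graph for sentence generation. The following is a minimal, illustrative PyTorch sketch of such a pipeline; all module names, dimensions, and the soft-assignment and adjacency choices are assumptions made here for exposition, not the authors' implementation.

```python
# Illustrative sketch only: a simplified two-stage pipeline in the spirit of the
# abstract (commonsense reasoning -> semantic graph -> relational reasoning -> decoder).
# Module names, dimensions, and the soft-assignment scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonsenseReasoner(nn.Module):
    """Embed visual regions into a semantic space and score them against
    candidate knowledge-graph entities to build a soft semantic graph."""
    def __init__(self, vis_dim, sem_dim, num_entities):
        super().__init__()
        self.proj = nn.Linear(vis_dim, sem_dim)                 # region -> semantic space
        self.entity_emb = nn.Embedding(num_entities, sem_dim)   # prior entity embeddings

    def forward(self, regions):                          # regions: (B, R, vis_dim)
        sem = torch.tanh(self.proj(regions))             # (B, R, sem_dim)
        # soft assignment of each region to knowledge-graph entities
        logits = sem @ self.entity_emb.weight.t()        # (B, R, num_entities)
        assign = F.softmax(logits, dim=-1)
        nodes = assign @ self.entity_emb.weight          # grounded node features
        # fully connected soft adjacency as a stand-in for relation edges
        adj = F.softmax(nodes @ nodes.transpose(1, 2), dim=-1)
        return nodes, adj

class RelationalReasoner(nn.Module):
    """One graph-convolution step over the semantic graph."""
    def __init__(self, sem_dim):
        super().__init__()
        self.gc = nn.Linear(sem_dim, sem_dim)

    def forward(self, nodes, adj):                       # nodes: (B, R, D), adj: (B, R, R)
        return F.relu(self.gc(adj @ nodes))              # neighbourhood-aggregated features

class CaptionDecoder(nn.Module):
    """Mean-pool the graph encoding and decode a word sequence with a GRU."""
    def __init__(self, sem_dim, vocab_size):
        super().__init__()
        self.rnn = nn.GRU(sem_dim, sem_dim, batch_first=True)
        self.word_emb = nn.Embedding(vocab_size, sem_dim)
        self.out = nn.Linear(sem_dim, vocab_size)

    def forward(self, graph_feats, captions):            # captions: (B, T) token ids
        h0 = graph_feats.mean(dim=1, keepdim=True).transpose(0, 1)  # (1, B, D)
        emb = self.word_emb(captions)
        out, _ = self.rnn(emb, h0.contiguous())
        return self.out(out)                             # word logits, (B, T, vocab_size)

# Toy forward pass with random features
B, R, vis_dim, sem_dim, num_entities, vocab = 2, 5, 2048, 512, 100, 1000
regions = torch.randn(B, R, vis_dim)
caps = torch.randint(0, vocab, (B, 8))
nodes, adj = CommonsenseReasoner(vis_dim, sem_dim, num_entities)(regions)
graph = RelationalReasoner(sem_dim)(nodes, adj)
logits = CaptionDecoder(sem_dim, vocab)(graph, caps)
print(logits.shape)  # torch.Size([2, 8, 1000])
```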

Introduction

Most existing methods for image and video captioning (Donahue et al. 2015; Venugopalan et al. 2015b; 2015a; Pan et al. 2016) are based on the encoder-decoder framework which directly translates visual features into sentences, without exploiting high-level semantic entities (e.g., objects, attributes, and concepts) as well as relations among them. Recent work (Yao et al. 2018; Li and Jiang 2019; Yang et al. 2019) has shown promising efforts of using a scene graph that provides an understanding of semantic relationships for image captioning. These methods usually use pre-trained object and relationship detectors to extract a scene graph and then reason about object relationships in the graph. However, when facing detection challenges, such as heavy occlusion, tiny-size objects, and the long-tail problem, this paradigm might not accurately depict the objects and their relationships in images or videos, thus resulting in a degradation of captioning performance.
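As a point of reference for the encoder-decoder framework mentioned above, the sketch below shows a bare-bones captioner that directly translates a pooled visual feature into a word sequence, with no semantic entities or relations; it is a hypothetical PyTorch illustration, not any of the cited systems.

```python
# A minimal sketch of a plain encoder-decoder captioning baseline
# (global visual feature -> RNN language model); names and sizes are illustrative.
import torch
import torch.nn as nn

class EncoderDecoderCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden)         # project CNN feature to RNN state
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, visual_feat, tokens):
        # visual_feat: (B, feat_dim) pooled image/video feature; tokens: (B, T)
        h0 = torch.tanh(self.encode(visual_feat)).unsqueeze(0)   # (1, B, hidden)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(self.word_emb(tokens), (h0, c0))
        return self.out(out)                              # word logits, (B, T, vocab)

model = EncoderDecoderCaptioner()
logits = model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 10000])
```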

As we know, human beings can still describe images and videos by summarizing object relationships when some objects are not precisely identified or even absent, thanks to their remarkable reasoning ability based on prior knowledge. This inspires us to explore how to leverage prior knowledge to achieve relation reasoning in captioning, mimicking the human reasoning procedure. As an augmentation of the object relationships explicitly inferred from an image or a video, the prior knowledge about object relationships in the world provides information that is not available in the image or video. For example, as shown in Figure 1, the caption of “Several people waiting at a race holding umbrellas” will be generated via prior knowledge when describing a crowd of people standing along the road, even if the image shows no players or running actions (perhaps because the game is yet to begin). Clearly, the relationship of “people waiting race” is inferred from the commonsense relationship between “people” and “race” rather than from the image. Therefore, it is beneficial to integrate prior knowledge with visual information to reason relationships for generating accurate and reasonable captions.

In this paper, we utilize prior knowledge to guide the reasoning of object relationships for image and video captioning. The prior knowledge provides commonsense semantic correlations and constraints between objects to augment visual information extracted from images or videos. We employ the external knowledge graph in Visual Genome (Krishna et al. 2017), which represents a type of prior knowledge in which the nodes represent objects and the edges denote the relations between nodes.
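As an illustration of this nodes-and-edges view of prior knowledge, the following hypothetical snippet stores commonsense (subject, relation, object) triples and looks up relations between two concepts; the triples shown are invented examples, not actual Visual Genome data.

```python
# Illustrative only: one simple way to hold a Visual-Genome-style knowledge graph
# as (subject, relation, object) triples and query commonsense relations between
# two concepts. The triples below are hypothetical examples.
from collections import defaultdict

class KnowledgeGraph:
    def __init__(self, triples):
        self.relations = defaultdict(set)
        for subj, rel, obj in triples:
            self.relations[(subj, obj)].add(rel)

    def relations_between(self, subj, obj):
        """Return prior relations linking two concepts, if any."""
        return self.relations.get((subj, obj), set())

kg = KnowledgeGraph([
    ("people", "waiting at", "race"),
    ("people", "holding", "umbrella"),
    ("umbrella", "on", "road"),
])
print(kg.relations_between("people", "race"))  # {'waiting at'}
```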

Related Work

Recently, exploiting relationships between objects for image captioning has received increasing attention. Yao et al. (2018) employed two graph convolutional networks (GCNs) to reason semantic and spatial correlations among visual features of detected objects and their relationships to boost image captioning. Li and Jiang (2019) generated scene graphs of images by detectors, and built a hierarchical attention-based model to reason visual relationships for image captioning. Yang et al. (2019) incorporated language inductive bias into a GCN-based image captioning model to not only reason relationships via GCN but also represent visual information in the language domain via a scene graph auto-encoder for easier translation. These methods explicitly exploit high-level semantic concepts via the pre-defined scene graph of each image and the annotations of object and relationship locations in the image. Quite different from their methods, our method utilizes prior knowledge to generate a graph of latent semantic concepts in an image or a video, without requiring any pre-trained detectors. Moreover, our iterative algorithm enables the scene graph generation and captioning to be trained in an end-to-end manner, thus alleviating the semantic inconsistency between the pre-defined object/relation categories and the target lexical words.
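For concreteness, the sketch below shows one GCN-style message-passing step over an explicit scene graph given as subject-relation-object triplets, loosely in the spirit of the GCN captioning models discussed above; the aggregation rule, dimensions, and names are assumptions for illustration rather than any cited model.

```python
# Hedged sketch: one graph-convolution step over an explicit scene graph given as
# (subject_idx, object_idx) edges with relation features; aggregation rule assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneGraphGCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_subj = nn.Linear(2 * dim, dim)   # message from object+relation to subject
        self.w_obj = nn.Linear(2 * dim, dim)    # message from subject+relation to object

    def forward(self, node_feats, rel_feats, edges):
        # node_feats: (N, D), rel_feats: (E, D), edges: (E, 2) with [subj_idx, obj_idx]
        subj, obj = edges[:, 0], edges[:, 1]
        msg_to_subj = self.w_subj(torch.cat([node_feats[obj], rel_feats], dim=-1))
        msg_to_obj = self.w_obj(torch.cat([node_feats[subj], rel_feats], dim=-1))
        out = node_feats.index_add(0, subj, msg_to_subj)  # aggregate messages at each node
        out = out.index_add(0, obj, msg_to_obj)
        return F.relu(out)

# Toy graph: 3 nodes, 2 relation edges
nodes = torch.randn(3, 128)
rels = torch.randn(2, 128)
edges = torch.tensor([[0, 1], [1, 2]])
print(SceneGraphGCNLayer(128)(nodes, rels, edges).shape)  # torch.Size([3, 128])
```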

Some recent methods apply external knowledge graphs for image captioning. In (Aditya et al. 2018), commonsense reasoning is used to detect the scene description graph of an image, and the graph is directly translated into a sentence via a template-based language model. CNet-NIC (Zhou, Sun, and Honavar 2019) incorporates knowledge graphs to augment information extracted from images for captioning. Different from these methods that directly extract explicit semantic concepts from external knowledge, our method uses external knowledge to reason relationships between semantic concepts via joint commonsense and relation reasoning, without facing the “hallucinating” problem as stated by Rohrbach et al. (2018).

Some Visual Question Answering (VQA) methods (Berant et al. 2013; Fader, Zettlemoyer, and Etzioni 2014; Su et al. 2018; Mao et al. 2019) apply commonsense or relation reasoning. In these methods, almost the entire semantic graph is given by the question sentence, whereas in image and video captioning with reasoning the semantic graph must be built only from the input visual cues. The reasoning problem in image and video captioning is thus more challenging. To tackle this problem, we leverage prior knowledge to help the reasoning and propose a joint learning method to implement the reasoning.

References


Xiaoxun Zhang, Jiebo Luo, Jingyi Hou, Xinxiao Wu, Yayun Qi, and Yunde Jia (2020). “Joint Commonsense and Relation Reasoning for Image and Video Captioning.”