7 Interesting Papers from ACM MM 2019
This post is translated from a 新智元导读 (AI Era digest) article.
MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video

Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, Tat-Seng Chua

Personalized recommendation plays a central role in many online content sharing platforms. To provide a quality micro-video recommendation service, it is crucial to consider the interactions between users and items (i.e., micro-videos) as well as the item contents from various modalities (e.g., visual, acoustic, and textual). Existing works on multimedia recommendation largely exploit multi-modal contents to enrich item representations, while less effort is made to leverage the information interchange between users and items to enhance user representations and further capture users' fine-grained preferences on different modalities. In this paper, we propose to exploit user-item interactions to guide the representation learning in each modality, and further personalize micro-video recommendation. We design a Multi-modal Graph Convolution Network (MMGCN) framework built upon the message-passing idea of graph neural networks, which can yield modal-specific representations of users and micro-videos to better capture user preferences. Specifically, we construct a user-item bipartite graph in each modality, and enrich the representation of each node with the topological structure and features of its neighbors. Through extensive experiments on three publicly available datasets, Tiktok, Kwai, and MovieLens, we demonstrate that our proposed model significantly outperforms state-of-the-art multi-modal recommendation methods.
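The bipartite message-passing idea behind MMGCN can be sketched in a few lines. The code below is an illustrative simplification, not the authors' implementation: the mean aggregation, the additive combination step, and the plain-mean fusion across modalities are all assumptions made for the sketch.

```python
import numpy as np

def propagate(user_emb, item_emb, interactions):
    """One message-passing step on a user-item bipartite graph:
    each node adds the mean of its neighbors' features to its own."""
    new_user = user_emb.copy()
    new_item = item_emb.copy()
    for u in range(user_emb.shape[0]):
        nbrs = np.where(interactions[u] == 1)[0]      # items user u interacted with
        if len(nbrs) > 0:
            new_user[u] += item_emb[nbrs].mean(axis=0)
    for i in range(item_emb.shape[0]):
        nbrs = np.where(interactions[:, i] == 1)[0]   # users who interacted with item i
        if len(nbrs) > 0:
            new_item[i] += user_emb[nbrs].mean(axis=0)
    return new_user, new_item

def mmgcn_sketch(user_emb, modal_item_embs, interactions):
    """Run one propagation layer per modality (visual / acoustic / textual),
    then fuse the modal-specific user representations (here: a plain mean)."""
    users, items = {}, {}
    for modality, item_emb in modal_item_embs.items():
        users[modality], items[modality] = propagate(user_emb, item_emb, interactions)
    fused_user = np.mean(list(users.values()), axis=0)
    return fused_user, items
```

Stacking several such propagation layers would enrich each node with multi-hop neighborhood information, which is the core of the message-passing design the abstract describes.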
Routing Micro-videos via A Temporal Graph-guided Recommendation Framework

Yongqi Li, Meng Liu, Jianhua Yin, Chaoran Cui, Xin-Shun Xu, Liqiang Nie

In the past few years, micro-videos have become the dominant trend in the social media era. Meanwhile, as the number of micro-videos grows, users are frequently overwhelmed by videos they are not interested in. Despite the success of existing recommendation systems developed for various communities, they cannot be applied to routing micro-videos, since users on micro-video platforms have unique characteristics: diverse and dynamic interest, multi-level interest, as well as true negative samples. To address these problems, we present a temporal graph-guided recommendation system. In particular, we first design a novel graph-based sequential network to simultaneously model users' dynamic and diverse interest. Similarly, uninterested information can be captured from users' true negative samples. Beyond that, we introduce users' multi-level interest into our recommendation model via a user matrix that is able to learn the enhanced representation of users' interest. Finally, the system can make accurate recommendations by considering the above characteristics. Experimental results on two public datasets verify the effectiveness of our proposed model.
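The "true negative samples" idea can be illustrated with a pairwise ranking loss. The sketch below is hypothetical and much simpler than the paper's model: a BPR-style objective contrasts liked items against items the user explicitly skipped, rather than against randomly sampled unobserved items.

```python
import numpy as np

def bpr_with_true_negatives(user_vec, pos_items, neg_items):
    """Pairwise ranking loss where the negatives are micro-videos the user
    actively skipped ('true negatives') instead of random unobserved items."""
    loss = 0.0
    for p in pos_items:
        for n in neg_items:
            margin = user_vec @ p - user_vec @ n      # positive should score higher
            loss += -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)
    return loss / (len(pos_items) * len(neg_items))
```

With true negatives, the model is pushed away from content the user demonstrably dislikes, which random negative sampling cannot guarantee.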
Jiaxin Wu, Sheng-hua Zhong, Yan Liu

Multi-video summarization, which tries to generate a single summary for a collection of videos, is an important task in dealing with ever-growing video data. In this paper, we are the first to propose a graph convolutional network for multi-video summarization. The novel network measures the importance and relevance of each video shot within its own video as well as within the whole video collection. An important-node sampling method is proposed to emphasize the effective features that are more likely to be selected for the final video summary. Two strategies are integrated into the network to solve the inherent class-imbalance problem in the task of video summarization. A loss regularization for diversity is used to encourage a diverse summary to be generated. Extensive experiments are carried out, and in comparison with traditional and recent graph models and state-of-the-art video summarization methods, our proposed model is effective in generating a representative summary for multiple videos with good diversity. It also achieves state-of-the-art performance on two standard video summarization datasets.
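The two training-time ideas, class-imbalance handling and diversity regularization, can be illustrated as follows. Both functions are generic sketches under assumed definitions (a re-weighted binary cross-entropy, and a mean pairwise cosine-similarity penalty on the selected shots); the paper's exact strategies may differ.

```python
import numpy as np

def weighted_bce(scores, labels):
    """Class-imbalance sketch: up-weight the positive (key-shot) class,
    which is rare relative to non-summary shots."""
    pos_w = (labels == 0).sum() / max((labels == 1).sum(), 1)
    p = 1.0 / (1.0 + np.exp(-scores))                 # sigmoid over shot scores
    w = np.where(labels == 1, pos_w, 1.0)
    return -(w * (labels * np.log(p) + (1 - labels) * np.log(1 - p))).mean()

def diversity_regularizer(selected_feats):
    """Diversity sketch: mean pairwise cosine similarity among the shots
    chosen for the summary; minimizing it discourages redundancy."""
    f = selected_feats / np.linalg.norm(selected_feats, axis=1, keepdims=True)
    sim = f @ f.T
    n = len(f)
    return sim[~np.eye(n, dtype=bool)].mean()
```

Adding the regularizer to the classification loss trades off representativeness against redundancy across the whole video collection.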
[Deep Adversarial Graph Attention Convolution Network for Text-Based Person Search](https://dl.acm.org/citation.cfm?id=3350991)
Jiawei Liu, Zheng-Jun Zha, Richang Hong, Meng Wang, Yongdong Zhang

The newly emerging text-based person search task aims at retrieving the target pedestrian by a natural-language query giving a fine-grained description of a pedestrian. Compared to image/video-based person search, i.e., person re-identification, it is more applicable in reality since it does not require an image/video query of a pedestrian. In this work, we propose a novel deep adversarial graph attention convolution network (A-GANet) for text-based person search. The A-GANet exploits both textual and visual scene graphs, consisting of object properties and relationships, from the text queries and gallery images of pedestrians, towards learning informative textual and visual representations. It learns an effective joint textual-visual latent feature space in an adversarial learning manner, bridging the modality gap and facilitating pedestrian matching. Specifically, the A-GANet consists of an image graph attention network, a text graph attention network and an adversarial learning module. The image and text graph attention networks are designed with a novel graph attention convolution layer, which effectively exploits graph structure in the learning of textual and visual features, leading to precise and discriminative representations. The adversarial learning module is developed with a feature transformer and a modality discriminator to learn a joint textual-visual feature space for cross-modality matching. Extensive experimental results on two challenging benchmarks, i.e., the CUHK-PEDES and Flickr30k datasets, demonstrate the effectiveness of the proposed method.
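A graph attention convolution layer of the kind described can be sketched as a GAT-style weighted aggregation over scene-graph neighbors. This is a generic illustration, not the A-GANet layer itself; the weight matrix `W`, the attention vector `a`, and the assumption of self-loops in the adjacency matrix are all placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def graph_attention_layer(node_feats, adj, W, a):
    """Attention-weighted aggregation over a scene graph: each node scores
    its neighbors with a learned attention vector, then averages them in."""
    h = node_feats @ W                                # linear projection
    out = np.zeros_like(h)
    for i in range(len(h)):
        nbrs = np.where(adj[i] == 1)[0]               # assumes self-loops exist
        scores = np.array([a @ np.concatenate([h[i], h[j]]) for j in nbrs])
        alpha = softmax(scores)                       # attention over neighbors
        out[i] = (alpha[:, None] * h[nbrs]).sum(axis=0)
    return out
```

Running the same layer over the textual scene graph and the visual scene graph yields the two modality-specific representations that the adversarial module then aligns.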
DoT-GNN: Domain-Transferred Graph Neural Network for Group Re-identification

Ziling Huang, Zheng Wang, Wei Hu, Chia-Wen Lin, Shin'ichi Satoh

Most person re-identification (ReID) approaches focus on retrieving a person-of-interest from a database of collected individual images. In addition to the individual ReID task, matching a group of persons across different camera views also plays an important role in surveillance applications. This Group Re-identification (G-ReID) task is very challenging since we face obstacles not only from the appearance changes of individuals, but also from changes in group layout and membership. To obtain a robust representation of the group image, we design a Domain-Transferred Graph Neural Network (DoT-GNN) method. Its merits are three-fold: 1) Transferred style. Due to the lack of training samples, we transfer the labeled ReID dataset to the G-ReID dataset style, and feed the transferred samples to a deep learning model. Leveraging the strength of deep learning models, we obtain a discriminative individual feature model. 2) Graph generation. We treat a group as a graph, where each node denotes an individual feature and each edge represents the relation between a pair of individuals. We propose a graph generation strategy to create sufficient graph samples. 3) Graph neural network. Employing the generated graph samples, we train the GNN to acquire graph features that are robust to large graph variations. The key to the success of DoT-GNN is that the transferred graph addresses the challenge of appearance changes, while the graph representation in the GNN overcomes the challenge of layout and membership changes. Extensive experimental results demonstrate the effectiveness of our approach, outperforming the state-of-the-art method by 1.8% CMC-1 on the Road Group dataset and 6.0% CMC-1 on the DukeMTMC dataset, respectively.
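The graph-generation step, treating a group as a graph of individual features and augmenting it against membership changes, might look like the following sketch. The fully connected topology, cosine-similarity edge weights, and member-dropping augmentation are assumptions made for illustration; the paper's actual strategy may differ.

```python
import numpy as np

def group_to_graph(individual_feats):
    """Treat a group image as a fully connected graph: nodes are individual
    feature vectors, edge weights are pairwise cosine similarities."""
    f = individual_feats / np.linalg.norm(individual_feats, axis=1, keepdims=True)
    adj = f @ f.T
    np.fill_diagonal(adj, 0.0)                        # no self-edges
    return adj

def augment_graphs(individual_feats, n_samples, rng):
    """Graph-generation sketch: create extra training graphs by randomly
    dropping one member, mimicking membership changes across cameras."""
    graphs = []
    n = len(individual_feats)
    for _ in range(n_samples):
        keep = rng.choice(n, size=max(2, n - 1), replace=False)
        graphs.append(group_to_graph(individual_feats[keep]))
    return graphs
```

Training the GNN on many such perturbed graphs is what makes the learned group feature robust to layout and membership variation.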
Video Relation Detection with Spatio-Temporal Graph

Xufeng Qian, Yueting Zhuang, Yimeng Li, Shaoning Xiao, Shiliang Pu, Jun Xiao

What we perceive from visual content are not only collections of objects but the interactions between them. Visual relations, denoted by the triplet ⟨subject, predicate, object⟩, capture such interactions between pairs of objects.
Hierarchical Graph Semantic Pooling Network for Multi-modal Community Question Answer Matching

Jun Hu, Shengsheng Qian, Quan Fang, Changsheng Xu

Nowadays, community question answering (CQA) systems have attracted millions of users to share their valuable knowledge. Matching relevant answers to a specific question is a core function of CQA systems. Previous interaction-based matching approaches show promising performance in CQA systems. However, they typically suffer from two limitations: (1) They usually model content as word sequences, which ignores the semantics provided by non-consecutive phrases, long-distance word dependency and visual information. (2) Word-level interactions focus on the distribution of similar words in terms of position, while being agnostic to the semantic-level interactions between questions and answers. To address these limitations, we propose a Hierarchical Graph Semantic Pooling Network (HGSPN) to model the hierarchical semantic-level interactions in a unified framework for multi-modal CQA matching. Instead of viewing text content as word sequences, we convert it into graphs, which can model non-consecutive phrases and long-distance word dependency, better capturing the composition of semantics. In addition, visual content is also modeled into the graphs to provide complementary semantics. A well-designed stacked graph pooling network is proposed to capture the hierarchical semantic-level interactions between questions and answers based on these graphs. A novel convolutional matching network is designed to infer the matching score by integrating the hierarchical semantic-level interaction features. Experimental results on two real-world datasets demonstrate that our model outperforms state-of-the-art CQA matching models.
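The text-to-graph conversion and graph pooling steps can be sketched as below. The sliding-window co-occurrence graph and the degree-based top-k pooling are common, generic choices standing in for the paper's construction and stacked pooling network; both are assumptions for illustration.

```python
import numpy as np

def text_to_graph(tokens, window=2):
    """Build a word graph: nodes are unique tokens, edges connect words
    co-occurring within a sliding window, so non-consecutive phrases and
    long-distance dependencies become explicit structure."""
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    adj = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            a, b = idx[w], idx[tokens[j]]
            if a != b:
                adj[a, b] = adj[b, a] = 1
    return vocab, adj

def graph_pool(node_feats, adj, k):
    """Simple top-k graph pooling: keep the k highest-degree nodes,
    producing a coarser graph for the next semantic level."""
    deg = adj.sum(axis=1)
    keep = np.argsort(-deg)[:k]
    return node_feats[keep], adj[np.ix_(keep, keep)]
```

Stacking several pooling stages yields progressively coarser graphs, which is one way to realize the hierarchical semantic levels the abstract describes.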