With the rapid development of the Internet, users produce a large amount of social multimedia data every day. Relational information is ubiquitous in such data and is widely exploited by recommendation systems, expert finding, and other applications to mine implicit but valuable information. A network is the most common data structure for modeling relational information in social multimedia data, and we refer to such network data as social multimedia networks. In recent years, a large number of social multimedia network applications have emerged, driven by advances in machine learning techniques such as deep learning. A fundamental problem these applications must address is how to learn low-dimensional feature representations that effectively capture the implicit semantics of network nodes. Many social multimedia network applications rely on the semantics provided by learned node representations and benefit directly from better ones. However, real-world social multimedia networks have heterogeneous and sparse structures, and their vertices may be associated with multi-modal content. Social multimedia network representation learning is therefore a non-trivial task. We study representation learning approaches and applications for social multimedia networks and design models for specific tasks. The main contributions of this work are as follows:

(1) To handle the heterogeneous, sparse, and multi-modal properties of real-world social multimedia network data, we propose a novel Attention-aware Collaborative Multi-modal Heterogeneous Network Embedding model (A2CMHNE). A novel meta-path based approach exploits the implicit structural information and the multi-modal content information in heterogeneous multi-modal networks to learn structure-based and content-based representations for vertices. In addition, an attention-based collaborative mechanism integrates the structure-based and content-based representations into robust vertex representations. Experimental results on node classification and link prediction tasks show that the collaboration of structural information and multi-modal content information effectively handles the heterogeneous, sparse, and multi-modal properties of heterogeneous multi-modal networks.

(2) To handle the heterogeneity of social elements in Community Question Answering (CQA) matching tasks, we propose a Multi-modal Attentive Interactive Convolutional Matching method (MMAICM), which models the social elements of CQA systems as heterogeneous networks and learns social context representations of questions and answers from these networks. MMAICM leverages meta-paths to capture the implicit structural information in heterogeneous CQA networks and learns the social context representations of questions and answers from the captured structural information. An attention model then uses the learned social context representations to capture meaningful interactions between questions and answers, which effectively improves the performance of our matching model. Experimental results on two real-world CQA datasets verify that the social context information obtained through network representation learning helps better capture the matching patterns between questions and answers.

(3) To handle the multi-modal content in CQA matching tasks, we propose a Hierarchical Graph Semantic Pooling Network (HGSPN), which models the multi-modal content of questions and answers as networks and learns hierarchical semantic representations over the constructed networks to capture semantic-level interactions between the multi-modal content of questions and answers. We are among the first to model multi-modal content as networks in CQA matching, which captures non-consecutive and long-distance semantics as well as visual information. A well-designed stacked graph pooling network captures the hierarchical semantic-level interactions between questions and answers over these networks, and a novel convolutional matching network integrates the learned interactions to infer the relevance between a question and an answer. Experimental results on two real-world CQA datasets demonstrate that networks can effectively model the semantic-level interactions between multi-modal content.
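As an illustration, the attention-based collaborative mechanism of contribution (1) can be sketched as a softmax-weighted combination of the structure-based and content-based views of a vertex. This is a minimal sketch, not the model's actual formulation: the scoring function, the parameter shapes, and all names (`attention_fuse`, `W`, `q`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_fuse(h_struct, h_content, W, q):
    """Fuse two vertex representations with softmax attention weights.

    Each view is projected by W, scored against a query vector q, and the
    two scores are softmax-normalized to weight the convex combination.
    """
    views = np.stack([h_struct, h_content])   # (2, d): the two views
    scores = np.tanh(views @ W) @ q           # (2,): one score per view
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over the two views
    return weights[0] * h_struct + weights[1] * h_content

d = 8
h_s = rng.normal(size=d)        # structure-based embedding (toy data)
h_c = rng.normal(size=d)        # content-based embedding (toy data)
W = rng.normal(size=(d, d))     # projection parameters (toy data)
q = rng.normal(size=d)          # attention query vector (toy data)

h = attention_fuse(h_s, h_c, W, q)
print(h.shape)  # (8,)
```

In practice the attention parameters would be trained jointly with the embedding objective, so that vertices with sparse structure lean on their content view and vice versa; the fixed random parameters here only demonstrate the mechanics of the fusion step.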