
Movie Data Mining And Analysis Based On Heterogeneous Information Network

Posted on: 2019-09-06
Degree: Master
Type: Thesis
Country: China
Candidate: C Li
GTID: 2428330596964841
Subject: Computer technology
Abstract/Summary:
Data-driven decision making continues to expand and deepen in many industries. In the movie industry, application systems analyze massive amounts of movie data to recommend preferred movies and artists to users. Recently, some companies have tried to use the results of data analysis to assist in casting actors and designing scripts, and many works produced this way have been successful. Although movie data is rich in information, most current analysis methods focus on specific kinds of information, which limits the completeness of their results. A heterogeneous information network (HIN), by contrast, is an information network that consists of multiple types of vertices and edges, and it can effectively model complex data containing different kinds of information. In recent years, the problem of mining and analyzing heterogeneous information networks has been widely studied, and experiments show that algorithms achieve better results when they take heterogeneity into account. We therefore use a heterogeneous information network to organize the information involved in movie data. Furthermore, we propose two embedding algorithms to represent the key information of the heterogeneous network more effectively. On this basis, we propose a query-driven analysis scheme that completes a variety of analysis tasks. The main contents of this thesis are as follows.

(1) Construction of a movie information network based on the concept of heterogeneous information network. The movie data set is an information network consisting of multiple types of vertices (movie, person, studio, etc.) and relationships (a person acting in a movie, a person directing a movie, etc.). After the raw data crawled from movie.douban.com is obtained and cleaned, information is extracted from it according to a predefined network schema and used to construct the network. To store the network in a form convenient for analysis, we propose the concept of the SRT (Source-Relationship-Target) graph, on which we base the storage and management of the movie information network.

(2) Representation learning for heterogeneous information networks and a tag-associated short text representation learning algorithm. Analyzing network-structured data often requires task-specific algorithms. A better approach is to learn low-dimensional representations of vertices that can be used in various HIN data mining algorithms. Unlike representation learning for homogeneous information networks, algorithms for heterogeneous information networks need to consider type information. We use the concept of meta-path to model the structural information between nodes and then learn vertex representations from that structural information. We also learn representations of movie summaries, which are short texts. Such texts are usually represented symbolically through their keywords, which leads to data sparsity. To solve this problem, we propose an efficient tag-associated short text representation learning algorithm that obtains high-quality distributed representations of texts.

(3) Query-driven heterogeneous information network analysis. After studying relevant movie data analysis tasks, we find that most of them can be carried out as a series of query operations on the heterogeneous information network. Based on this, we design a set of query-driven analysis operations for heterogeneous information networks to handle multiple kinds of analysis and mining tasks. The query-driven scheme includes a query description graph, SRT-graph-based scheduling, and some basic computing operations. In this way, different kinds of analysis tasks can be handled in a consistent manner.

(4) Implementation of a prototype system based on movie data crawled from the Douban website. Building on the work of the previous chapters and with the help of some open-source resources, we design and implement a prototype movie data analysis system. The system is composed of layers, each corresponding to a specific abstraction and providing the layer above with the services it requires. The system provides two kinds of user interfaces to meet different analysis requirements. Testing the system on a range of analysis tasks shows that it can effectively explore the important information in movie data and serve a variety of analysis scenarios.
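
The ideas of typed vertices, typed relationships, and meta-paths in the abstract can be illustrated with a small Python sketch. It is only an illustrative reading: the class `TinyHIN`, its methods, the `~` prefix for inverse relationships, and the sample names are all assumptions made for this example, and it does not reproduce the thesis's SRT graph or its query-driven scheme. It stores each edge as a (source, relationship, target) triple and answers a meta-path query by following a sequence of relationships.

```python
class TinyHIN:
    """Toy heterogeneous network (hypothetical; not the thesis's SRT graph)."""

    def __init__(self):
        self.node_type = {}   # vertex name -> vertex type
        self.adj = {}         # source -> relationship -> set of targets

    def _link(self, source, relationship, target):
        self.adj.setdefault(source, {}).setdefault(relationship, set()).add(target)

    def add_edge(self, source, source_type, relationship, target, target_type):
        # Record vertex types, the forward edge, and an inverse edge ("~rel").
        self.node_type[source] = source_type
        self.node_type[target] = target_type
        self._link(source, relationship, target)
        self._link(target, "~" + relationship, source)

    def follow(self, start, meta_path):
        # Follow a sequence of relationships from `start`; return reached vertices.
        frontier = {start}
        for relationship in meta_path:
            frontier = {
                target
                for source in frontier
                for target in self.adj.get(source, {}).get(relationship, ())
            }
        return frontier


hin = TinyHIN()
hin.add_edge("Actor A", "person", "acts_in", "Movie X", "movie")
hin.add_edge("Actor B", "person", "acts_in", "Movie X", "movie")
hin.add_edge("Actor A", "person", "acts_in", "Movie Y", "movie")
hin.add_edge("Director C", "person", "directs", "Movie X", "movie")

# Meta-path person -acts_in-> movie <-acts_in- person: co-actors of Actor A.
co_actors = hin.follow("Actor A", ["acts_in", "~acts_in"]) - {"Actor A"}

# Meta-path person -acts_in-> movie <-directs- person: directors Actor A worked with.
directors = hin.follow("Actor A", ["acts_in", "~directs"])
```

Storing the inverse relationship alongside each edge keeps meta-paths that traverse an edge backwards (movie to actor) as simple as forward ones, at the cost of doubling the stored adjacency.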
Keywords/Search Tags: Movie Data, Heterogeneous Information Network Analysis, Data Mining, Representation Learning