
Topic Model For Mining Multi-domain Data And Application

Posted on: 2017-11-03
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Chen
Full Text: PDF
GTID: 2428330590991580
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of the Internet, sharing and spreading information has become far easier than ever before. This convenience brings its own problem, namely information overload. People shop online, make friends, watch videos, and browse news on the Internet, and in most cases they are interrupted by worthless or even offensive information. When people look for items of interest, such as goods they need, favorite movies, or like-minded people, these interruptions damage the user experience. Search engines help users find valuable items quickly, but users do not always know exactly what they want; recommender systems were proposed to handle this case. People's needs are also diverse. For example, someone who likes a movie adapted from a novel is likely to read the novel, and people tend to be interested in news shared by like-minded friends.

This thesis studies how to mine and cluster data from multi-domain datasets and explores the application to cross-domain recommender systems. We propose a novel algorithm called OVCLDA for analyzing multi-domain datasets. It discovers latent topics across domains and clusters terms based on these topics. The model computes not only the multinomial distribution of words over topics but also the multinomial distribution of words over corpus IDs. Latent features can be shared across domains, while each domain can also hold its own specific features. Online learning is introduced into the algorithm so that it can handle streaming data, and the algorithm is much more efficient than Gibbs sampling, which is based on Markov chain Monte Carlo. In short, the algorithm is well suited to the big data era.

Topic models have already demonstrated their ability to analyze textual data, and variants of LDA can be expected to handle the same task. In recommendation problems, however, rating data is commonly used to train a model that then predicts the missing ratings. One widely used approach in recommender systems is the latent factor model (LFM). LFM is essentially similar to the topic model, because both discover latent features and relations beneath the surface and both represent high-dimensional data with lower-dimensional features. We improve the existing matrix factorization algorithm and propose DSSCM to deal with the cross-domain recommendation problem. Because it can use data from various domains, it helps alleviate the data sparsity of the traditional recommendation problem. Finally, we conducted many experiments, and the results met expectations. Thus the topic model and its related model, LFM, can be used to analyze discrete data from multiple domains, and these approaches are promising for future work.
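
The latent factor idea mentioned above, representing users and items by low-dimensional vectors and predicting a missing rating from their inner product, can be illustrated with a minimal sketch. This is a generic single-domain matrix factorization trained by stochastic gradient descent, not the DSSCM model proposed in the thesis. The function names (`factorize`, `predict`), the hyperparameters, and the toy ratings are all assumptions made for illustration.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02,
              epochs=500, seed=0):
    """Fit latent factors to observed (user, item, rating) triples by SGD.

    Illustrative only: a plain regularized matrix factorization,
    not the thesis's DSSCM.
    """
    rng = random.Random(seed)
    # Small random initialization of user factors P and item factors Q.
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Predicted rating: inner product of user and item factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

# Hypothetical toy data: users 0-1 prefer items 0-1, users 2-3 prefer
# items 2-3. The pairs (0, 1) and (3, 2) are left unobserved.
toy_ratings = [
    (0, 0, 5), (0, 2, 1), (0, 3, 1),
    (1, 0, 5), (1, 1, 5), (1, 2, 1), (1, 3, 1),
    (2, 0, 1), (2, 1, 1), (2, 2, 5), (2, 3, 5),
    (3, 0, 1), (3, 1, 1), (3, 3, 5),
]
P, Q = factorize(toy_ratings, n_users=4, n_items=4)
missing = predict(P, Q, 0, 1)  # estimate for an unobserved rating
```

Each user and item is described by only `k` numbers, so an unobserved pair such as `(0, 1)` can still be scored from factors learned on the observed ratings. A cross-domain method would additionally share some of these factors between domains, which this sketch does not attempt.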
Keywords/Search Tags:Multi-domain data, Topic Model, Latent Factor Model, Clustering, Recommendation Problem