Research On HM-LDA Generation Model Based On Hadoop

Posted on: 2016-12-27 | Degree: Master | Type: Thesis
Country: China | Candidate: L Zuo | Full Text: PDF
GTID: 2308330464951012 | Subject: Communication and Information System
Abstract/Summary:
In recent years, with the vigorous development of Internet technology and the increasing popularity of Internet applications, microblogs and BBS have taken an increasingly important position in social life. All kinds of topics, words, and new things spread quickly through BBS and microblog platforms, and network events occur more and more frequently as a result. Research institutions attach great importance to monitoring and tracking Internet public opinion. How to discover hot topics in a mass of unstructured data and grasp the trend of public opinion has become a focus of research.

This thesis focuses on public opinion about colleges on the Internet. By collecting information from BBS, microblogs, and other campus-themed networks, it is possible to find out what students care about. Regular expressions are used to filter the various kinds of noise out of the corpus in preparation for model building. Because the texts are short and the volume of data is large, the LDA model algorithm is optimized. The HM-LDA model applies a clustering algorithm and adds user comments and forwarding information to the modeling, which addresses the lack of information in short texts and improves the accuracy of topic mining. Running the algorithm on the Hadoop platform with the MapReduce programming model for distributed computing makes efficient processing of massive amounts of data possible.

Experiments verify the accuracy and efficiency of distributed topic detection with HM-LDA. On massive data, distributed HM-LDA topic modeling is more accurate than the LDA algorithm. The results also show that, for a fixed data size, the algorithm runs more efficiently as more compute nodes are added to the cluster.
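The preprocessing described in the abstract (regular-expression noise filtering, then folding comments and forwards into the post they belong to so that each short text becomes a longer document) might be sketched as follows. This is only an illustration under assumptions: the noise patterns, the function names `clean` and `build_pseudo_documents`, and the input layout are hypothetical and are not taken from the thesis, which does not list its actual rules here.

```python
import re
from collections import defaultdict

# Hypothetical noise patterns; the thesis does not specify which ones it uses.
URL_RE = re.compile(r"https?://\S+")   # links
MENTION_RE = re.compile(r"@\w+")       # @user mentions
WHITESPACE_RE = re.compile(r"\s+")     # runs of whitespace


def clean(text):
    """Strip URLs, mentions, and hashtag markers, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")      # keep the hashtag word, drop the marker
    return WHITESPACE_RE.sub(" ", text).strip()


def build_pseudo_documents(posts, replies):
    """Merge each post with its comments/forwards into one longer document.

    posts:   dict mapping post id -> post text
    replies: list of (parent_post_id, text) pairs (comments or forwards)
    Replies whose parent id is not in `posts` are ignored.
    """
    parts = defaultdict(list)
    for post_id, text in posts.items():
        parts[post_id].append(clean(text))
    for parent_id, text in replies:
        if parent_id in posts:
            parts[parent_id].append(clean(text))
    # Drop pieces that became empty after cleaning before joining.
    return {pid: " ".join(p for p in chunks if p) for pid, chunks in parts.items()}
```

The merged documents would then be the input to topic modeling; in a MapReduce setting, the cleaning step is a natural fit for the map phase because each text is processed independently.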
Keywords/Search Tags: topic mining, short text, distribution, Hadoop