The Study On The Extraction Of The News Topic Based On Web Mining Of Micro-blog Hot Words

Posted on: 2015-04-16  Degree: Master  Type: Thesis
Country: China  Candidate: R Q Tang  Full Text: PDF
GTID: 2298330467989319  Subject: Software engineering
Abstract/Summary:
With the development of Internet technology, the traditional media landscape has gradually changed; Internet-based new media spread information quickly and with high transparency. Micro-blog, one of the most efficient Internet media for news communication, is increasingly popular among young people. Its emergence offers a new approach to news topic discovery. Existing news mining models and algorithms obtain their data mainly by crawling web page information, and they suffer from slow data updates, poor real-time performance, and low mining accuracy. Providing news topics by mining micro-blog hot words therefore has practical significance.

Building on an analysis of the underlying theory and related techniques, this paper establishes a probabilistic topic model based on LDA (Latent Dirichlet Allocation) for mining and analyzing micro-blog hot words. It designs a web page text crawler algorithm that supports dynamic pages. The algorithm first analyzes the JavaScript files or the page code, then uses the HTTP protocol to send requests to the server for the specific information. This approach resembles how a person browsing the web retrieves information, and it makes it possible to efficiently analyze the content of pages that are loaded asynchronously.
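The crawler idea described above (inspect the page or script source, then request the underlying data directly over HTTP) can be illustrated with a minimal sketch. This is not the algorithm from the thesis, whose details are not given in the abstract. The endpoint pattern (`/ajax/`, `/api/`, `.json`), the sample page, the `example.com` URLs, and the function names are all assumptions made for illustration, and the request is only prepared, not sent.

```python
import re
from urllib.parse import urljoin
from urllib.request import Request

# Hypothetical heuristic: quoted strings that look like asynchronous data endpoints.
ENDPOINT_RE = re.compile(r"""["']([^"'\s]*(?:/ajax/|/api/|\.json)[^"'\s]*)["']""")


def find_async_endpoints(page_url, source):
    """Return absolute URLs of candidate asynchronous endpoints found in page or script source."""
    found = []
    for match in ENDPOINT_RE.finditer(source):
        absolute = urljoin(page_url, match.group(1))
        if absolute not in found:  # keep first-seen order, drop repeats
            found.append(absolute)
    return found


def build_request(endpoint, referer):
    """Prepare (but do not send) a GET request with browser-like headers."""
    return Request(endpoint, headers={
        "Referer": referer,
        "X-Requested-With": "XMLHttpRequest",
        "User-Agent": "Mozilla/5.0",
    })


# Illustrative page source; a real crawler would download this first.
SAMPLE_SOURCE = """
<script>
  var feed = "/ajax/feed?page=1";
  fetch('https://example.com/api/hot.json');
  var logo = "/static/logo.png";
</script>
"""

PAGE_URL = "https://example.com/home"
endpoints = find_async_endpoints(PAGE_URL, SAMPLE_SOURCE)
requests_to_send = [build_request(url, PAGE_URL) for url in endpoints]
# Passing one of these objects to urllib.request.urlopen would perform the actual fetch.
```

Requesting the data endpoint directly avoids rendering the page in a browser engine, which is one plausible reading of why the abstract calls the approach efficient for asynchronously loaded content.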
At the same time, the paper proposes a method for filtering duplicate and advertising posts out of the raw microblog data, together with a formula for calculating hot words. After the probabilistic LDA model yields the main news topics, the breadth and burstiness of each keyword are analyzed together, and a heat value is calculated for the keywords corresponding to each word unit. These heat values determine the hot news, which is returned to the user.

The main research work of this paper includes the following aspects. First, an appropriate and effective model for text data acquisition and preprocessing is established: a text crawler algorithm supporting dynamic web pages is designed to collect page text, and Sina Weibo data are collected through its open platform as an example. The collected data are processed with the ICTCLAS lexical analysis system and stop words are removed, and the preprocessed text is finally converted into a feature representation. Second, the topic of each text is determined by establishing a hot news topic model based on probabilistic LDA.

Experimental evaluation shows that the LDA-based probabilistic topic model presented in this paper can effectively extract hot news from microblog data.
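The abstract mentions duplicate filtering and a hot-word calculation that combines breadth and burstiness, but it does not state the formula. The sketch below is therefore only an illustration under loud assumptions: breadth is taken as the share of current posts containing a word, burstiness as a smoothed ratio of current to previous post counts, and heat as their product. All function names and sample posts are hypothetical. Whitespace splitting stands in for word segmentation; Chinese text would need a segmenter such as ICTCLAS.

```python
from collections import Counter


def drop_duplicates(posts):
    """Keep the first copy of each post, ignoring case and extra whitespace."""
    seen, unique = set(), []
    for post in posts:
        key = " ".join(post.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


def document_frequency(posts):
    """Count, for each word, how many posts contain it at least once."""
    counts = Counter()
    for post in posts:
        counts.update(set(post.lower().split()))
    return counts


def heat_scores(current_posts, previous_posts):
    """Assumed formula (not from the thesis): heat = breadth * burst."""
    current = document_frequency(current_posts)
    previous = document_frequency(previous_posts)
    total = max(len(current_posts), 1)
    scores = {}
    for word, count in current.items():
        breadth = count / total                       # share of current posts with the word
        burst = (count + 1) / (previous[word] + 1)    # add-one smoothed growth ratio
        scores[word] = breadth * burst
    return scores


previous_window = ["weather is nice today", "traffic is slow"]
current_window = [
    "earthquake hits city",
    "Earthquake  hits   city",   # duplicate differing only in case and spacing
    "traffic is slow",
    "earthquake relief begins",
]

unique_posts = drop_duplicates(current_window)
scores = heat_scores(unique_posts, previous_window)
hottest = max(scores, key=scores.get)
```

In this toy data, "earthquake" appears in two of the three unique current posts and in none of the previous ones, so it scores highest, while "traffic" appears at the same rate in both windows and scores low. Filtering duplicates first matters because reposts would otherwise inflate breadth.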
Keywords/Search Tags: Micro-blog hot words, News Topic, OAuth Protocol, LDA Model, Probabilistic Topic Model