
Analysis Of Micro-blog Public Opinion Based On Text Information Extraction From Webpage

Posted on: 2014-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z T Xiong
Full Text: PDF
GTID: 2268330422950377
Subject: Computer technology
Abstract/Summary:
According to statistics released by the China Internet Network Information Center (CNNIC), China had 309 million micro-blog users by the end of December 2012. The advantages of micro-blogs, including fission-style propagation, diverse communication terminals, a low barrier to entry, and high interactivity, have made them an important birthplace of online public opinion. A public sentiment index released in July 2011 by the Network Public Opinion (Word of Mouth) Institute of the Communication University of China showed that micro-blogs had become China's second largest source of public opinion, after news media reports, and were playing an increasingly important role in steering public opinion. How to obtain micro-blog public opinion information promptly, understand the current state of public opinion, and predict its trend, so as to exploit its benefits and limit its harms, has become an important new topic in public opinion research.

Against this background, the paper studies methods for processing micro-blog data and analyzing public opinion with Web information extraction technology. First, according to the characteristics of micro-blog text, the Heritrix topic-focused Web crawler is used to collect micro-blog pages and store them as mirrored pages. Next, an easily accessible DOM tree is built for each collected page, exploiting the nested structure of HTML tags. Because micro-blog text is free in form and nonstandard in language, normalization methods are proposed that combine manual tagging with network corpus processing to handle the punctuation, emoticons, stop words, out-of-vocabulary words, and similar items contained in the text. For Chinese word segmentation and part-of-speech tagging, the Rwordseg segmentation tool for the R language is compared with the NLPIR Chinese word segmentation system.
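
The normalization step described above might look roughly like the following Python sketch. It is an illustration only, not the thesis's implementation: the stop-word list, the emoticon map, the bracketed emoticon format, and the function name `normalize` are all assumptions made for the example, and whitespace splitting stands in for a real Chinese word segmenter such as Rwordseg or NLPIR.

```python
import re

# Hypothetical resources for illustration; a real system would load
# curated stop-word and emoticon lists built from tagged corpora.
STOP_WORDS = {"the", "a", "of", "is"}
EMOTICON_MAP = {"[happy]": "EMO_POS", "[angry]": "EMO_NEG"}

URL = re.compile(r"https?://\S+")
# Assumed emoticon format: a short token wrapped in square brackets.
BRACKET_EMOTICON = re.compile(r"\[[^\[\]\s]{1,8}\]")
PUNCTUATION = re.compile(r"[^\w\s]")


def normalize(post):
    """Return a cleaned token list for one micro-blog post."""
    text = URL.sub(" ", post.lower())
    # Known emoticons become sentiment placeholders; unknown ones are dropped.
    text = BRACKET_EMOTICON.sub(
        lambda m: " " + EMOTICON_MAP.get(m.group(0), "") + " ", text
    )
    # Remaining punctuation carries no lexical content here.
    text = PUNCTUATION.sub(" ", text)
    # Whitespace split is a placeholder for a proper segmenter.
    return [tok for tok in text.split() if tok not in STOP_WORDS]


if __name__ == "__main__":
    print(normalize("The food is GREAT!!! [happy] http://t.cn/abc"))
```

Mapping emoticons to placeholders rather than deleting them keeps a signal that later sentiment analysis could use; whether the thesis does this is not stated in the abstract.
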
Because clustering short micro-blog texts easily leads to data sparseness, the LDA model is used in this paper to represent micro-blog text. The advantages and disadvantages of partition-based and hierarchical clustering methods are compared, and a new method combining k-means clustering with a hierarchical clustering algorithm is proposed. For public opinion analysis, subjective and objective texts are classified with a 2-POS model, and emotional words are tagged with the CRFs method, which combines the regularities of emotion words with context information. Finally, orientation analysis is performed on micro-blog topics and on the viewpoints expressed in comments by means of an emotional dictionary.

To evaluate the techniques and methods used in the paper, experimental studies, data comparisons, and quantitative analyses are carried out on Sina, a representative domestic micro-blog. Preliminary experimental results show that techniques such as R-language word segmentation, the LDA model, short-text clustering that combines k-means with hierarchical clustering, the 2-POS model, and CRFs hold certain advantages over traditional data processing methods on micro-blog data and can better accomplish the extraction, statistics, and analysis of micro-blog public opinion data.
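
As a rough illustration of dictionary-based orientation analysis, the Python sketch below scores a tokenized comment against small polarity word lists. The word lists, the negation rule (a negator flips only the token immediately after it), and the names `orientation` and `label` are assumptions made for this example; the abstract does not describe the dictionary or the scoring rule actually used.

```python
# Hypothetical miniature emotional dictionary; a real one would hold
# thousands of entries, possibly with graded weights.
POSITIVE = {"good", "great", "support"}
NEGATIVE = {"bad", "terrible", "oppose"}
NEGATORS = {"not", "never", "no"}


def orientation(tokens):
    """Sum word polarities; a negator flips the token right after it."""
    score = 0
    negate = False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        score += -polarity if negate else polarity
        negate = False  # negation scope is a single token (assumption)
    return score


def label(score):
    """Map a numeric score to a coarse orientation label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(label(orientation(["not", "good"])))
```

A single-token negation window is the simplest possible choice and misses cases such as "not very good"; a wider window or syntactic scope would be a natural refinement.
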
Keywords/Search Tags: Micro-blog, Information Extraction, Text Clustering, Analysis of Public Opinion