
Research on New Word and Hot Topic Discovery Technology for Chinese Micro-blogs

Posted on: 2016-03-27
Degree: Master
Type: Thesis
Country: China
Candidate: T Liang
GTID: 2308330461497540
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, new media, represented by micro-blogs, have come into wide use. The massive volume of micro-blog text contains many new words, which complicates research in micro-blog-related fields. In addition, the large number of users on a micro-blog platform continuously update information, which gives rise to new hot topics. However, it is difficult for users to identify hot topics accurately from the platform itself, so computational methods are needed to discover them. The main research work is as follows:

For micro-blog new word discovery, we propose a method that combines rules with the N-gram algorithm to extract candidate new words. First, we study the patterns of new words and define the scope of new words considered in this study. Then, based on how new words are formed, we establish rules for extracting word fragments. From the extracted fragments, the N-gram algorithm produces candidate strings, which are filtered with a trained garbage dictionary and by word frequency to obtain the desired list of candidate new words. Finally, within a CRF framework, we add the linguistic and statistical features of new words and study their effect on new word discovery. The experiments show that the candidate word extraction method clearly improves the performance of new word discovery.

Research on micro-blog hot topic discovery covers text similarity calculation and text clustering. For text similarity calculation, we propose a similarity algorithm based on the cosine law and an A value matrix.
First, the LDA model is used for feature selection, and the A value of each feature item is calculated. Second, the weight of each feature item is calculated with the classical TF-IDF algorithm, a VSM model of the micro-blog text is constructed, and the cosine value between text vectors is calculated according to the cosine law. Finally, parameters are used to adjust the balance between the mathematical and semantic relations, which makes clustering more accurate and improves the performance of the micro-blog topic discovery algorithm. For text clustering, we improve Single-Pass clustering by taking the user relationship and the forward relationship into account: a double similarity threshold is set so that both relationships are identified during clustering. After the original topics are obtained, the CURE clustering method merges them to compensate for the inaccuracy of the topic clustering.
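
The candidate extraction step for new words (N-gram strings filtered by a garbage dictionary and by word frequency) can be illustrated with a minimal sketch. The function name `extract_candidates`, its parameters, and the sample fragments are assumptions made for illustration; the abstract does not give the thesis's rules, n-gram range, or thresholds, and the CRF stage is not shown.

```python
from collections import Counter


def extract_candidates(fragments, max_n=4, min_freq=2, garbage=frozenset()):
    """Count character n-grams (length 2..max_n) over text fragments and
    keep those that are frequent enough and not in the garbage dictionary.

    All parameter values are illustrative assumptions, not thesis settings.
    """
    counts = Counter()
    for frag in fragments:
        for n in range(2, max_n + 1):
            # Slide a window of width n over the fragment.
            for i in range(len(frag) - n + 1):
                counts[frag[i:i + n]] += 1
    return {
        gram: freq
        for gram, freq in counts.items()
        if freq >= min_freq and gram not in garbage
    }


# Hypothetical fragments, standing in for the output of the rule-based step.
fragments = ["给力啊", "太给力", "真给力了"]
candidates = extract_candidates(fragments, max_n=3, min_freq=3, garbage={"力了"})
print(candidates)
```

In this toy input only the string that recurs in all three fragments passes the frequency filter; a real garbage dictionary would be trained from data, as the abstract states.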
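
The similarity and clustering steps can be sketched in the same spirit: TF-IDF weights, cosine similarity between text vectors, and a basic Single-Pass clustering loop. This is a simplified baseline under stated assumptions, not the thesis's algorithm: it uses one common smoothed IDF variant and a single similarity threshold, and it leaves out the LDA feature selection, the A value matrix, the double threshold on user and forward relationships, and the CURE merging step, none of which are defined in the abstract. All function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) from tokenized documents.

    Uses the smoothed IDF log((1 + N) / (1 + df)) + 1, one common variant.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        if not doc:
            vectors.append({})
            continue
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(vectors, threshold):
    """Assign each vector to the most similar existing cluster, or open a
    new cluster when no similarity reaches the threshold."""
    clusters = []  # each cluster: {"sum": summed vector, "members": [indices]}
    for idx, vec in enumerate(vectors):
        best, best_sim = None, 0.0
        for cluster in clusters:
            # The summed vector points the same way as the mean centroid.
            sim = cosine(vec, cluster["sum"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
            for term, weight in vec.items():
                best["sum"][term] = best["sum"].get(term, 0.0) + weight
        else:
            clusters.append({"sum": dict(vec), "members": [idx]})
    return [cluster["members"] for cluster in clusters]


# Hypothetical, already-tokenized micro-blog posts.
docs = [
    ["earthquake", "rescue", "team"],
    ["earthquake", "rescue", "news"],
    ["football", "match", "final"],
    ["football", "final", "goal"],
]
vectors = tfidf_vectors(docs)
clusters = single_pass(vectors, threshold=0.3)
print(clusters)
```

Single-Pass results depend on the order in which posts arrive and on the threshold, which is one reason a later merging step such as the CURE stage mentioned above can be useful.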
Keywords/Search Tags:micro-blog, new words discovery, hot topic, clustering