
Research On Topic Detection Technology For Chinese Micro-blog

Posted on: 2016-12-24
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Chen
GTID: 2308330461971346
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Web 2.0 and mobile communication technology, micro-blogs have risen to become a main platform on which users publish personal information, share it with others, and obtain information from others. The simplicity and convenience of micro-blog platforms attract more and more Internet users to register, and information on micro-blogs spreads in a fission-like manner; together these produce an enormous volume of text on the platform, and users consequently suffer from information overload. Research on topic discovery for Chinese micro-blogs (throughout this thesis, "micro-blog" refers to Chinese micro-blog) classifies these texts into categories and organizes them, which brings several advantages. First, it helps address information overload by giving users a quick understanding of the topics present in the micro-blog space. Second, it lays a solid foundation for discovering and tracking hot micro-blog topics. Lastly, it provides evidence to support users' decision-making, helps detect Internet rumors in time so that steps can be taken to curb their spread, and supports guiding online public opinion correctly, purifying the Internet environment, and promoting the healthy development of micro-blog platforms.

Micro-blog topic discovery mainly involves the following key technologies: micro-blog data acquisition, text pre-processing, text feature selection, text similarity computation, and text clustering.
Of these, two key technologies, text feature selection and text similarity computation, are studied further in this thesis, and improved algorithms are proposed to address their existing problems.

Firstly, micro-blog texts are short and carry little information, so their valid features are sparse and difficult to extract. To address this, a novel feature selection method for micro-blog text, based on statistical and semantic information, is proposed. It adopts three strategies: POS (part-of-speech) grouping; an evaluation function that integrates TF-IDF (Term Frequency-Inverse Document Frequency), POS, and term length; and the semantic relevance between a term and the micro-blog text. The method is then applied together with a Naive Bayes classification algorithm, and experimental results on an open micro-blog corpus show that it achieves higher text categorization precision than traditional strategies, indicating that the terms it selects represent the topic of a micro-blog text more accurately.

Secondly, sparse features make micro-blog text similarity calculation inaccurate, and traditional text similarity algorithms do not adequately consider either the semantic relevance between words or the structured information in micro-blog text. To address this, a novel micro-blog text similarity method based on time, semantics, and social relationships is proposed. First, the definition of common blocks between micro-blog texts is extended, and a new text semantic similarity model based on common block sequences is established. Second, the creation time of micro-blog texts and structured information such as forwarding and comment relations between them are used to revise the text semantic similarity model, so that these factors jointly measure the similarity between short micro-blog texts.
Lastly, the algorithm is combined with the Single-Pass clustering algorithm to discover micro-blog topics. Experimental results show that the proposed algorithm measures the similarity between micro-blog texts more precisely than traditional algorithms.

Finally, the two improved algorithms are fused into a micro-blog topic discovery method based on the integration of feature selection and similarity measurement. Experiments on Chinese micro-blog data show that this method improves the quality of micro-blog topic discovery more effectively than the topic discovery method based on the text similarity algorithm.
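
For readers unfamiliar with Single-Pass clustering, the following is a minimal illustrative sketch, not the thesis's implementation. It assumes each micro-blog is already tokenized, and it substitutes plain cosine similarity over term counts for the thesis's similarity model based on common block sequences, time, and social relationships. The names `cosine` and `single_pass`, the sample tokens, and the threshold value 0.5 are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(docs, threshold=0.5):
    """Cluster tokenized documents in one pass over the input.

    Each document joins the most similar existing cluster if that
    similarity reaches `threshold`; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of document indices.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [int]}
    for i, tokens in enumerate(docs):
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            # The centroid is kept as the sum of member vectors;
            # cosine similarity is unaffected by this scaling.
            best["centroid"].update(vec)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return [c["members"] for c in clusters]


# Hypothetical tokenized micro-blogs (placeholder tokens, not thesis data).
docs = [
    ["earthquake", "sichuan", "rescue"],
    ["earthquake", "rescue", "team"],
    ["football", "match", "goal"],
    ["goal", "football", "league"],
]
print(single_pass(docs, threshold=0.5))
```

Because each text is compared only against the clusters that exist when it arrives, the result depends on input order and on the threshold. In the thesis's setting, the `cosine` function would be replaced by the proposed similarity measure.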
Keywords/Search Tags: Chinese micro-blog, feature selection, common block sequence, text similarity, micro-blog topic detection