| Abstract: Chinese micro-blog text is written casually, carries incomplete information, and is very noisy, so extracting its key information has become a focus of Chinese natural language processing. Automatic keyword extraction is a branch of text mining and underlies text processing tasks such as text retrieval, text comparison, text classification, and clustering. This paper addresses how to extract, from Chinese micro-blog text, the theme words that explain a micro-blog's content, namely its keywords. Traditional manual methods are not applicable to the huge volume of micro-blog data. In this paper, the probabilistic topic model LDA is applied as the foundation for Chinese keyword extraction. On top of statistical methods at the "vocabulary level", an external semantic repository is introduced to increase the weight of semantic words, and a probabilistic topic model fusing multiple features is proposed to make keyword extraction more accurate and more practical. The main work is as follows. First, we study the structure of Chinese micro-blog data and analyze the existing latent semantic models on Chinese micro-blog data. Second, we study latent topic models for Chinese micro-blogs, analyze their characteristics in detail, and construct a bag-of-words model of Chinese micro-blog text. 
To compensate for the information lost when text is represented by word frequency alone, to reduce the data sparseness of short texts, and to better express and quantify the uncertainty in short texts, each text vector is mapped directly onto a distribution over the underlying latent topics. Third, because the theme of a Chinese micro-blog is determined by its own content, this paper proposes a multi-feature probabilistic model, based on the latent topic model, that fuses the semantics implied by the text itself with external semantics. The model combines the "HowNet" semantic database with statistical weights and studies vocabulary for keyword extraction at both the coarse-grained "thematic hierarchy" and the fine-grained "lexical level". We evaluate the method experimentally, and the results show that it is effective for Chinese micro-blog keyword extraction and has good practical value. |