
Statistical Law Of The Same Frequency Words For Text Mining And Short Text Categorization

Posted on: 2016-05-11 | Degree: Master | Type: Thesis
Country: China | Candidate: X C Li | Full Text: PDF
GTID: 2308330461971613 | Subject: Software engineering
Abstract/Summary:
With the rapid development of the mobile Internet, large volumes of data are accumulating quickly in text form. More and more scholars are focusing on how to make text categorization more efficient and accurate, and research on processing both long and short texts has become urgent.

Keyword extraction is a basic and important task in text categorization. Low-frequency words often make up a large proportion of each text, and most of them are rare words, malformed words, or words irrelevant to the topic of the text. These low-frequency words also greatly reduce the efficiency of keyword extraction. At present, there is no operational standard for handling low-frequency words in keyword extraction. This dissertation studies the statistical law of same-frequency words (words that occur the same number of times in a text) in Chinese texts and applies the results to keyword extraction, effectively improving its efficiency. The approach addresses the question, of concern to the academic community, of how to handle low-frequency words, and provides an effective operational method.

Short text categorization is characterized by sparse features, strong word ambiguity, and a constant stream of new vocabulary. As a result, categorization results for short texts are far worse than those for long texts, and improving the accuracy of short text classification has become a focus of academic research. This dissertation uses the database resources of Wikipedia for disambiguation and feature expansion, which effectively improves the precision, recall, and F1 score of short text classification.

The main work of this dissertation is as follows:

1) This dissertation compiles extensive statistics on same-frequency words in Chinese text. The statistics reveal two patterns: one describes the ratio of the number of words with frequency 1 to the number of distinct words, and the other describes the ratio of the number of words with frequency n to the number of words with frequency 1. Based on Zipf's law, the dissertation then derives a mathematical expression for same-frequency words that fits Chinese text well. It also re-establishes the formula for the boundary point between high-frequency and low-frequency words and verifies its correctness.

2) This dissertation proposes a keyword extraction method based on the same-frequency statistical law of Chinese text. The method provides a theoretical foundation for handling low-frequency words in keyword extraction and improves extraction efficiency. It shows that when the text length is no less than 3010 words and words with frequency 1 or 2 are excluded from the TF-IDF calculation, efficiency improves by a factor of 2 to 7 without any loss of keywords.

3) This dissertation proposes a method for building a Bayesian belief network based on the category index and the links between entries in Wikipedia. In short text categorization, the link information between concept nodes of the Bayesian network serves as the basis for deciding whether words are correlated. On this premise, the dissertation proposes a Bayesian-network-based method for expanding the features of short texts, which effectively alleviates the feature sparsity problem and improves the accuracy of short text categorization.
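
The two ratios described in item 1) and the low-frequency filtering in item 2) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The function names and the `min_freq` parameter are hypothetical, the input is assumed to be a list of already segmented tokens, and `booth_ratio` is the classical Zipf-based prediction 2 / (n(n+1)), included only for comparison, not the expression derived in the dissertation.

```python
from collections import Counter


def same_frequency_counts(tokens):
    """Map each frequency n to the number of distinct words occurring exactly n times."""
    word_freq = Counter(tokens)          # word -> number of occurrences
    return Counter(word_freq.values())   # frequency n -> number of words with that frequency


def same_frequency_ratios(tokens):
    """Return (I_1 / D, {n: I_n / I_1}), where I_n is the number of words
    with frequency n and D is the number of distinct words."""
    counts = same_frequency_counts(tokens)
    distinct = sum(counts.values())
    i1 = counts.get(1, 0)
    ratio_i1_distinct = i1 / distinct if distinct else 0.0
    ratios_in_i1 = {n: c / i1 for n, c in sorted(counts.items())} if i1 else {}
    return ratio_i1_distinct, ratios_in_i1


def booth_ratio(n):
    """Classical Zipf-based prediction of I_n / I_1 (assumption: used here
    only as a reference curve, not the dissertation's formula)."""
    return 2.0 / (n * (n + 1))


def drop_low_frequency(tokens, min_freq=3):
    """Remove words occurring fewer than min_freq times before any TF-IDF
    computation; min_freq=3 drops words with frequency 1 or 2."""
    word_freq = Counter(tokens)
    return [t for t in tokens if word_freq[t] >= min_freq]


if __name__ == "__main__":
    sample = "a a a b b c d e".split()   # toy stand-in for a segmented text
    print(same_frequency_ratios(sample))
    print(drop_low_frequency(sample))
```

On a real corpus, the measured values of I_n / I_1 could be compared against `booth_ratio(n)` to see how closely a text follows the classical prediction. The abstract reports the filtering as lossless only for texts of at least 3010 words, so the toy input above illustrates the mechanics rather than the result.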
Keywords/Search Tags:Same Frequency Words, Keyword Extraction, Short Text Categorization, Wikipedia, Feature Expansion