An Improved Approach To Weighting Chinese Terms Using Information Gain

Posted on: 2009-07-16
Degree: Master
Type: Thesis
Country: China
Candidate: X L Chen
Full Text: PDF
GTID: 2178360272975119
Subject: Computer system architecture
Abstract/Summary:
The rapid development of networks and information technology has made ever more knowledge easy to obtain, yet it has become increasingly difficult to locate quickly the specific knowledge a person needs. How to extract valuable information from massive amounts of data is therefore an important research topic. Many technologies for organizing and processing information have emerged, and text classification is one of them. Text classification can process large document collections, largely resolves the problem of disordered information, and helps users find the information they need quickly.

Text classification mainly comprises preprocessing, word segmentation, feature selection and related steps. Feature term weighting algorithms based on the Vector Space Model (VSM) and classification algorithms have long been research hotspots in text classification. This paper studies both. Because the existing algorithms have limitations, improved algorithms are proposed and experiments are carried out to verify them. The main work of the paper is as follows:

① The feature term weighting formulas TFIDF and TF.IDF.IG are analyzed.
② Further analysis of TF.IDF.IG shows that its improvement over TFIDF is incomplete: TF.IDF.IG considers only the distribution of a feature term over the document set and does not consider its distribution from other perspectives. In this paper, TF.IDF.IG is improved by taking into account the contribution of a feature term within a class in addition to its contribution between classes.
③ The KNN algorithm is studied. In KNN, the nearest neighbors of a test sample are all treated equally. To overcome this limitation, the membership degree from fuzzy mathematics is introduced into KNN to improve the category decision function.
④ To verify the improved TF.IDF.IG and the effectiveness of the improved KNN for Chinese document classification, two comparative experiments are carried out: 1) the results of the improved TF.IDF.IG are compared with those of TF.IDF.IG; 2) the results of the improved KNN classifier are compared with those of KNN. The experimental results show that the improved TF.IDF.IG is correct, effective and practical, and that the improved KNN classifier is likewise correct and practical.
Keywords/Search Tags: Feature selection, Feature vector, Vector Space Model, KNN algorithm
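
The abstract names two techniques without giving their formulas: the baseline TF.IDF.IG weight and a KNN vote that uses fuzzy membership degrees. The Python sketch below illustrates the general idea of each under stated assumptions and is not the thesis's own method. The IDF form log(N/df + 0.01), the treatment of information gain as the gain of the binary "term occurs in document" feature, and the similarity-proportional membership are all assumptions; the function names and toy documents are hypothetical. The within-class correction proposed in the thesis is not specified in the abstract, so it is not reproduced here. Documents are assumed to be already word-segmented into token lists.

```python
import math
from collections import Counter

def entropy(probs):
    # Shannon entropy in bits; zero-probability terms contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(term, docs, labels):
    # IG of the binary feature "term occurs in the document" w.r.t. the class.
    n = len(docs)
    classes = sorted(set(labels))
    h_class = entropy([labels.count(c) / n for c in classes])
    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]
    h_cond = 0.0
    for subset in (present, absent):
        if subset:
            probs = [subset.count(c) / len(subset) for c in classes]
            h_cond += len(subset) / n * entropy(probs)
    return h_class - h_cond

def tf_idf_ig(docs, labels):
    # Baseline weight: tf * idf * IG (assumed idf = ln(N/df + 0.01)).
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    ig = {t: information_gain(t, docs, labels) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t] + 0.01) * ig[t] for t in tf})
    return vectors

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def fuzzy_knn(query, train_vecs, train_labels, k=3):
    # Each of the k nearest neighbors votes in proportion to its similarity,
    # giving a membership degree per class instead of an equal-weight count.
    scored = [(cosine(query, v), lab) for v, lab in zip(train_vecs, train_labels)]
    nearest = sorted(scored, key=lambda p: p[0], reverse=True)[:k]
    total = sum(s for s, _ in nearest)
    membership = Counter()
    for s, lab in nearest:
        membership[lab] += s / total if total else 1 / len(nearest)
    return dict(membership)

# Hypothetical, already-segmented toy corpus with two classes.
docs = [["ball", "match"], ["ball", "goal"], ["stock", "market"], ["stock", "match"]]
labels = ["sports", "sports", "finance", "finance"]
vectors = tf_idf_ig(docs, labels)
membership = fuzzy_knn({"ball": 1.0}, vectors, labels, k=3)
predicted = max(membership, key=membership.get)
```

In this toy corpus "match" appears once in each class, so its information gain is zero and it drops out of every vector, while "ball" appears only in one class and keeps a positive weight. That is the between-class effect the IG factor contributes. The membership dictionary returned by `fuzzy_knn` sums to 1, so a neighbor that is barely similar to the query has correspondingly little say in the decision.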