Research On The Feature Dimension Reduction Method Based On Improved Mutual Information And LDA

Posted on: 2017-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y Huang
Full Text: PDF
GTID: 2308330488985677
Subject: Computer application technology
Abstract/Summary:
Text classification is an active research area in text mining. The classification process consists of several key stages, and the way each stage is handled significantly affects the classification results. Feature dimension reduction is one of the most important of these stages, and how to select text features effectively has become a popular research topic.

This thesis takes the mutual information method for text feature selection as its research object. By analyzing the deficiencies of mutual information in feature selection, it proposes an improved mutual information method. Traditional feature selection methods are based on mathematical statistics and therefore ignore the semantic relations between words. Drawing on applications of the LDA model in classification, this thesis integrates the LDA model into traditional feature selection to achieve feature dimension reduction and to improve classification performance. The main work is as follows:

Survey the literature and analyze the current state of Chinese text classification, in particular feature selection based on mutual information. Analyze the disadvantages of the mutual information method in feature selection and propose an improved mutual information method.

The mutual information method considers only the document frequency of key words in the text set; it does not take into account the term frequency of key words or the information between text categories. To address this problem, this thesis proposes an improved mutual information method based on term frequency and introduces the concepts of between-class dispersion and within-class dispersion. It then proposes a mutual information feature selection method that combines the frequency of key words with the words that distinguish categories. The experimental results show that the proposed method improves text classification performance to a certain extent.

Because traditional methods are based on mathematical statistics and ignore the semantic information between words, this thesis combines the improved mutual information method with LDA to achieve feature dimension reduction for texts. The LDA model is built in a Linux environment, and text classification is carried out with the KNN classification algorithm in the data mining tool WEKA. Compared with the improved mutual information method used alone, the improved mutual information method combined with LDA achieves a better classification result.
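
For orientation, the following is a minimal sketch of the conventional, document-frequency-based mutual information score that the abstract takes as its starting point. It is not the improved method proposed in the thesis, whose formulas are not given here. The function names, the choice of taking the maximum score over classes, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def mutual_information_scores(docs, labels):
    """Score each term by its maximum pointwise MI over all classes.

    docs   : list of token lists, one per document (hypothetical input format)
    labels : list of class labels, same length as docs
    """
    n_docs = len(docs)
    class_doc_count = Counter(labels)        # number of documents in each class
    term_doc_count = Counter()               # number of documents containing the term
    term_class_count = defaultdict(Counter)  # documents of a class containing the term

    for tokens, label in zip(docs, labels):
        for term in set(tokens):             # document frequency: once per document
            term_doc_count[term] += 1
            term_class_count[term][label] += 1

    scores = {}
    for term, per_class in term_class_count.items():
        best = float("-inf")
        for label, joint in per_class.items():
            # MI(t, c) = log( P(t, c) / (P(t) * P(c)) ), estimated from counts
            mi = math.log(joint * n_docs
                          / (term_doc_count[term] * class_doc_count[label]))
            best = max(best, mi)
        scores[term] = best
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy corpus, invented for this example only.
docs = [
    ["ball", "goal", "the"],
    ["goal", "team", "the"],
    ["stock", "market", "the"],
    ["market", "bank", "the"],
]
labels = ["sport", "sport", "finance", "finance"]

scores = mutual_information_scores(docs, labels)
selected = select_top_k(scores, 6)
```

On this toy corpus the word "the", which occurs in every document, scores zero and is the only term left out of the top six. The terms "ball" (one sport document) and "goal" (two sport documents) receive the same score, which illustrates the kind of weakness the abstract attributes to a score built from document frequency alone.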
Keywords/Search Tags: Text classification, Improved mutual information, LDA model, Feature dimension reduction