A Research On Feature Extraction Applied For Text Classification

Posted on: 2015-11-18
Degree: Master
Type: Thesis
Country: China
Candidate: J R Peng
Full Text: PDF
GTID: 2298330467462143
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Text classification is one of the most important basic technologies in text mining and information retrieval, and classification performance depends heavily on the quality of text feature extraction. Finding better learning methods, such as deep learning, that yield more informative feature representations has therefore become an active research topic. The research in this paper focuses on text feature representation and selection. The main work consists of the following four parts:
1) Four classic text feature selection algorithms are studied and implemented. A revised feature selection algorithm is proposed, based on a reliability measure of each feature's statistics. The algorithm improves resistance to the random noise introduced by those statistics. Experimental results show that it improves the performance of the classic feature selection methods.
2) Text features and classification based on the output of LDA and semi-supervised LDA, treated as word embeddings, are analyzed, and a feature selection method based on LDA word embeddings is proposed and implemented. The distribution of a feature word over topics is used to estimate how useful that word is for classification. Experiments show the method to be effective.
3) Two deep-learning-based text feature extraction algorithms are implemented. The first uses the Word2vec tool to generate word embeddings, which improves classification performance. The second is based on Stacked Denoising Autoencoders (SDA). In cross-domain sentiment classification experiments, it gives better results than direct classification.
4) Text feature weighting and text feature selection algorithms based on word activation force (WAF) are proposed. A new type of feature, the associated word pair derived from WAF, is introduced, together with a feature selection method designed for it. The method takes relationship information between feature words into account and improves results in cases where the word-independence assumption of the vector space model (VSM) does not hold. Experimental results on the NEWSGROUP datasets demonstrate the effectiveness of the algorithms.
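As an illustration of the kind of classic statistical feature selection mentioned in part 1, the following is a minimal sketch of chi-square term scoring. The abstract does not name the four classic algorithms, so the choice of chi-square is an assumption made here for illustration only, and the function names `chi_square_scores` and `select_top_k` and the toy documents are hypothetical. It is not the revised algorithm proposed in the thesis.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Score each term by its chi-square statistic against one class.

    docs   : list of token lists (one per document)
    labels : list of class labels, aligned with docs
    target : the class to score terms against
    NOTE: chi-square is an assumed example of a "classic" method;
    the abstract does not say which four methods were used.
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos

    # Document frequencies of each term inside and outside the target class.
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            (df_pos if y == target else df_neg)[term] += 1

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # in class, contains term
        b = df_neg[term]   # not in class, contains term
        c = n_pos - a      # in class, lacks term
        d = n_neg - b      # not in class, lacks term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        # A zero denominator means the term carries no class information.
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (hypothetical data, for illustration only).
docs = [
    ["graphics", "image", "the"],
    ["graphics", "render", "the"],
    ["hockey", "team", "the"],
    ["hockey", "game", "the"],
]
labels = ["comp", "comp", "rec", "rec"]

scores = chi_square_scores(docs, labels, target="comp")
top = select_top_k(scores, 2)
```

In this toy corpus, a term that appears in every document ("the") scores zero, while terms confined to one class score highest. Statistics estimated from very few documents, such as the score for "image" here, are the kind of noisy estimate that a reliability measure would aim to discount.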
Keywords/Search Tags: text classification, text representation, feature selection, WAF, word embedding