Research on Text Classification Based on a Hybrid Model with Improved TFIDF

Posted on: 2017-04-10; Degree: Master; Type: Thesis
Country: China; Candidate: D Chen; Full Text: PDF
GTID: 2308330488985674; Subject: Computer application technology
Abstract/Summary:
Text classification is a key technology for identifying needed information accurately and rapidly in large, messy collections of text. Before text is sent to a classifier it must be preprocessed, which includes text segmentation, stop-word elimination, feature selection, and feature extraction. Feature selection and feature extraction remove noise from the text and reduce the dimensionality of the text feature space. This step is crucial because it directly affects classification accuracy. Focusing on feature selection and feature extraction, this thesis proposes a hybrid model that combines the vector space model with a topic model, so that the feature vector of a text carries as much category information as possible while its dimensionality is reduced. The work of this thesis is as follows.

First, the TFIDF algorithm is improved. The coefficient of variation is introduced, yielding an improved method, TFIDFCV. The method uses the coefficient of variation as a weighting factor, takes into account how a feature word is distributed both between classes and within classes, and adjusts the TFIDF weight of each feature item accordingly. This avoids a disadvantage of traditional TFIDF, which ignores the between-class and within-class distribution of feature items, so features can be selected from text more effectively.

Second, a hybrid model is put forward. Extracting features from text with the LDA topic model reduces the dimensionality of the feature space. By modelling nouns, verbs, and other words separately, part-of-speech information is used effectively, giving a part-of-speech integrated LDA model, PST-LDA. The text corpus is then processed with both the PST-LDA model and the TFIDFCV method, combining information such as word frequency, part of speech, and topic to obtain features that carry more information.

Third, the improvements are verified experimentally with two groups of experiments. The first compares text classification with TFIDF and with TFIDFCV using a support vector machine; the macro F1 value of TFIDFCV is 1.21% higher than that of TFIDF. The second compares the classification performance of LDA, PST-LDA, and the combination of TFIDFCV with PST-LDA; the macro F1 value of the combined method is 1.1% higher than that of PST-LDA and 0.92% higher than that of LDA, and the time spent on modelling is less than half that of the LDA method.
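The abstract describes TFIDFCV only in words and does not give its formula. To make the idea concrete, the following Python sketch shows one plausible reading, not the thesis's actual definition: the usual TF-IDF weight is multiplied by (1 + CV), where CV is the coefficient of variation (standard deviation divided by mean) of the fraction of documents in each class that contain the term. The function names, the (1 + CV) form, and the toy corpus are all assumptions made for illustration, and the sketch covers only the between-class distribution, not the within-class one.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m > 0 else 0.0


def tfidf_cv_weights(docs, labels):
    """Return (weights, cv): one {term: weight} dict per document, plus the
    per-term coefficient of variation.

    ASSUMED formula (not taken from the thesis):
        weight = tf * idf * (1 + cv)
    where cv is computed over the term's per-class document rates.
    """
    n_docs = len(docs)
    classes = sorted(set(labels))
    class_sizes = Counter(labels)

    doc_freq = Counter()                   # documents containing each term
    class_doc_freq = defaultdict(Counter)  # same count, split by class
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            class_doc_freq[label][term] += 1

    # A term spread evenly over the classes gets cv near 0;
    # a term concentrated in one class gets a larger cv.
    cv = {}
    for term in doc_freq:
        rates = [class_doc_freq[c][term] / class_sizes[c] for c in classes]
        cv[term] = coefficient_of_variation(rates)

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens))
            * math.log(n_docs / doc_freq[term])
            * (1.0 + cv[term])
            for term, count in counts.items()
        })
    return weights, cv


# Toy corpus (hypothetical): documents are already segmented into tokens.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "score"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]

weights, cv = tfidf_cv_weights(docs, labels)
print(sorted(weights[0].items(), key=lambda kv: -kv[1]))
```

In this toy corpus "goal" appears only in the sports documents while "team" appears in both classes, so "goal" receives the larger coefficient of variation and therefore the larger boost, which is the behaviour the abstract attributes to a class-aware weighting.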
Keywords/Search Tags: Feature selection, Feature extraction, TFIDFCV, PST-LDA