Researches On Text Classification Based On A Hybrid Model With Improved TFIDF

Posted on:2017-04-10

Degree:Master

Type:Thesis

Country:China

Candidate:D Chen

GTID:2308330488985674

Subject:Computer application technology

Abstract/Summary:

Text classification is a key technology for identifying needed information in large, messy text collections accurately and rapidly. Before text is sent to a classifier it must be preprocessed, which includes text segmentation, stop-word elimination, feature selection, and feature extraction. Feature selection and feature extraction remove noisy data from the text and reduce the dimension of the text feature space. This step is crucial because it directly affects classification accuracy. Building on feature selection and feature extraction, this paper proposes a hybrid model that combines the vector space model with a topic model, so that the feature vector of a text carries as much category information as possible while its dimension is reduced. The work of this paper is as follows. First, the TFIDF algorithm is improved. The coefficient of variation is introduced, yielding an improved method called TFIDFCV. The method uses the coefficient of variation as a weighting factor, takes into account how a feature word is distributed both between classes and within classes, and adjusts the TFIDF weight of each feature item accordingly. This avoids the drawback of traditional TFIDF, which does not consider the between-class and within-class distribution of feature items, so features can be selected from text more efficiently. Second, a hybrid model is put forward. Extracting features from text with the LDA topic model reduces the dimension of the feature space. By modelling nouns, verbs, and other words separately, part-of-speech information is used effectively to build a part-of-speech-integrated LDA model, PST-LDA. The PST-LDA model and the TFIDFCV method are then applied together to the text corpus, combining information such as word frequency, part of speech, and topic to obtain features that carry more information. Third, the improvements are verified experimentally in two groups of experiments. The first group compares text classification results obtained with TFIDF and with TFIDFCV under a support vector machine; the results indicate that macro F1 is 1.21% higher with TFIDFCV than with TFIDF. The second group compares the classification performance of LDA, PST-LDA, and the combination of TFIDFCV and PST-LDA; the results suggest that macro F1 for the combined method is 1.1% higher than for PST-LDA and 0.92% higher than for LDA, and that its modelling time is less than half that of the LDA method.
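
The abstract does not give the TFIDFCV formula, so the following Python sketch only illustrates the general idea of scaling a TFIDF weight by a coefficient of variation. It rests on assumptions that are not taken from the thesis: the weight is computed as tf * idf * (1 + cv), cv is measured only on the term's document-frequency rate across classes (the within-class part is omitted), and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance) / mean


def tfidf_cv(docs, labels):
    """Return one {term: weight} dict per document.

    weight = tf * idf * (1 + cv). This combination is an assumption made for
    illustration; it is not the formula from the thesis.
    """
    n_docs = len(docs)
    classes = sorted(set(labels))
    class_sizes = Counter(labels)

    doc_freq = Counter()                    # documents containing each term
    class_doc_freq = defaultdict(Counter)   # same count, split by class
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            class_doc_freq[label][term] += 1

    # Between-class spread of each term: fraction of documents in each class
    # that contain the term, summarised by the coefficient of variation.
    cv = {}
    for term in doc_freq:
        rates = [class_doc_freq[c][term] / class_sizes[c] for c in classes]
        cv[term] = coefficient_of_variation(rates)

    weighted = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weighted.append({
            term: (count / total) * math.log(n_docs / doc_freq[term]) * (1 + cv[term])
            for term, count in counts.items()
        })
    return weighted


# Toy corpus (hypothetical): "goal" occurs only in one class, while "news"
# is spread evenly over both classes. Both terms have the same tf and idf.
docs = [["goal", "news"], ["goal", "match"], ["stock", "news"], ["stock", "price"]]
labels = ["sport", "sport", "finance", "finance"]
weights = tfidf_cv(docs, labels)
```

In this toy corpus, "goal" and "news" would receive identical plain TFIDF weights in the first document. Under the assumed weighting, the class-specific term "goal" ends up with a larger weight than the evenly spread term "news", which is the kind of between-class sensitivity the abstract attributes to TFIDFCV.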

Keywords/Search Tags:

Feature selection, Feature extraction, TFIDFCV, PST-LDA

Related items

1	Design And Implementation Of Feature Extraction System For Large-Scale Structured Data
2	Handwritten Numeral Recognition, Feature Extraction And Feature Selection
3	Study Of Graph-based Feature Extraction And Feature Selection With Their Applications
4	Feature Extraction And Selection Of Ultrasound Images Of Live Cancer
5	Research On Some Feature Extraction Methods
6	Research Of Feature Extraction And Selection Algorithm In Capsule Endoscopy Images
7	Feature Selection Based On Feature Extraction
8	Feature Extraction And Feature Fusion For Content-Based Image Retrieval
9	Research On Pulmonary CT Image Analysis And Feature Extraction
10	Research On Feature Extraction And Feature Selection For Hyperspectral Remote Sensing Data