
Research On Feature Extraction And Classification Algorithm In Text Categorization

Posted on: 2019-10-19
Degree: Master
Type: Thesis
Country: China
Candidate: H F Li
Full Text: PDF
GTID: 2428330548976808
Subject: Management Science and Engineering

Abstract/Summary:
The gradual maturity of the Internet and the development of social media such as micro-blog have greatly changed the way people live, and more and more users publish information, view real-time information, and evaluate information through the network. The rapid expansion of network traffic leads to a rapid increase in network data, which creates a contradiction between the speed at which users can find the information they need and the sheer amount of data available. Text categorization is one of the effective ways to handle and reduce information overload, and feature extraction and classification modeling are its two most important tasks.

Chinese text exhibits problems such as polysemy and synonymy, which makes semantic methods effective for feature extraction. In addition, boundary samples and class-center samples contribute differently to classification, and enhancing the role of boundary samples is beneficial to classification performance. At the same time, traditional classification modeling usually relies on a single classifier; the characteristics of a single classifier lead to different classification effects in different application scenarios, so a single classifier loses part of the sample information in some scenarios. In view of the above problems, this paper carries out the following research:

(1) In feature extraction, to address the problem that the Sprinkling semantic feature extraction method does not fully consider the class weight of samples, this paper presents a K-Sprinkling feature extraction method based on a sample weight function, where the weight function represents the contribution of each sample to its class. The method uses the Cauchy distribution membership function as the sample class weight and, after analyzing the shortcomings of that membership function, improves it to construct a sample class weight function that combines sample membership and sample tightness; the resulting sample weights are then integrated into Sprinkling. In the feature vector space, singular value decomposition is used to mine deeper-level features, completing the text classification task.

(2) In classification modeling, considering that different feature extraction methods focus on different aspects of the text, a text classification method based on multi-type classifier fusion is proposed. The method combines word2vec, principal component analysis, latent semantic indexing, and TF-IDF as the feature extraction stage for multi-type classifier fusion. To address the problem that weighted voting over multi-type classifiers ignores class information, a class-weighted classifier weight calculation method is proposed: the classification accuracy of each classifier on the training set is taken as the weight of the corresponding sample class, and these weights are then used in the voting decision to complete text classification modeling.

The experimental results show that the methods used in this paper are more effective than the feature extraction methods they improve upon, with clear improvements in accuracy, recall, and F1 values, and they also achieve good classification results on unbalanced data sets and on data sets from specific scenarios. The work also provides support for text classification in other fields.
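The abstract does not give the exact form of the improved Cauchy membership function or of the tightness term, so the following is only a minimal Python sketch of the idea: each sample of a class receives a weight that decays with its distance from the class centre (Cauchy-style membership) and grows with how densely it is surrounded by same-class neighbours (tightness). The function names, the k-nearest-neighbour definition of tightness, and the convex combination via `alpha` are assumptions, not the thesis's formulas.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cauchy_membership(X, gamma=1.0):
    """Cauchy-style membership of each sample in its own class:
    samples far from the class centre get smaller values.
    X is assumed to be a dense feature matrix of one class's samples;
    gamma is a hypothetical scale parameter."""
    center = X.mean(axis=0)
    d2 = np.sum((X - center) ** 2, axis=1)
    return 1.0 / (1.0 + gamma * d2)

def tightness(X, k=5):
    """Hypothetical tightness term: mean distance of each sample to its
    k nearest same-class neighbours, mapped into (0, 1]."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)          # ignore self-distance
    knn = np.sort(D, axis=1)[:, :k]      # k closest same-class samples
    return 1.0 / (1.0 + knn.mean(axis=1))

def class_sample_weights(X, alpha=0.5, gamma=1.0, k=5):
    """Combine membership and tightness into one weight per sample.
    The convex combination via alpha is an assumption; the abstract
    does not state how the two terms are merged."""
    return alpha * cauchy_membership(X, gamma) + (1 - alpha) * tightness(X, k)
```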
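Sprinkling augments the term-document matrix with artificial class-label columns before latent semantic analysis; one reading of K-Sprinkling, consistent with the abstract, is that these sprinkled columns are scaled by the sample class weights above before singular value decomposition. The sketch below follows that reading; the `boost` parameter and the exact scaling are hypothetical.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def sprinkle(X, y, sample_w, classes, boost=1.0):
    """Append one artificial 'class term' column per class to the
    TF-IDF matrix X; row i gets a non-zero entry only in the column of
    its own class, scaled by its sample weight.  Scaling by the sample
    weight is our reading of K-Sprinkling; plain Sprinkling uses a
    constant for every sample."""
    cols = np.zeros((X.shape[0], len(classes)))
    for i, (label, w) in enumerate(zip(y, sample_w)):
        cols[i, classes.index(label)] = boost * w
    return hstack([X, csr_matrix(cols)])

# Hypothetical usage (docs, y and precomputed weights are assumed):
# X = TfidfVectorizer().fit_transform(docs)
# Xs = sprinkle(X, y, weights, classes=sorted(set(y)))
# Z = TruncatedSVD(n_components=100).fit_transform(Xs)  # latent features
```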
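For the second contribution, a minimal sketch of class-weighted classifier fusion, assuming scikit-learn pipelines as the member classifiers: each member is trained on its own feature representation, its per-class accuracy on the training set serves as a class-dependent voting weight, and the fused prediction is the class with the largest weighted vote. The choice of member models, the use of truncated SVD as a stand-in for latent semantic indexing, and the omission of the word2vec and PCA channels are simplifications, not the thesis's exact setup.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def build_members():
    """Member pipelines over different representations (illustrative)."""
    return [
        make_pipeline(TfidfVectorizer(), MultinomialNB()),
        make_pipeline(TfidfVectorizer(), TruncatedSVD(100), LogisticRegression(max_iter=1000)),
        make_pipeline(TfidfVectorizer(), TruncatedSVD(100), LinearSVC()),
    ]

def class_weights(members, docs, y, classes):
    """Per-classifier, per-class accuracy on the training set,
    used as voting weights (our reading of the abstract)."""
    W = []
    for m in members:
        pred = m.predict(docs)
        W.append({c: float(np.mean(pred[y == c] == c)) for c in classes})
    return W

def fuse_predict(members, W, docs, classes):
    """Class-weighted vote: each member's vote for its predicted class
    counts with the weight it earned for that class."""
    votes = np.zeros((len(docs), len(classes)))
    for m, w in zip(members, W):
        pred = m.predict(docs)
        for j, c in enumerate(classes):
            votes[:, j] += w[c] * (pred == c)
    return np.array(classes)[votes.argmax(axis=1)]

# Hypothetical usage (docs_train, y_train, docs_test are assumed):
# members = [m.fit(docs_train, y_train) for m in build_members()]
# classes = sorted(set(y_train))
# W = class_weights(members, docs_train, np.array(y_train), classes)
# y_pred = fuse_predict(members, W, docs_test, classes)
```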
Keywords/Search Tags:text classification, feature extraction, sample weight, feature extraction method fusion