A Study Of Text Classification Algorithms Based On Feature Selection

Posted on: 2019-06-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X J Hu
GTID: 1368330572950442
Subject: Computer software and theory
Abstract/Summary:
As new media such as e-mail, microblogs, WeChat, and online shopping platforms become more deeply embedded in people's social lives, electronic information resources of many kinds, including text, images, video, and voice, are growing at an explosive rate. As people are exposed to more numerous and more diverse kinds of information, they also face the problem that the volume of data is too large for them to find the information they need most. How to organize, manage, and store these data more efficiently, and how to search, analyze, and mine information that meets people's needs accurately and rapidly, have become challenges in the field of computer science.

In this dissertation, we studied the research status and related theoretical techniques of text classification and sentiment classification and analyzed the current problems and difficulties in these fields. With the aim of improving the performance, efficiency, and user personalization of text classification and sentiment classification, we first proposed a feature selection method for reducing feature dimensionality in massive text collections. Second, using key feature selection, machine learning methods, and a method for constructing user interest sets, we proposed a two-class spam classification method for spam filtering and a microblog news classification method for microblog news recommendation. Finally, we proposed a sentiment classification method for short microblog texts to address the problem of determining subjective attitude in microblog text for sentiment analysis. The main research contents and innovations of this dissertation are summarized as follows:

1. To address the problem of eliminating noisy data in text classification, we proposed a parallelized noise elimination algorithm that performs feature reduction in two stages. Because massive text collections contain various kinds of redundant and erroneous noisy data, we identified and processed noise features, noise categories, and noise classifiers to reduce the feature dimensionality and improve text classification performance. First, we performed an initial feature selection on the text vectors using an improved Principal Component Analysis (PCA) method. Second, the TF-IDF method was used to perform a secondary feature selection according to the term frequency and document frequency of each feature. These two feature selection steps mainly filter out redundant noise features, and the algorithm is parallelized with MapReduce. Finally, a text classification algorithm based on MapReduce parallel processing was proposed to detect and delete erroneous noise features. Experiments were carried out on two commonly used datasets, Reuters-21578 and 20 Newsgroups. The results show that, compared with other noise elimination algorithms, the proposed algorithm reduces the running time of text classification and improves both the noise feature elimination rate and the classification accuracy. In particular, when the noise ratio of the texts decreases, the algorithm still maintains good and stable classification performance.

2. To better meet users' requirements for spam filtering, we constructed user interest sets based on the selection of key features and proposed a two-class spam classification algorithm using active learning and negative selection. First, we used a feature selection method based on a binomial assumption to reduce the feature dimensionality of the spam dataset and used the resulting labeled key features to establish the user's positive and negative interest sets. Second, the bi-directional user interest sets were used to improve the detector in the negative selection algorithm, and the improved negative selection algorithm was used to enhance the sampling engine in the active learning method. Together, the two interest sets are used to improve classification performance on the spam dataset. Finally, the key features that cannot be identified by the bi-directional user interest sets are restored to the mail text and then output for the user to label, which greatly reduces the amount of labeling required from the user. Experiments were performed on six common spam datasets: PU1, PU2, PU3, PUA, Lingspam, and Spambase. The results show that, under three classification performance evaluation criteria, the proposed algorithm achieves better performance and requires fewer user labels. Converting the user's personal preferences into bi-directional user interest sets helps improve the performance of the classification algorithm.

3. To address the low accuracy and poor recommendation diversity of personalized news recommendation, we proposed a personalized news recommendation algorithm based on bi-directional user interest sets. First, we used an active learning method to select the key features of texts from the user's reading history. Second, based on the resulting key feature set, we established a positive user interest feature set and a positive user interest category set to model the user's preferences. The negative user interest category set is built from the positive user interest category set. An anomaly detection mechanism is used to ensure the diversity of news recommendations. Finally, we selected real news corpora as experimental material and compared the proposed algorithm with representative news recommendation methods in terms of accuracy, running time, and diversity. The results show that, in addition to maintaining stable recommendation accuracy, the proposed method achieves higher recommendation diversity.

4. Text sentiment classification is an important topic in natural language processing (NLP) and text mining. Automatically identifying different types of sentiment can benefit many NLP systems, such as commentary digests and public media analysis. We proposed a microblog sentiment classification method based on transductive transfer learning (TTL) for the sentiment analysis of microblog text. First, an emoticon polarity list is generated from the emoticons in the microblog text, and a new key feature selection method based on principal component analysis (PCA) is designed for the microblog text set. Second, we built a new sentiment lexicon, SL, from existing traditional sentiment dictionaries. Using the TTL method, the sentiment lexicon SL is set as the source domain, and the unlabeled microblog texts are set as the target domain for sentiment classification. Finally, an independent experiment and an actual experiment were carried out. The experimental results show that, compared with other sentiment classification methods, the proposed algorithm is feasible and achieves better sentiment classification performance.
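The secondary, TF-IDF-based feature selection described in item 1 can be illustrated with a minimal sketch. The abstract does not give the exact scoring rule, so this example assumes that each term is ranked by its TF-IDF weight summed over all documents and that the top-ranked terms are kept. The function name `tfidf_select` and the toy documents are hypothetical, and the improved PCA stage and the MapReduce parallelization are omitted.

```python
import math
from collections import Counter


def tfidf_select(docs, top_k):
    """Keep the top_k terms ranked by TF-IDF weight summed over all documents.

    docs is a list of token lists. The summed-weight ranking is an assumption
    made for illustration, not the scoring rule used in the dissertation.
    """
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    # Accumulate tf * idf per term, with tf normalized by document length.
    scores = Counter()
    for tokens in docs:
        if not tokens:
            continue
        tf = Counter(tokens)
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])
            scores[term] += (count / len(tokens)) * idf

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:top_k]


docs = [
    ["cheap", "pills", "buy", "now"],
    ["meeting", "agenda", "now"],
    ["buy", "cheap", "tickets", "now"],
]
selected = tfidf_select(docs, top_k=4)
print(selected)
```

In this toy corpus the term "now" occurs in every document, so its inverse document frequency is zero and it is filtered out as a redundant feature, which is the kind of reduction the secondary selection step aims for.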
Keywords/Search Tags: text mining, text classification, sentiment analysis, sentiment classification, key feature selection, bi-directional user interest sets, machine learning