
Research And Application Of Feature Dimension Reduction Algorithm In Text Classification

Posted on: 2019-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: N N Liu
GTID: 2348330569495539
Subject: Engineering
Abstract/Summary:
In recent years, text classification has faced a serious challenge: as the Internet expands, text data become increasingly high-dimensional and sparse. Feature reduction has therefore become a popular research field for coping with explosive data growth. It plays an important role in optimizing text classification, because it selects or extracts, from the full feature set, feature subsets that are strongly correlated with the classes and have little redundancy among features, thereby reducing the dimension of the feature space. Feature reduction methods fall into three categories: Filter, Wrapper and Embedded. The Filter method is computationally efficient and uses a simple feature evaluation model, but it scores each feature in isolation and ignores the possibility that combining different features may give better results. The Wrapper method can generate feature sets that yield high classification accuracy, but its high computational cost prevents it from being widely applied.

Building on a study of how cluster validity indices can be applied to text classification, this thesis proposes a feature dimensionality reduction algorithm based on cluster validity indices, named WB-Index Sequential Forward Selection (WBI-SFS). WBI-SFS is a Filter method, since it does not rely on any specific classifier to evaluate feature subsets. It combines the low time overhead of the Filter method with high classification accuracy. Its innovations are as follows. First, efficient, linear validity indices replace traditional Filter criteria or classification algorithms as the measure for evaluating feature subsets, and they also replace the classifier-based cross-validation step of the Wrapper method, which reduces the computational cost. Second, the algorithm uses sequential forward search to traverse the whole feature set and iteratively generate candidate feature subsets. This traversal-based search is theoretically simple, widely applicable and highly general. Combining the WB-index with a specific search method addresses the time cost of searching for the optimal feature subset and of repeatedly evaluating feature subsets on high-dimensional, sparse data.

Experiments on two different types of data sets indicate that the WBI-SFS algorithm performs better in both classification accuracy and efficiency on text and non-text data sets. Finally, based on the WBI-SFS algorithm, a prototype network content recognition system, called the "Net Cloud" network purification system, is designed and implemented. It performs network traffic analysis, traffic cleaning, content recognition and filtering under a unified strategy and application rules. The core function of the system is to automatically identify, classify, filter and block webpages that contain unhealthy information, so as to guide minors to use the network properly and protect them from harmful external information.
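The selection loop described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It rests on several assumptions: the WB-index takes its commonly cited form M * SSW / SSB (M clusters, within-cluster and between-cluster sums of squares), the class labels serve as the partition, lower scores are better, and the search stops after a fixed number of features `k`. The function names `wb_index` and `wbi_sfs` are hypothetical.

```python
import numpy as np


def wb_index(X, y):
    """Assumed WB-index form: M * SSW / SSB, using class labels as the partition.

    Lower values mean compact, well-separated classes.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    ssw = 0.0  # within-class sum of squared distances to the class centroid
    ssb = 0.0  # between-class sum of squares, weighted by class size
    for c in classes:
        Xc = X[y == c]
        centroid = Xc.mean(axis=0)
        ssw += ((Xc - centroid) ** 2).sum()
        ssb += len(Xc) * ((centroid - overall_mean) ** 2).sum()
    if ssb == 0.0:
        # The classes are indistinguishable on these features.
        return float("inf")
    return len(classes) * ssw / ssb


def wbi_sfs(X, y, k):
    """Greedy sequential forward selection scored by the index above.

    The stopping rule (a fixed k) is an assumption made for this sketch.
    """
    X = np.asarray(X, dtype=float)
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < k:
        # Score every candidate subset formed by adding one remaining feature.
        scores = [wb_index(X[:, selected + [j]], y) for j in remaining]
        best = remaining[int(np.argmin(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Toy data: only column 1 separates the two classes.
    X = [[0.0, 5.0, 1.0],
         [1.0, 5.1, 0.0],
         [0.0, 9.0, 1.0],
         [1.0, 9.1, 0.0]]
    y = [0, 0, 1, 1]
    print(wbi_sfs(X, y, k=1))
```

No classifier is trained inside the loop, so each candidate subset costs one pass over the data. This is the property the abstract relies on when it classifies WBI-SFS as a Filter method.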
Keywords/Search Tags:Text classification, Feature reduction, Cluster validity index, Heuristic algorithms