Research And Application Of Feature Dimension Reduction Algorithm In Text Classification

Posted on:2019-01-31

Degree:Master

Type:Thesis

Country:China

Candidate:N N Liu

GTID:2348330569495539

Subject:Engineering

Abstract/Summary:

In recent years, the expansion of the Internet has made text data increasingly high-dimensional and sparse, which poses a serious challenge to text classification technology. Feature reduction algorithms have therefore become a popular research field for coping with explosive data growth. Feature reduction plays an important role in optimizing text classification: it selects or extracts, from the original feature set, feature subsets that are strongly correlated with the classes and have little redundancy among their features, thereby reducing the dimensionality of the feature space. Feature reduction methods fall into three categories: Filter, Wrapper and Embedded. The Filter method is computationally efficient and uses a simple feature evaluation model, but it evaluates each feature on its own and ignores the possibility that combining different features may give better results. The Wrapper method can generate feature sets that yield high classification accuracy, but its high computational cost makes it difficult to apply widely.

This thesis studies the application of cluster validity indices to text classification and proposes a feature dimensionality reduction algorithm based on them, named WB-Index Sequential Forward Selection (WBI-SFS). Because WBI-SFS does not rely on any specific classifier to evaluate feature subsets, it is a Filter method. It combines the low computational overhead of the Filter method with high classification accuracy. Its innovations are as follows. First, an efficient, linear-time validity index replaces the traditional Filter criteria or classification algorithms as the measure for evaluating feature subsets, and it also replaces the classifier-based cross-validation used in the Wrapper method, which reduces the computational cost. Second, the index is combined with sequential forward search, which traverses the whole feature set and iteratively generates candidate feature subsets. This traversal-based search is simple in theory, widely applicable and general. Combining the WB-index with a specific search method addresses two time-consuming steps on high-dimensional, sparse data: searching for the optimal feature subset and iteratively evaluating candidate subsets.

Experiments on two different types of data sets show that the WBI-SFS algorithm achieves better classification performance and efficiency on both text and non-text data sets. Finally, based on the WBI-SFS algorithm, a network content recognition prototype system called the "Net Cloud" network purification system is designed and implemented. It performs network traffic analysis, traffic cleaning, content recognition and filtering based on a unified strategy and application rules. The core function of the system is to automatically identify, classify, filter and block webpages that contain unhealthy information, so as to guide minors to use the network properly and to shield them from harmful external information.
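
The idea of pairing a validity index with sequential forward selection can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the exact formulation, so the sketch assumes the commonly cited form of the WB-index, k * SSW / SSB (number of groups times the within-group sum of squares divided by the between-group sum of squares, lower is better), and assumes that the class labels serve as the partition being scored. The function names `wb_index` and `wbi_sfs` and the stopping rule (a fixed number of features) are hypothetical.

```python
import numpy as np


def wb_index(X, y):
    """Assumed WB-style index: k * SSW / SSB, where lower is better.

    X is an (n_samples, n_features) array restricted to a candidate
    feature subset; y holds the class labels used as the partition.
    """
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    ssw = 0.0  # within-class sum of squared distances to class centroids
    ssb = 0.0  # between-class sum of squares, weighted by class size
    for c in classes:
        Xc = X[y == c]
        centroid = Xc.mean(axis=0)
        ssw += ((Xc - centroid) ** 2).sum()
        ssb += len(Xc) * ((centroid - overall_mean) ** 2).sum()
    if ssb == 0.0:
        # No separation between classes on this subset: worst possible score.
        return float("inf")
    return len(classes) * ssw / ssb


def wbi_sfs(X, y, n_features):
    """Greedy sequential forward selection scored by wb_index.

    Starts from an empty subset and, at each step, adds the single
    remaining feature that gives the lowest index value.
    No classifier is trained at any point, so this is a Filter-style search.
    """
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        scores = [wb_index(X[:, selected + [j]], y) for j in remaining]
        best = remaining[int(np.argmin(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Toy data: feature 0 separates the two classes, feature 1 does not.
    X = np.array([[0.0, 5.0], [0.1, 1.0], [1.0, 5.2], [1.1, 0.9]])
    y = np.array([0, 0, 1, 1])
    print(wbi_sfs(X, y, n_features=1))
```

On the toy data the search picks feature 0 first, because it gives compact, well-separated classes and therefore a much lower index value than feature 1. Each index evaluation is a single pass over the samples, which is where the cost advantage over classifier-based cross-validation in a Wrapper method would come from.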

Keywords/Search Tags:

Text classification, Feature reduction, Cluster validity index, Heuristic algorithms

Related items

1	Research And Application Of Text Classification Based On Heuristic Algorithm
2	Research On New Cluster Validity Index For Overlapping Datasets In Cluster Analysis
3	Class Equality Cluster Validity Index And Cluster Filter K-Means Algorithm
4	Research On The New Validity Index Of Internal Clustering And The Method To Determine The Optimal Cluster Number
5	Research On Connectivity-based Cluster Validity
6	Research Of New Clustering Validity Index In Cluster Analysis
7	The Research And Comparative Analysis Of Cluster Validity Index
8	Research Of Network Service-based Classification Technology And Platform Building
9	Research Of Fuzzy Clustering Algorithm And Cluster Validity Index
10	A Cluster Validity Index Based On Binary Tree Nearest Neighborhood