Research On Feature Selection Based K-means Algorithm In Text Classification

Posted on: 2015-09-13
Degree: Master
Type: Thesis
Country: China
Candidate: C Chen
GTID: 2308330464468932
Subject: Computer system architecture
Abstract/Summary:
Information technology has developed rapidly since the start of the 21st century, and the Internet has gradually become an enormous body of information. How to manage and organize this data, and how to find valuable content quickly and accurately, is now a major problem in information science and technology. Text classification is a key technology for solving these problems; it has great practical value and has attracted wide attention. It involves a variety of techniques, among which feature selection is a key one: it is important for increasing running speed, reducing computational complexity, and improving classification efficiency. This thesis focuses on feature selection algorithms for text classification.

Feature selection methods are mainly divided into Filter methods and Wrapper methods. Filter methods are fast and general, but their accuracy is low because they are independent of the learning algorithm. Wrapper methods achieve high classification accuracy, but they are computationally expensive and generalize poorly. The two are therefore often combined so that they complement each other. Filter methods such as DF (document frequency), IG (information gain), MI (mutual information), ECE (expected cross entropy), and CHI (chi-square) are commonly used for feature selection, coupled with a Wrapper method in the validation step. These methods first construct an evaluation function, then compute a score for every feature in the original feature set, and finally select the top n features. In Chinese text classification the feature space has more dimensions than in English, so this large-scale statistical computation is very costly.

This thesis proposes a new feature selection algorithm that does not need to construct an evaluation function. By applying k-means clustering to feature selection, it greatly reduces the time that feature selection takes. The k-means algorithm is initialized according to the maximum-minimum principle, which solves the problem of randomly chosen initial samples, and it is combined with a Wrapper method that uses classifier performance to evaluate the selected feature subsets.
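The maximum-minimum initialization mentioned above can be illustrated with a minimal sketch. It is not the thesis's implementation (the thesis system is written in Java; Python is used here only for brevity). The function names `euclidean` and `maxmin_init`, the choice of the first point as the starting center, and the representation of each feature as a numeric vector are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def maxmin_init(points, k, dist=euclidean):
    """Return the indices of k initial centers chosen by the max-min rule.

    Each new center is the point whose distance to its nearest
    already-chosen center is largest, so the result is deterministic
    instead of depending on a random draw.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    centers = [0]  # assumption: the first point seeds the procedure
    # min_dist[i] holds the distance from point i to its nearest chosen center.
    min_dist = [dist(p, points[0]) for p in points]

    while len(centers) < k:
        # Pick the point that is farthest from all centers chosen so far.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        centers.append(nxt)
        # Update nearest-center distances with the newly added center.
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], dist(p, points[nxt]))
    return centers


# Hypothetical toy data: five feature vectors in two dimensions.
points = [(0, 0), (1, 0), (10, 0), (10, 1), (0, 10)]
print(maxmin_init(points, 3))  # [0, 3, 4]
```

The chosen centers would then seed the usual k-means assignment and update loop in place of randomly drawn samples.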
The experiments examine how two distance measures, Euclidean distance and cosine distance, affect the k-means method. They show that, when k-means is used for feature selection, cosine distance is the more suitable measure for computing text similarity. The thesis also discusses the best values for the number of cluster centers k and for the number of features v chosen per cluster center. Because Chinese and English differ, the optimal values of k and v depend heavily on the type and size of the corpus.

The thesis further studies the new method in text classification experiments and compares its effect with that of several common methods that combine IG, MI, or ECE with document frequency. Using a Wrapper approach with BP network, naive Bayes, and SVM classifiers for training, it compares the resulting classification performance. The experimental results show that k-means feature selection is effective for both Chinese and English text.

The text classification system is implemented in Java, and software was designed to verify the algorithm and so confirm the effectiveness and feasibility of the method. The system is divided into three main modules: text preprocessing; construction and evaluation of the classifier model; and classification of unknown text.
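To make the roles of cosine distance, k, and v more concrete, here is a small sketch under stated assumptions. It assumes that each feature is a numeric vector, that the k cluster centers are already available, and that "choosing v features per cluster center" means keeping the v members nearest to each center; the abstract does not spell out these details, and the names `cosine_distance` and `select_features` are hypothetical. Python is used for brevity, whereas the thesis system is in Java.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: a zero vector is treated as unrelated
    return 1.0 - dot / (norm_a * norm_b)


def select_features(vectors, centers, v, dist=cosine_distance):
    """Assign each feature vector to its nearest center, then keep the
    v members closest to each center. Returns sorted feature indices."""
    clusters = {j: [] for j in range(len(centers))}
    for idx, vec in enumerate(vectors):
        dists = [dist(vec, c) for c in centers]
        best = min(range(len(centers)), key=lambda j: dists[j])
        clusters[best].append((dists[best], idx))

    selected = []
    for members in clusters.values():
        members.sort()  # nearest to the center first
        selected.extend(idx for _, idx in members[:v])
    return sorted(selected)


# Hypothetical toy data: five feature vectors and k = 2 centers.
vectors = [(1, 0), (2, 0.1), (5, 5), (0, 1), (0.1, 3)]
centers = [(1, 0), (0, 1)]
print(select_features(vectors, centers, v=1))  # [0, 3]
```

Cosine distance ignores vector length, so `(1, 1)` and `(10, 10)` are at distance approximately zero even though their Euclidean distance is large. This property is one common reason cosine distance is preferred for text vectors, whose magnitudes vary with document length.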
Keywords/Search Tags: text classification, feature selection, k-means