
The Text Categorization Algorithm Based On Nearest Subspace Search

Posted on: 2015-12-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
GTID: 2298330452453402
Subject: Computer Science and Technology
Abstract/Summary:
Text categorization is a supervised learning task in which a computer automatically assigns labels to texts according to given criteria. It draws on many techniques from machine learning and data mining, chiefly text representation, feature selection, classification models, and evaluation methods. Commonly used methods at present include the Naive Bayes classifier, support vector machines (SVM), and K-nearest neighbor (KNN). The nearest neighbor method is a special case of KNN: it finds the training sample nearest to the test sample and assigns the test sample that neighbor's category. Because it decides purely by closest distance, its classification accuracy is susceptible to noisy data. Moreover, when the training set contains many documents, classifying each new sample incurs a large computational overhead, which slows classification. This thesis extends existing nearest-neighbor-based text classification. The main research work covers the following aspects:

1) We study different methods of feature extraction and feature weighting. The thesis reviews common feature extraction methods and tries a combined method that pairs document frequency (DF) with the chi-square test (CHI). By drawing on the advantages of both DF and CHI, it selects key words better and obtains better classification results. With the combined method, we also ran comparison experiments on how the chosen range of DF values affects classification. In addition, for two-class text classification we used a new weighting method, the product of term frequency and relevance frequency, which improves classification results significantly.

2) We study the nearest subspace search model. Nearest subspace search is a recently proposed model analysis method. The basic idea is to select a set of vectors that represents the important information of similar or related data, map that set of vectors to a point in a higher-dimensional space, and then solve a nearest neighbor problem in that space. Using the model raises two key problems. The first is how to represent text information as a subspace. The second is how to convert nearest subspace search into nearest neighbor search. For the first, we use the vector space model: each text is represented by a vector, and the set of texts in one category by a matrix. We then decompose that matrix by singular value decomposition (SVD) to obtain its characteristic matrix, which serves as the text feature subspace. For the second, we define a set of mapping functions that map subspaces and queries to points in a higher-dimensional space, and then perform nearest neighbor search there.

3) We apply the nearest subspace search algorithm to text classification. The experimental data set is large, classification needs a lot of memory, and the text dimension after feature extraction is still large relative to the memory available during classification. Therefore, before classifying with the nearest subspace search algorithm, we reduce the dimension of the sample space by principal component analysis (PCA). We ran a number of text categorization experiments with both traditional nearest neighbor search and nearest subspace search. Experiments on the Reuters-21578 data set show that our method effectively improves text categorization performance, with higher accuracy, recall, and F1 value.
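
The subspace representation described in aspect 2) can be illustrated with a minimal sketch, assuming NumPy is available. It builds one subspace per category from the SVD of that category's document matrix and assigns a query to the category whose subspace leaves the smallest projection residual. This is not the thesis's implementation: the function names (`fit_class_subspaces`, `subspace_distance`, `classify`), the `rank` parameter, and the toy data are hypothetical, and the sketch measures the distance to each subspace directly instead of mapping subspaces and queries to points in a higher-dimensional space as the abstract describes.

```python
import numpy as np


def fit_class_subspaces(X, y, rank):
    """Return {label: orthonormal basis} with one subspace per category.

    X holds one document vector per row; y holds the category labels.
    `rank` (an assumed parameter) caps the dimension of each subspace.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    bases = {}
    for label in sorted(set(y.tolist())):
        # Columns of A are the document vectors of one category.
        A = X[y == label].T
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        # Keep at most `rank` left singular vectors with non-zero singular value.
        r = min(rank, int(np.sum(s > 1e-10)))
        bases[label] = U[:, :r]
    return bases


def subspace_distance(x, U):
    """Euclidean distance from x to span(U), i.e. the projection residual."""
    x = np.asarray(x, dtype=float)
    return float(np.linalg.norm(x - U @ (U.T @ x)))


def classify(x, bases):
    """Assign x to the category whose subspace is nearest."""
    return min(bases, key=lambda label: subspace_distance(x, bases[label]))


# Toy data (made up): four features, two categories with disjoint vocabularies.
X_train = [
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 3.0, 1.0],
]
y_train = ["grain", "grain", "trade", "trade"]

bases = fit_class_subspaces(X_train, y_train, rank=2)
print(classify([1.0, 1.0, 0.1, 0.0], bases))
```

Comparing a query against one subspace per category, rather than against every training document, is what lets this kind of classifier avoid the per-sample scan that makes plain nearest neighbor search slow on large training sets.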
Keywords/Search Tags: text categorization, nearest subspace search, feature selection, feature weighting