
Research And Implementation Of Text Classification Algorithm

Posted on: 2017-03-10
Degree: Master
Type: Thesis
Country: China
Candidate: M Shi
GTID: 2308330485464141
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology and the spread of its applications, we have entered an era of information overload. On the one hand, such a large information base can meet users' needs for many types of information; on the other hand, its content is so complex that accurately retrieving the information one needs has become a problem. Text classification techniques have been proposed to address this problem, allowing users to obtain the required information more quickly and easily. Text classification can not only assign categories to new information based on information that has already been labeled, but can also effectively process and organize vast amounts of Internet information.

At present there are many methods for improving classification performance. However, with the rapid expansion of the information base, running a classification algorithm raises the problem of how to find representative data quickly and accurately. Feature selection and feature weighting have a direct impact on this problem, so this paper studies these two phases in depth and proposes two improvements: feature selection based on the ant colony algorithm, and an improved feature weighting method that combines prior information about categories with the distribution of features.

The main work of this paper can be summarized as follows:

1. Several typical feature selection and feature weighting methods are reviewed, and their strengths and weaknesses are analyzed.

2. The initial feature space has excessively high dimension, and the initial feature set contains a large number of correlated and redundant features. To address this, the paper combines the ant colony algorithm with feature selection. It studies the evaluation function, the probabilistic transition rule, and the pheromone update rule, and uses a local search mechanism, thereby effectively excluding correlated and redundant features.

3. The traditional TF-IDF feature weighting method simply considers term frequency and ignores the effect on classification results of the prior information about categories and of the feature distribution over the entire training set. The paper improves traditional TF-IDF in two steps. First, it combines the TF-IDF and TF-RF weighting methods into a TF-RFIDF feature weighting method. Second, building on TF-RFIDF, it introduces information distribution entropy parameters within a category and among different categories, based on the concept of entropy, yielding the feature weighting method TF-RFIDFimp, which further improves the accuracy of feature weights.

Experimental results show that the improved algorithm achieves higher precision, recall, and F-measure than traditional methods, confirming that it can improve the performance of text classification.
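As a point of reference for the two baseline weighting schemes named in the abstract, the following is a minimal Python sketch of TF-IDF and TF-RF on a toy corpus. It is not the thesis's implementation. The corpus, the function names, and the unsmoothed IDF variant log(N / df) are assumptions made for illustration; the RF term uses the commonly cited definition log2(2 + a / max(1, c)). The abstract does not give formulas for TF-RFIDF or TF-RFIDFimp, so they are not reproduced here.

```python
import math

# Toy corpus (hypothetical): each document is a list of tokens with one label.
DOCS = [
    ["goal", "match", "team"],
    ["team", "goal", "goal"],
    ["stock", "market", "team"],
    ["market", "price", "team"],
]
LABELS = ["sport", "sport", "finance", "finance"]


def tf(term, doc):
    """Raw term frequency: occurrences of `term` in one tokenized document."""
    return doc.count(term)


def idf(term, docs):
    """Inverse document frequency log(N / df); 0.0 if the term never occurs."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0


def rf(term, docs, labels, category):
    """Relevance frequency log2(2 + a / max(1, c)).

    a = documents of `category` that contain the term,
    c = documents of all other categories that contain the term.
    """
    a = sum(1 for d, y in zip(docs, labels) if y == category and term in d)
    c = sum(1 for d, y in zip(docs, labels) if y != category and term in d)
    return math.log2(2 + a / max(1, c))


def tf_idf(term, doc, docs):
    """TF-IDF weight of `term` in `doc`; ignores category labels."""
    return tf(term, doc) * idf(term, docs)


def tf_rf(term, doc, docs, labels, category):
    """TF-RF weight of `term` in `doc` with respect to `category`."""
    return tf(term, doc) * rf(term, docs, labels, category)


if __name__ == "__main__":
    doc = DOCS[1]
    for term in ("goal", "team"):
        print(term, tf_idf(term, doc, DOCS), tf_rf(term, doc, DOCS, LABELS, "sport"))
```

In this toy corpus "team" occurs in every document, so its IDF is zero regardless of category, whereas the RF of "goal" for the "sport" category is higher than that of "team" because "goal" occurs only in sport documents. This illustrates the kind of category prior information that TF-IDF alone does not use.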
Keywords/Search Tags: Ant Colony Algorithm, Feature Selection, TF-RF, Information Distribution Entropy, Feature Weighting