
Research On Automated Text Categorization Based On RBF Network

Posted on: 2004-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Wang
GTID: 2168360092493490
Subject: Computer software and theory
Abstract/Summary:
In the real world, most available information is stored in text databases (or document databases), which draw on a variety of sources such as news, research papers, reports, books, and magazines. With the rapid progress of computer technology and the Internet, a great quantity of electronic documents appears every day, and the capacity of text databases is growing explosively. To make use of these immense text resources, we must organize them into categories according to their content. Done manually, this work demands taxing labor from many experienced professionals; it is slow, costly, and inefficient, and it can hardly keep pace with the explosion of information. Hence automatic text categorization by computer has become increasingly popular and important.

Automated text categorization draws on many theories and techniques, including statistics, information retrieval, data mining, natural language understanding, pattern recognition, and machine learning, so it is a highly interdisciplinary and challenging task.

At present, most text categorization methods are based on similarity: in the learning stage, a class feature vector is learned for each target class; to classify a new text feature vector, its similarity to each class feature vector is computed, and the class label with the highest similarity is returned as the prediction.
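The similarity-based scheme described above can be sketched as follows. This is a minimal illustration, not the method of any particular system: it assumes that each class feature vector is the mean of the class's training vectors and that similarity is measured by cosine similarity, neither of which the abstract specifies. The function names, the toy vocabulary, and the data are all hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def learn_class_vectors(vectors, labels):
    """Learning stage: average the training vectors of each class (an assumption)."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()}

def classify(class_vectors, vec):
    """Prediction stage: return the label whose class vector is most similar to vec."""
    return max(class_vectors, key=lambda label: cosine(class_vectors[label], vec))

# Toy term-frequency vectors over a hypothetical vocabulary [ball, ring, model, data].
train_vectors = [[3, 0, 0, 0], [0, 3, 0, 0], [0, 0, 2, 3], [0, 0, 3, 2]]
train_labels = ["sports", "sports", "computing", "computing"]

class_vectors = learn_class_vectors(train_vectors, train_labels)
prediction = classify(class_vectors, [2, 0, 0, 1])
print(prediction)
```

Because `classify` returns exactly one label and each class is summarized by a single vector, a sketch of this kind can neither assign a text to several classes nor represent a class made up of dissimilar groups of texts.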
However, this approach has two problems. First, texts within some classes have low similarity to one another; for example, although articles on basketball and articles on boxing are both labeled Sports, they differ considerably. Second, one text can belong to multiple classes; for instance, a paper on data mining may also discuss artificial intelligence.

For these reasons, this paper studies the use of RBF networks for automated text categorization. The key idea is as follows. The texts of the training set are clustered, yielding clusters with high similarity within each cluster and low similarity between clusters. A radial basis function (RBF) is defined at the center of each cluster, and a two-layer neural network composed of these RBFs is learned. To avoid over-fitting, we use ridge regression, which adds to the cost function a weight penalty term with an appropriate regularization parameter and thereby yields a smoother function. This method handles the first problem well because it takes into account the differences between clusters within a class. The second problem, membership in multiple classes, is also easily settled: if the outputs of more than one RBF network exceed the threshold value, the new text vector is assigned to all of the corresponding classes simultaneously.

We obtained good results in our experiments on using RBF networks for automated text categorization: the prediction accuracy holds steadily at up to 90%. However, some problems remain unresolved. First, how to determine the number and size of the clusters automatically during clustering. Second, how to apply the "local" ridge regression method, which involves multiple regularization parameters, when learning the RBF network.
Third, clusters of irregular shape cannot be represented by radial basis functions, so other basis functions that can describe irregular shapes must be found. All of these problems are left for our future study.
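The RBF approach outlined in the abstract (an RBF at each cluster center, output weights fitted by ridge regression, and a threshold on each class's output) might look roughly like the sketch below. It is an illustration under stated assumptions rather than the thesis's implementation: the basis function is assumed to be a Gaussian with one shared radius, the clustering step is assumed to have been done already (the centers are simply supplied), a single global regularization parameter `lam` is used, and the threshold of 0.5, the function names, and the two-dimensional toy data are all hypothetical.

```python
import math

def rbf_row(x, centres, radius):
    """Gaussian RBF activations of x, one per cluster centre (Gaussian is an assumption)."""
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / radius ** 2)
            for c in centres]

def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ridge_weights(H, y, lam):
    """Minimise squared error + lam * sum(w_j^2): solve (H^T H + lam I) w = H^T y."""
    m = len(H[0])
    A = [[sum(row[i] * row[j] for row in H) + (lam if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    b = [sum(row[i] * t for row, t in zip(H, y)) for i in range(m)]
    return solve(A, b)

def fit_networks(X, Y, centres, radius, lam):
    """Fit one output weight vector per class; Y maps class name -> 0/1 targets."""
    H = [rbf_row(x, centres, radius) for x in X]
    return {label: ridge_weights(H, targets, lam) for label, targets in Y.items()}

def predict(weights, x, centres, radius, threshold=0.5):
    """Return every class whose network output exceeds the threshold."""
    h = rbf_row(x, centres, radius)
    return sorted(label for label, w in weights.items()
                  if sum(wj * hj for wj, hj in zip(w, h)) > threshold)

# Hypothetical 2-D "text vectors" forming three clusters of two points each.
X = [[0, 0], [0.5, 0], [4, 4], [4, 4.5], [0, 4], [0.5, 4]]
Y = {
    "sports":      [1, 1, 1, 1, 0, 0],  # two dissimilar clusters, one class
    "data_mining": [0, 0, 0, 0, 1, 1],  # third cluster carries two labels
    "ai":          [0, 0, 0, 0, 1, 1],
}
centres = [[0.25, 0], [4, 4.25], [0.25, 4]]  # cluster means, assumed precomputed
radius, lam = 1.5, 0.1

weights = fit_networks(X, Y, centres, radius, lam)
print(predict(weights, [0.2, 0.1], centres, radius))
print(predict(weights, [0.3, 3.9], centres, radius))
```

In this toy setup the "sports" class is covered by two separate RBFs, so points near either cluster activate it, and a point near the third cluster exceeds the threshold for both "data_mining" and "ai". A point far from every center activates no RBF and receives no label.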
Keywords/Search Tags:Automated Text Categorization, Dimension Reduction, Radial Basis Function, Cluster, Regularization, Ridge Regression