Research On Automated Text Categorization Based On RBF Network

Posted on:2004-08-26

Degree:Master

Type:Thesis

Country:China

Candidate:X Q Wang

Full Text:PDF

GTID:2168360092493490

Subject:Computer software and theory

Abstract/Summary:


In the real world, most available information is stored in text databases (or document databases), which draw on a variety of sources such as news, research papers, reports, books, and magazines. With the rapid progress of computing and the Internet, a great quantity of electronic documents appears every day, and the capacity of text databases is growing explosively. To make use of these immense text resources, we must organize them into categories according to their content. Done manually, this work demands taxing labor from many experienced professionals; it is slow, costly, and inefficient, and it cannot keep pace with the information explosion. Hence the problem of categorizing text automatically by computer has become increasingly popular and important.

Automated text categorization draws on many theories and techniques, including statistics, information retrieval, data mining, natural language understanding, pattern recognition, and machine learning, so it is a highly integrative and challenging task.

At present, most text categorization methods are based on similarity: in the learning stage, a class feature vector is learned for each target class; to classify a new text feature vector, the method computes its similarity to each class feature vector and returns the label of the most similar class as the prediction. This approach has two problems. First, texts within some classes have low similarity to one another; for example, articles on basketball and articles on boxing are both labeled Sports, yet they differ considerably. Second, one text can belong to multiple classes; for instance, a paper on data mining may also discuss artificial intelligence.

For these reasons, this thesis investigates the use of RBF networks for automated text categorization. The key idea is as follows. The texts in the training set are clustered, yielding clusters with high similarity within each cluster and low similarity between clusters. A radial basis function (RBF) is defined at the center of each cluster, and a two-layer neural network composed of these RBFs is learned. To avoid over-fitting, we use ridge regression, which adds to the cost function a weight penalty term with an appropriate regularization parameter and thereby yields a smoother function. This method handles the first problem well because it takes into account the differences between clusters within a class. The second problem, membership in multiple classes, is also easy to handle: if the outputs of more than one RBF network exceed the threshold value, the new text vector is assigned to all of the corresponding classes simultaneously.

Our experiments with RBF networks for automated text categorization gave good results: the prediction accuracy remained steadily as high as 90%. However, some problems remain unresolved. First, how to determine the number and size of the clusters automatically during clustering. Second, how to apply the "local" ridge regression method, which uses multiple regularization parameters, when learning the RBF network. Third, irregularly shaped clusters cannot be represented by radial basis functions, so other basis functions that can describe irregular shapes must be found. These problems will be addressed in our future work.
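The pipeline described in the abstract (cluster the training texts, place one RBF at each cluster center, fit the output weights by ridge regression, then assign every class whose network output exceeds a threshold) can be illustrated with a minimal sketch. It is not the thesis's implementation. Everything specific here is an assumption made for illustration: the Gaussian basis with a single shared width, k-means with farthest-point seeding as the clustering step, one output network per class over shared RBF units, the threshold of 0.5, the regularization value 0.1, and the toy 2-D vectors standing in for real text feature vectors. All function names are hypothetical.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    # Assumed clustering step: k-means with deterministic farthest-point seeding.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: sq_dist(p, centers[j]))].append(p)
        # Move each center to its cluster mean; keep it if the cluster is empty.
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def rbf_features(x, centers, width):
    # Gaussian RBF (assumed form) centered on each cluster center.
    return [math.exp(-sq_dist(x, c) / (2 * width ** 2)) for c in centers]

def solve(a, b):
    # Solve a w = b by Gauss-Jordan elimination with partial pivoting.
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ridge(H, y, lam):
    # Minimize ||H w - y||^2 + lam * ||w||^2 via (H^T H + lam I) w = H^T y.
    k = len(H[0])
    A = [[sum(h[i] * h[j] for h in H) + (lam if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(h[i] * t for h, t in zip(H, y)) for i in range(k)]
    return solve(A, b)

def train(X, labels, classes, k, width=1.0, lam=0.1):
    centers = kmeans(X, k)
    H = [rbf_features(x, centers, width) for x in X]
    # One weight vector per class, trained on 1/0 membership targets.
    weights = {c: fit_ridge(H, [1.0 if c in ls else 0.0 for ls in labels], lam)
               for c in classes}
    return centers, weights

def predict(x, centers, weights, width=1.0, threshold=0.5):
    h = rbf_features(x, centers, width)
    # Return every class whose output exceeds the threshold (multi-label).
    return {c for c, w in weights.items()
            if sum(wi * hi for wi, hi in zip(w, h)) > threshold}

# Toy data: two dissimilar clusters share the label "sports";
# a third cluster carries two labels at once.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
     [5.0, 5.0], [5.1, 4.8], [4.9, 5.2],
     [0.0, 5.0], [0.2, 5.1], [0.1, 4.8]]
labels = [{"sports"}] * 6 + [{"data mining", "AI"}] * 3
centers, weights = train(X, labels, ["sports", "data mining", "AI"], k=3)

print(sorted(predict([0.1, 0.1], centers, weights)))
print(sorted(predict([0.1, 5.0], centers, weights)))
```

In this toy setup the two "sports" clusters get separate RBF units, which is how the approach accommodates low similarity inside one class, and a vector near the third cluster receives both of its labels because two outputs cross the threshold. A vector far from every center activates no unit and receives no label.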

Keywords/Search Tags:

Automated Text Categorization, Dimension Reduction, Radial Basis Function, Cluster, Regularization, Ridge Regression


Related items

1	Adaptation of closed form regularization parameters with prior information to the radial basis function neural network for high frequency financial time series
2	The Research Of Automatic Text Categorization System Based On Neural Networks
3	Research And Application Of Text Mining In The Patent Automatic Classification Based On Neural Network
4	Research On Establishing An Improved Method To Determine The Radial Basis Function Center Of Radial Basis Function Neural Network
5	Research On Application Of Radial Basis Function In Reverse Engineering
6	Study Of Distribution Regression Based On Stochastic Configuration Networks
7	A Dimension Reduction Method For Large-scale Text Categorization
8	A Dimension Reduction Method For Large-scale Text Categorization
9	Automated Text Classification Model Based On Projection Pursuit Regression
10	Approximation Capabilities Of Sum-of-Product Neural Networks And Radial Basis Function Neural Networks