Research On XML Text Categorization Based On Bayesian Classifier

Posted on: 2011-12-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z Li
GTID: 2178360305955078
Subject: Computer software and theory
Abstract/Summary:
Text categorization is the task of determining the category of a text from its content within a given categorization system. It is a research focus in data mining and information retrieval. XML text is a kind of semi-structured text that is both normative and flexible, so since XML was proposed, more and more applications have been built on it. Today, XML texts are found in many places, such as digital libraries and Internet databases, and people come into contact, directly or indirectly, with more and more data in XML form. To manage these data effectively, the XML texts often need to be classified first, so research on automatic XML text categorization is important.

The Bayesian classifier comes from probability theory and is based on probabilistic reasoning. Simply put, a Bayesian classifier assigns an unknown sample to the category with the maximum posterior probability. In text categorization, the number of feature items is huge; if the classifier is trained and tested on the feature items directly, it takes a lot of time, and the categorization result is degraded by the many irrelevant feature items. So, on the one hand, the feature items need to be reduced, that is, the most important feature items must be found. On the other hand, when XML text is classified with a Bayesian classifier, the assumption of "class conditional independence" is usually made: given the value of the class attribute of an instance, the occurrence of an attribute of the instance depends only on the class attribute and not on the other attributes. A Bayesian classifier that makes this assumption is called a Naïve Bayesian classifier. Using this kind of classifier reduces the complexity of the training and testing processes.
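As a rough illustration of the maximum-posterior rule under class conditional independence, the following minimal sketch trains a multinomial Naïve Bayesian classifier on lists of feature items. It is not the thesis's implementation: the function names `train_naive_bayes` and `classify`, the add-one (Laplace) smoothing, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Train on docs, a list of (feature_items, label) pairs.

    Returns log priors and add-one-smoothed log likelihoods per class.
    The smoothing choice is an assumption of this sketch.
    """
    class_counts = Counter(label for _, label in docs)
    feature_counts = defaultdict(Counter)
    vocab = set()
    for features, label in docs:
        feature_counts[label].update(features)
        vocab.update(features)

    total_docs = len(docs)
    log_prior = {c: math.log(n / total_docs) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        denom = sum(feature_counts[c].values()) + len(vocab)
        log_like[c] = {
            f: math.log((feature_counts[c][f] + 1) / denom) for f in vocab
        }
    return log_prior, log_like


def classify(features, log_prior, log_like):
    """Return the class with the maximum posterior (computed in log space).

    Independence assumption: per-feature log likelihoods are simply summed.
    Feature items never seen in training are skipped.
    """
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_like[c][f] for f in features if f in log_like[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy training set
training = [
    (["goal", "match", "team"], "sports"),
    (["election", "vote", "party"], "politics"),
    (["team", "coach"], "sports"),
]
log_prior, log_like = train_naive_bayes(training)
print(classify(["team", "match"], log_prior, log_like))
```

Working in log space avoids numerical underflow when many feature-item probabilities are multiplied, which matters when the number of feature items is large.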
In Naïve Bayesian XML text categorization, the most common approach is to view XML text as plain text, that is, to ignore the tree structure of the XML text and select words as feature items to train the Naïve Bayesian classifier and carry out categorization. However, this approach ignores the structural characteristics of XML text.

According to DOM, an XML text can be viewed as a tree consisting of label nodes, attribute nodes and text nodes. There is usually a relation between the class of a text and its structure, and each type of text has its own style. Using the XML tree structure and combining the content and structure of XML text helps to improve the categorization result.

To address these problems of Naïve Bayesian XML text categorization, improvements are made in the following aspects.

First, according to the tree structure of XML text, three kinds of feature items are used to represent XML text: list, support branch structure and word. A list is the sequence of nodes from the root to a leaf in the XML DOM tree. A support branch structure is the sub-structure consisting of a non-leaf node and its direct child nodes of different names. A word is a word contained in a leaf node of the XML DOM tree. Lists and support branch structures reflect the structural characteristics of XML text, and words reflect its content characteristics, so both kinds of characteristics are considered, which helps improve the categorization result.

Second, based on the three kinds of feature items extracted, a "three-stage dimension reduction strategy" is proposed. The first stage uses the statistical information of the items, namely the TF-IDF weights of the feature items, to reduce the dimension by selecting the feature items whose TF-IDF weight is larger than a certain value.
At this stage, the TF-IDF weight of each feature item is calculated, the feature items are sorted by weight, and those with larger weights are chosen. In the second stage, a strong-relation formula is proposed to find feature items that are strongly related and merge them. Strongly related feature items are pairs of feature items in which the occurrence of one has a great impact on the occurrence of the other. In the third stage, the number of word feature items is reduced according to the structural feature items. In essence, this reduces the dimension by using the structural characteristics of XML. The idea is that whether a word is important depends on its position: if it is under an important list, the word is important; otherwise it is not. In this way, the number of feature items is greatly reduced.

Third, different feature items differ in their importance for XML categorization. To reflect these differences, a method for calculating the weights of feature items is proposed, and the weights are integrated into the Naïve Bayesian classifier, which gives structure-weighting based Naïve Bayesian XML text categorization. The idea of the feature weight is that the closer a feature is to the root of the XML DOM tree, the larger its weight should be. For a support branch structure (SBS), the wider it is, the larger its weight should be.

In addition, feedback is studied. Feedback refers to improving a classifier by using its test results. In the text categorization field, feedback means using texts that meet certain requirements and have been classified by the classifier to train the classifier, in order to solve the problem of "lack of training texts".
In this paper, test results are used to modify the weights of the various feature items in order to improve the categorization results.

Experiments show that the precision of the structure-weighting based Naïve Bayesian XML classification algorithm is better than that of traditional Bayesian XML text categorization, and that the feedback based Naïve Bayesian XML text categorization algorithm is better still. The data set used is the Wikipedia XML data set, which is widely used in XML IR, machine learning, text categorization and text clustering, and is adopted by INEX; it is an authoritative data set.

How to understand the meaning of words and how to make use of dynamic characteristics will be considered in future research.
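The three kinds of feature items described in the abstract (list, support branch structure and word) can be sketched with Python's standard `xml.etree.ElementTree` parser. This is only an illustrative reading of the definitions, not the thesis's code: the names `extract_features` and `depth_weight` are hypothetical, leaf elements stand in for the leaf nodes of the DOM tree, text mixed into non-leaf elements is ignored, and the `1 / depth` weight is an assumed stand-in for the unstated weighting formula (it only preserves the property that features closer to the root weigh more).

```python
import xml.etree.ElementTree as ET


def extract_features(xml_string):
    """Extract list, support-branch-structure and word feature items."""
    root = ET.fromstring(xml_string)
    lists, sbs, words = [], [], []

    def walk(node, path):
        path = path + [node.tag]
        children = list(node)
        if not children:
            # Leaf element: the root-to-leaf tag sequence is a "list",
            # and its text supplies the "word" feature items.
            lists.append("/".join(path))
            words.extend((node.text or "").lower().split())
        else:
            # Non-leaf element: pair it with its distinct child names.
            child_names = tuple(sorted({child.tag for child in children}))
            sbs.append((node.tag, child_names))
            for child in children:
                walk(child, path)

    walk(root, [])
    return lists, sbs, words


def depth_weight(list_feature):
    """Assumed weight: the reciprocal of the path depth (hypothetical)."""
    depth = list_feature.count("/") + 1
    return 1.0 / depth


# Hypothetical sample document
sample = (
    "<article><title>Bayes rule</title>"
    "<body><p>Text categorization</p><p>XML trees</p></body></article>"
)
lists, sbs, words = extract_features(sample)
print(lists)
print(sbs)
print(words)
```

Repeated root-to-leaf sequences are kept as separate entries so that their frequencies can later feed a TF-IDF style selection such as the first reduction stage.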
Keywords/Search Tags: XML, Naïve Bayesian, Text Categorization, Feature Weight, Feedback Bayesian