
Research Of Hierarchical Text Classification Methods Based On Category Structure

Posted on: 2012-03-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C L Zhu
Full Text: PDF
GTID: 1118330335985130
Subject: Computer application technology
Abstract/Summary:
Text categorization is a key technology in text mining. Its main task is to assign a predefined category to a text based on its content by means of supervised learning. It has been widely used in natural language processing, information organization and management, and related fields. In recent years, however, the number of categories in text categorization tasks has grown steadily. For example, there are thousands of categories in directories such as Yahoo! Directory and the Open Directory Project (ODP). If these categories are organized as a flat list, assigning the correct category to a text becomes much harder, and a user must spend considerable time finding a category of interest. In practice, therefore, categories are usually organized hierarchically. Because this hierarchy usually resembles a tree, it is called a category tree. Based on the category tree structure, researchers have proposed hierarchical text classification methods. Such methods not only help users search and browse documents in a way that matches their habits, but also reduce computation and improve classification quality by limiting the search scope.
Because all categories are organized hierarchically in hierarchical text classification, a feature that distinguishes categories well at one level may contribute little to distinguishing categories at other levels. In addition, among threshold reduction strategies, which can reduce blocking, it is difficult to determine how far the threshold should be lowered. Furthermore, the categories in a category tree are related to one another, and so are their training sets. These factors should not be ignored in feature selection, in training the hierarchical classification model, or in determining classifier thresholds.
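To make the category tree and the idea of limiting the search scope concrete, the following is a minimal sketch of top-down classification over a small tree. The tree, the keyword sets, the `keyword_score` function, and the fixed threshold of 0.5 are all hypothetical stand-ins chosen for illustration; the dissertation does not specify them, and a real system would use trained classifiers at each node.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CategoryNode:
    name: str
    children: List["CategoryNode"] = field(default_factory=list)


# Hypothetical category tree: Root -> {Science, Sports} -> leaf categories.
ROOT = CategoryNode("Root", [
    CategoryNode("Science", [CategoryNode("Physics"), CategoryNode("Biology")]),
    CategoryNode("Sports", [CategoryNode("Soccer"), CategoryNode("Tennis")]),
])

# Hypothetical keyword sets standing in for trained per-node classifiers.
KEYWORDS = {
    "Science": {"experiment", "theory"},
    "Sports": {"match", "team"},
    "Physics": {"quantum", "particle"},
    "Biology": {"cell", "gene"},
    "Soccer": {"goal", "penalty"},
    "Tennis": {"serve", "racket"},
}


def keyword_score(category: str, doc: str) -> float:
    """Fraction of the category's keywords that occur in the document."""
    tokens = set(doc.lower().split())
    words = KEYWORDS[category]
    return len(words & tokens) / len(words)


def top_down_classify(root: CategoryNode, doc: str,
                      score: Callable[[str, str], float],
                      threshold: float = 0.5) -> List[str]:
    """Follow the best-scoring child at each level and return the path taken.

    Only the children of the current node are scored, which limits the
    search scope. The descent stops as soon as the best child falls below
    the threshold, so a rejection near the root hides every category below.
    """
    path = []
    node = root
    while node.children:
        best = max(node.children, key=lambda child: score(child.name, doc))
        if score(best.name, doc) < threshold:
            break
        path.append(best.name)
        node = best
    return path


routed = top_down_classify(ROOT, "the experiment confirmed the quantum theory",
                           keyword_score)
blocked = top_down_classify(ROOT, "a quantum particle was observed",
                            keyword_score)
```

In this toy setup the first document is routed through Science to Physics. The second document is rejected at the top level and receives no category, although the Physics node alone would have accepted it; this is the kind of upper-level rejection that the text calls blocking.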
Top-down level-based methods have attracted more attention from researchers because they make better use of the hierarchical structure than the big-bang method does. These methods, however, suffer from the blocking problem, in which a document wrongly rejected by an upper-level classifier never reaches the lower-level classifiers that could label it correctly. This problem is unavoidable because of the limitations of the classifiers themselves. How to exploit the hierarchical structure information of categories and samples, together with the potential information carried by blocking itself, to reduce the effect of blocking and improve the quality of hierarchical classification therefore remains an issue worthy of study.
The major research contents and innovations of this dissertation are as follows.
1. A text feature selection method for hierarchical classification is proposed.
We propose a text feature selection method for hierarchical classification that builds on existing feature selection methods. First, we introduce two new concepts, category hierarchical correlative and category hierarchical non-correlative, according to the semantic relationships among categories in the category tree. We then propose a mathematical method that measures the degree of hierarchical correlation between two categories according to the structure of the category tree and the distribution of each category's training set. Because a feature's contribution to discriminating categories differs across the training sets at different levels, and because some categories are hierarchically correlative, we use the proposed correlation degree to assign each category an importance degree that depends on its level. We then obtain the category correlation degree of each feature using a probabilistic method. Finally, we calculate the discriminative ability of each feature for each category from the preceding computations.
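The abstract does not give the formula for the hierarchical correlation degree. As a rough illustration of how the relatedness of two categories can be read off a category tree, the sketch below uses a Wu-Palmer style similarity, a well-known structural measure that is swapped in here and is not the dissertation's method. It uses tree structure only and ignores the training set distribution that the proposed measure also takes into account. The `PARENT` map is a hypothetical tree.

```python
# Hypothetical category tree stored as child -> parent links.
PARENT = {
    "Science": "Root", "Sports": "Root",
    "Physics": "Science", "Biology": "Science",
    "Soccer": "Sports",
}


def ancestors(category):
    """Return the path from the category up to the root, both inclusive."""
    path = [category]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(category):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(category))


def structural_correlation(a, b):
    """Wu-Palmer style similarity: 2 * depth(LCA) / (depth(a) + depth(b)).

    This is an assumed stand-in for a hierarchical correlation degree.
    Values lie in (0, 1]; categories sharing a deep common ancestor score
    higher than categories that only meet at the root.
    """
    ancestors_of_b = set(ancestors(b))
    # The first ancestor of a that is also an ancestor of b is the
    # lowest common ancestor (LCA).
    lca = next(c for c in ancestors(a) if c in ancestors_of_b)
    return 2 * depth(lca) / (depth(a) + depth(b))


siblings = structural_correlation("Physics", "Biology")   # share Science
distant = structural_correlation("Physics", "Soccer")     # share only Root
```

Under this stand-in measure, sibling categories such as Physics and Biology score higher than categories in different subtrees, which matches the intuition behind treating some category pairs as hierarchically correlative and others as not.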
Experiments show that the new approach outperforms traditional feature selection methods, both in the quality of the selected features and on standard classification metrics: accuracy, F1, and micro-precision.
The innovations of this part are summarized as follows: (1) We introduce the concept of category hierarchical correlation by analyzing the semantic relationships between categories in the category tree, and we propose a mathematical method to measure it. (2) We select different features for each classifier in the category tree, according to the characteristics of the tree's hierarchical structure and the differing contribution of a feature to category discrimination for training sets at different levels. This explores a new approach to feature selection in hierarchical text classification.
2. A novel hierarchical text classification method based on global information of the category tree is proposed.
In top-down level-based hierarchical text classification, an error made by an upper-level classifier is propagated to the lower-level classifiers because of the blocking problem. Based on this characteristic, we propose a new hierarchical loss function that penalizes a classifier that causes blocking according to its level and its range of effect. Then, to minimize the hierarchical loss, we introduce the potential information in blocking and the hierarchical structure information of categories and samples into the boosting framework, and we improve the classification model by adjusting the training set at each iteration. Finally, by combining the classifiers built at each boosting iteration, we obtain a hierarchical classification model that reduces blocking at upper nodes and improves overall hierarchical classification performance. Experiments show that the trained classifier outperforms traditional AdaBoost in terms of accuracy, precision, recall, F1, and micro-precision.
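The abstract states that the hierarchical loss depends on the level of the blocking classifier and on its range of effect, but it does not give the formula. The sketch below is one possible reading under stated assumptions: the range of effect is taken to be the share of leaf categories under the blocking node, the penalty is divided by the node's level, and sample weights are updated with an AdaBoost-style exponential factor. The tree, the `LEVEL` table, and both functions are hypothetical and are not the dissertation's definitions.

```python
import math

# Hypothetical category tree and node levels (children of the root are level 1).
CHILDREN = {
    "Root": ["Science", "Sports"],
    "Science": ["Physics", "Biology", "Chemistry"],
    "Sports": ["Soccer"],
}
LEVEL = {"Science": 1, "Sports": 1,
         "Physics": 2, "Biology": 2, "Chemistry": 2, "Soccer": 2}


def leaf_count(node):
    """Number of leaf categories in the subtree rooted at node."""
    kids = CHILDREN.get(node, [])
    return 1 if not kids else sum(leaf_count(kid) for kid in kids)


def blocking_penalty(node):
    """Assumed penalty: (share of leaves hidden by the block) / level.

    A block at a shallow node with a large subtree is penalized most.
    """
    effect_range = leaf_count(node) / leaf_count("Root")
    return effect_range / LEVEL[node]


def reweight(weights, blocked_at, alpha=1.0):
    """AdaBoost-style update: blocked samples gain weight by their penalty.

    blocked_at[i] names the node that wrongly rejected sample i, or is None
    if the sample was not blocked. The result is normalized to sum to 1.
    """
    raw = [
        w * math.exp(alpha * blocking_penalty(node)) if node is not None else w
        for w, node in zip(weights, blocked_at)
    ]
    total = sum(raw)
    return [r / total for r in raw]


new_weights = reweight([1 / 3, 1 / 3, 1 / 3], ["Science", "Physics", None])
```

With these assumptions, a sample blocked at Science gains more weight than one blocked at Physics, so the next boosting round concentrates on the upper-level errors that hide the most categories.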
The experiments also show that the potential information in blocking helps train a better classifier, improves the classification results, and reduces the chance of blocking to some extent. This work can serve as a reference for the use of blocking information.
The innovation of this part is summarized as follows. It introduces hierarchical text classification into the boosting framework and proposes a new hierarchical loss function and a new method of updating sample weights, which combine the hierarchical structure information of categories and samples with the potential information in blocking to improve the hierarchical classification model and the overall performance of hierarchical classification.
3. A new hierarchical text classification method based on a backtracking algorithm is proposed.
Because a feature's contribution to discrimination differs for samples at different levels, we perform feature selection by combining the information gain (IG) method with hierarchical information about the samples, obtaining a feature set better suited to hierarchical classification.
In threshold reduction strategies, which can reduce blocking, it is hard to determine how far the threshold should be lowered. To determine a suitable threshold for each classifier, we analyze the distribution of the training set in each category. Taking the connections between categories into account, we divide the training set of the classifier built at a category node into three parts to train the KNN classifier. This yields a value range for each classifier's threshold, which provides a basis for threshold selection. We then obtain a set of candidate categories for a test document using the backtracking algorithm, and we determine the final category by the distance between the document and the centroid of each candidate category.
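The last two steps, collecting candidate categories by backtracking and then choosing the nearest centroid, can be sketched as follows. This is a minimal illustration under assumptions, not the dissertation's algorithm: the tree, the per-node scores, the lowered thresholds in `LOW`, and the two-dimensional centroids are all hypothetical, and the scores would in practice come from the KNN classifiers trained at each node.

```python
import math

# Hypothetical category tree.
CHILDREN = {
    "Root": ["Science", "Sports"],
    "Science": ["Physics", "Biology"],
    "Sports": ["Soccer", "Tennis"],
}

# Hypothetical classifier scores for one test document.
SCORES = {"Science": 0.45, "Sports": 0.40,
          "Physics": 0.70, "Biology": 0.20,
          "Soccer": 0.60, "Tennis": 0.10}

# Assumed lowered thresholds, e.g. the lower end of each value range.
LOW = {name: 0.35 for name in SCORES}


def candidate_leaves(node, scores, thresholds):
    """Depth-first search with backtracking over the category tree.

    Every child whose score reaches its threshold is explored; after a
    branch is exhausted the search returns to the parent and tries the
    next child. The leaf categories reached form the candidate set.
    """
    kids = CHILDREN.get(node, [])
    if not kids:
        return [node]
    found = []
    for kid in kids:
        if scores[kid] >= thresholds[kid]:
            found.extend(candidate_leaves(kid, scores, thresholds))
    return found


def nearest_centroid(doc_vector, candidates, centroids):
    """Pick the candidate whose centroid is closest in Euclidean distance."""
    return min(candidates, key=lambda c: math.dist(doc_vector, centroids[c]))


# Hypothetical centroids and document vector in a 2-D feature space.
CENTROIDS = {"Physics": (1.0, 0.0), "Soccer": (0.0, 1.0)}

candidates = candidate_leaves("Root", SCORES, LOW)
final = nearest_centroid((0.8, 0.3), candidates, CENTROIDS)
strict = candidate_leaves("Root", SCORES, {name: 0.5 for name in SCORES})
```

With a strict threshold of 0.5 at every node the document is blocked at the top level and `strict` is empty. With the lowered thresholds both Physics and Soccer survive as candidates, and the centroid distance then selects Physics.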
Experiments show that the proposed method reduces blocking at upper levels and outperforms the traditional KNN method.
The innovations of this part are summarized as follows: (1) We improve the IG method according to the hierarchical distribution of categories and samples, which makes the selected features more suitable for hierarchical text classification. (2) Building on the threshold reduction strategy, we propose a new method for determining the value range of a threshold by combining the distribution of the training set at a category with the KNN method. In addition, the method obtains a candidate category set for a document using the backtracking algorithm and determines the final category by the distance between the document and each candidate category.
In summary, the thesis is developed around the hierarchical structure of categories. It presents further studies on feature selection, the use of blocking information, and blocking reduction strategies, according to the characteristics of hierarchical text classification, and validates them experimentally. This research enriches the study of hierarchical text classification and explores a new way to make better use of the hierarchical information in a category tree.
Keywords/Search Tags: Hierarchical Text Classification, Category Tree, Feature Selection, Blocking, Backtracking