Research On Two-stage Hierarchical Text Classification Model Based On Neighbor-assistant Strategy

Posted on: 2017-07-25  Degree: Master  Type: Thesis
Country: China  Candidate: C Y Wang
GTID: 2348330509953998  Subject: Computer software and theory
Abstract/Summary:
Traditional text classification methods achieve very good results when the number of categories is small. However, as the number of categories grows, as in LookSmart or ODP, it becomes harder to assign a document to the correct category when all categories are arranged in parallel. Therefore, such categories are usually organized into a hierarchical structure first. Building on this structure, researchers introduced hierarchical text classification methods, of which Big-bang and Top-down are the most commonly used. The Big-bang method applies a single classifier to all categories, which results in an unbearable training time. The Top-down method makes better use of the information provided by the hierarchy than Big-bang, but it suffers from the blocking problem: a document misclassified at an upper level can never reach the correct category at a lower level. These drawbacks indicate that traditional hierarchical text classification cannot be applied to large-scale hierarchical text classification tasks.

The recently proposed two-stage hierarchical text classification (THTC) model is an effective method for large-scale hierarchical text classification. Compared with traditional hierarchical text classification, it significantly improves both time efficiency and classification performance. However, several problems remain in its classification process. Therefore, this thesis proposes a new two-stage hierarchical text classification model based on a neighbor-assistant strategy (the THTC-NA model) that improves on the traditional THTC model.

The main contents of this thesis are as follows: The methods and applications of hierarchical text classification are systematically studied, and a new THTC model (THTC-NA) is proposed. The THTC-NA model consists of two stages: searching and classification.
To reduce the size of the data, the searching stage extracts candidate categories for the document to be classified using a category-based search strategy and prunes the original hierarchy accordingly. The candidate categories are then organized into a hierarchy using the Top-down method. In this way the pruned hierarchy keeps the same structure as the original category hierarchy, so there is no need to train a special classifier for each document to be classified.

The classification stage uses the classification results of each node's neighbor nodes in the category hierarchy to assist the classification decision at that node, and a confidence level is introduced to address the unknown reliability of neighbor nodes. At the same time, a global search is performed over hierarchical paths, which avoids the local-optimum trap caused by a misjudgment at a single node. Experiments on the Newsgroups-18828 data set show that the neighbors' classification results help identify the categories of documents.

To address the problem that every document to be classified needs a specially trained classifier, the THTC-NA model uses the Top-down method to organize candidate categories into a hierarchy so that their positions in the large-scale category hierarchy remain unchanged. As a result, the method needs only one classifier for each level of the large-scale category hierarchy.

To address the problem that the THTC model does not take full advantage of the searching-stage results, the THTC-NA model weights and combines the results of the category searching stage and the classification stage. The experimental results indicate that combining the results of the two stages to identify the category of a document performs better than using the classification-stage results alone.
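
The two stages described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the toy hierarchy, the choice of parent and children as "neighbors", the blending weights `alpha` and `beta`, and every score and confidence value are assumptions made only for illustration. The sketch prunes the hierarchy to the candidate categories, blends each node's score with a confidence-weighted average of its neighbors' scores, scores whole paths instead of deciding greedily per level, and finally combines the searching-stage and classification-stage scores.

```python
# Toy hierarchy: child -> parent (None marks the root). Hypothetical data.
PARENT = {
    "root": None,
    "comp": "root", "rec": "root",
    "comp.graphics": "comp", "comp.hardware": "comp",
    "rec.autos": "rec", "rec.sport": "rec",
}

def path_to_root(node):
    """Return the path from the top-level category down to `node` (root excluded)."""
    path = []
    while node is not None and PARENT[node] is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]

def prune(candidates):
    """Keep only the candidate categories and their ancestors; positions do not change."""
    kept = set()
    for leaf in candidates:
        kept.update(path_to_root(leaf))
    return kept

def neighbors(node, kept):
    """Assumption: a node's neighbors are its parent and children that survive pruning."""
    result = [n for n in kept if PARENT[n] == node]
    if PARENT[node] in kept:
        result.append(PARENT[node])
    return result

def assisted_score(node, kept, clf_score, confidence, beta=0.7):
    """Blend a node's own score with the confidence-weighted mean of its neighbors' scores."""
    nbrs = neighbors(node, kept)
    total_conf = sum(confidence[n] for n in nbrs)
    if total_conf == 0:
        return clf_score[node]
    nbr_mean = sum(confidence[n] * clf_score[n] for n in nbrs) / total_conf
    return beta * clf_score[node] + (1 - beta) * nbr_mean

def classify(search_score, clf_score, confidence, alpha=0.3):
    """Score every candidate path as a whole and return the best candidate category."""
    kept = prune(search_score)
    best_leaf, best = None, float("-inf")
    for leaf in search_score:
        path = path_to_root(leaf)
        path_score = sum(
            assisted_score(n, kept, clf_score, confidence) for n in path
        ) / len(path)
        # Weighted combination of searching-stage and classification-stage results.
        combined = alpha * search_score[leaf] + (1 - alpha) * path_score
        if combined > best:
            best_leaf, best = leaf, combined
    return best_leaf

# Hypothetical scores: the top level slightly prefers "rec",
# but the path through "comp" is stronger overall.
search_score = {"comp.graphics": 0.8, "rec.autos": 0.6}
clf_score = {"comp": 0.45, "rec": 0.55, "comp.graphics": 0.9, "rec.autos": 0.3}
confidence = {"comp": 0.8, "rec": 0.8, "comp.graphics": 0.9, "rec.autos": 0.9}

print(classify(search_score, clf_score, confidence))  # prints comp.graphics
```

With these made-up numbers, a greedy Top-down pass using the raw scores would follow "rec" at the first level and could not recover, whereas scoring complete paths selects "comp.graphics". Whether the thesis combines scores linearly or defines neighbors this way is not stated in the abstract.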
Keywords/Search Tags: two-stage hierarchical, text classification, neighbor-assist strategy, confidence level