Research And Implementation Of Chinese Text Categorization Methods Based On Tree-like Keywords Set

Posted on: 2016-10-31
Degree: Master
Type: Thesis
Country: China
Candidate: H Q Lian
Full Text: PDF
GTID: 2308330479993911
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of information technology, the information produced across industries is increasingly complex. The resulting “data explosion” makes it difficult for users to obtain the information they want quickly and accurately. Text categorization is an effective way to address this problem. In real-world applications, Chinese text categories usually form a tree-like hierarchical structure. A document is assigned to one or more of the most probable categories according to its content, which provides more comprehensive and accurate category information.

This thesis studies text categorization techniques. One key step is feature selection: the ability to select strongly discriminative features for dimensionality reduction directly affects classification accuracy. The traditional mutual information feature selection method, which does not perform well in low-dimensional space, is therefore improved in terms of word frequency and category distribution. Experiments show the superiority of the improved mutual information algorithm.

The thesis also studies hierarchical text categorization based on a tree-like keyword set. Each path is composed of category keywords, so the tree forms a hierarchical category tree. An independent feature selection model is built for every category node, and the classifier at each node is trained independently. Two categorization modes, hard-decision and soft-decision, are studied, and experiments compare the performance of the tree-like hard-decision method with that of the flat categorization method. The soft-decision method uses a threshold to balance accuracy against time cost. When a document is classified, the node most similar to the document is retained, and every other node is compared with this highest similarity value. If the ratio of a node's similarity to the highest value equals or exceeds the threshold, that node is also retained; otherwise it is rejected. Finally, the categories with the higher posterior probabilities at the retained leaf nodes are chosen for the document. The soft-decision method supports both single-label and multi-label categorization.

Finally, the thesis implements an automatic hierarchical text categorization system that supports flat categorization, tree-like hierarchical hard-decision categorization, and soft-decision categorization. An expert training module is also provided for system maintenance.
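
The soft-decision rule described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The `Node` class, the `soft_decision` function, the example category tree, and the similarity scores are all hypothetical. The sketch also assumes that scores are non-negative and that, at each level, all children of the retained nodes are compared against the single best score on that level; the abstract does not say whether the comparison is per level or per sibling group.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """One category node in a hypothetical category tree."""
    name: str
    children: list["Node"] = field(default_factory=list)


def soft_decision(root: Node, score: Callable[[Node], float],
                  threshold: float, top_k: int = 1) -> list[str]:
    """Descend the tree level by level and return up to top_k leaf names.

    A node is retained when its score is at least `threshold` times the
    best score among the candidates on its level (assumption: scores >= 0).
    """
    frontier = [root]
    leaves: list[tuple[float, str]] = []
    while frontier:
        # Candidates are the children of every node retained so far.
        candidates = [c for n in frontier for c in n.children]
        if not candidates:
            break
        scored = [(score(c), c) for c in candidates]
        best = max(s for s, _ in scored)
        frontier = []
        for s, node in scored:
            if s >= threshold * best:       # ratio test, no division needed
                if node.children:
                    frontier.append(node)   # keep descending
                else:
                    leaves.append((s, node.name))
    # Choose the retained leaves with the highest scores.
    leaves.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in leaves[:top_k]]


# Hypothetical tree and per-node scores for one document.
root = Node("root", [
    Node("Sports", [Node("Football"), Node("Basketball")]),
    Node("Finance", [Node("Stocks"), Node("Banking")]),
])
scores = {"Sports": 0.6, "Finance": 0.5, "Football": 0.7,
          "Basketball": 0.2, "Stocks": 0.65, "Banking": 0.1}


def score(node: Node) -> float:
    return scores[node.name]


print(soft_decision(root, score, threshold=0.8, top_k=1))  # single-label
print(soft_decision(root, score, threshold=0.8, top_k=2))  # multi-label
print(soft_decision(root, score, threshold=1.0, top_k=2))  # hard-decision-like
```

In this sketch, a threshold of 1.0 retains only the best-scoring node on each level, which behaves like a hard decision. Lower thresholds explore more branches at a higher time cost, which is the trade-off the abstract attributes to the threshold.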
Keywords/Search Tags: automatic text categorization, feature selection, tree-like categorization, hard-decision, soft-decision