Research And Implementation Of Chinese Text Categorization Methods Based On Tree-like Keywords Set

Posted on:2016-10-31

Degree:Master

Type:Thesis

Country:China

Candidate:H Q Lian

GTID:2308330479993911

Subject:Computer system architecture

Abstract/Summary:

With the rapid development of information technology, the information produced across industries is growing ever more complex. This “data explosion” makes it hard for users to obtain the information they want quickly and accurately. Text categorization is an effective way to address this problem. In real-world applications, Chinese text categories usually form a tree-like hierarchical structure. A document is assigned to the one or more most probable categories according to its content, which provides more comprehensive and accurate information about the categories it belongs to.

This thesis studies text categorization techniques. One key step is feature selection: the ability to select strongly discriminative features for dimensionality reduction directly affects classification accuracy. The traditional mutual information feature selection method, which does not perform well in low-dimensional space, is therefore improved in terms of word frequency and category distribution. Experiments show that the improved mutual information algorithm compares favorably with the traditional method.

The thesis also studies hierarchical text categorization based on a tree-like keywords set. Every path is composed of category keywords, so the tree forms a hierarchical category tree. An independent feature selection model is built for every category node, so that the classifier at each node is trained independently. Two categorization modes, hard-decision and soft-decision, are studied. Experiments compare the performance of the tree-like hard-decision categorization method with that of the flat categorization method. The soft-decision method uses a threshold to balance accuracy against time cost. When a document is classified, the node most similar to the document is retained, and every other node is compared against this highest similarity value. If the ratio between the two values equals or exceeds the set threshold, that node is also retained; otherwise it is rejected. Finally, the categories with the higher posterior probabilities at the retained leaf nodes are chosen for the document. The soft-decision method supports both single-label and multi-label categorization.

Finally, the thesis implements an automatic hierarchical text categorization system that supports flat categorization, tree-like hierarchical hard-decision categorization, and soft-decision categorization. An expert training module is also provided for system maintenance.
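
The soft-decision rule described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `Node` class, the `soft_decision` function, the toy category tree, and the scores are all hypothetical. It also assumes that the ratio is taken as a node's score divided by the best sibling's score, that scores are non-negative, and that the same score serves as the posterior probability at the leaves.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """A category node; a node without children is a leaf category."""
    name: str
    children: List["Node"] = field(default_factory=list)


def soft_decision(root: Node, score: Callable[[str], float],
                  threshold: float, top_k: int = 1) -> List[str]:
    """Return up to top_k leaf categories, highest score first.

    At every internal node the best-scoring child is kept, and any other
    child is kept when score / best_score >= threshold (assumed ratio).
    """
    reached: List[Tuple[str, float]] = []
    frontier = [root]
    while frontier:
        node = frontier.pop()
        if not node.children:
            # Leaf reached: record it with its (assumed) posterior score.
            reached.append((node.name, score(node.name)))
            continue
        scores = [score(child.name) for child in node.children]
        best = max(scores)
        for child, s in zip(node.children, scores):
            if s == best or (best > 0 and s / best >= threshold):
                frontier.append(child)
    reached.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in reached[:top_k]]


# Hypothetical two-level category tree and made-up scores for one document.
tree = Node("root", [
    Node("sports", [Node("football"), Node("basketball")]),
    Node("finance", [Node("stocks"), Node("banking")]),
])
toy_scores = {"sports": 0.6, "finance": 0.5, "football": 0.7,
              "basketball": 0.2, "stocks": 0.65, "banking": 0.6}


def toy_score(name: str) -> float:
    return toy_scores.get(name, 0.0)


single_label = soft_decision(tree, toy_score, threshold=0.8, top_k=1)
multi_label = soft_decision(tree, toy_score, threshold=0.8, top_k=2)
hard_like = soft_decision(tree, toy_score, threshold=1.0, top_k=2)
```

In this sketch a threshold of 1.0 keeps only the best child at each level, which behaves like a hard decision, while lower thresholds explore more branches at a higher time cost. Setting `top_k` to 1 gives single-label output, and a larger `top_k` gives multi-label output.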

Keywords/Search Tags:

automatic text categorization, feature selection, tree-like categorization, hard-decision, soft-decision

Related items

1	Research On Web Chinese Text Automatic Categorization Based On Rs-svm
2	Research On Web Chinese Text Automatic Categorization Based On RS-SVM
3	Research And Implementation Of The Automatic Chinese Text Categorization
4	Multi-class Scientific Literature Automatic Categorization System
5	Studies On Some Essential Problems In Automatic Text Categorization
6	A Study On Text Categorization Based On Machine Learning
7	The Research And Implementation Of Automatic Text Categorization For Chinese Web Documents
8	Research On Text Categorization Based On LDA And SVM
9	Research On Automatic Text Categorization System Based On Neuron Network
10	The Research Of Text Representation And Feature Selection In Text Categorization