Research On Hierarchical Classification Methods For Chinese Texts And The Related Application

Posted on: 2014-08-20
Degree: Master
Type: Thesis
Country: China
Candidate: Z K Kong
Full Text: PDF
GTID: 2268330425956145
Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of the Internet, the amount of information people must process has multiplied accordingly. Faced with large volumes of heterogeneous information from the Internet, such as text, sound, and video, people need systematic models for acquiring and organizing information and knowledge. Information processing technology aims to filter useful information out of this mass of data, and it has advanced considerably over the past decade. The task of text classification is to assign a text to one or more predefined classes or categories according to its similarity to the texts already in those categories. Common methods for text classification include machine learning and statistical methods.
However, everyday text categories are organized into hierarchical levels rather than at random, a fact that traditional classification methods neglect. A tree data structure, which models a hierarchy as a set of linked nodes, supports visual browsing and searching and reflects the semantic relationships between texts in documents. Classification over such a tree proceeds node by node, starting at the root node, the node that has no parent. The text to be classified is compared with the nodes on the various branches, and classification ends when the text has been assigned to the most similar branch node or nodes.
Text classification technology has progressed from purely rule-based methods to purely statistical methods, and eventually to the present combination of the two. The vector space model (or term vector model), an algebraic model in current use, represents text documents as vectors of identifiers, but its focus on the morphological form and structure of texts overshadows the semantic relationships between them.
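The top-down procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the thesis's method: the `Node` class, the `classify` function, the raw term-frequency vectors, cosine similarity as the comparison measure, and the toy category tree are all hypothetical choices made for the example.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def term_vector(text: str) -> Counter:
    """Represent a text as a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Node:
    label: str
    # Summed term vector of the texts filed under this node (assumed profile).
    profile: Counter = field(default_factory=Counter)
    children: list["Node"] = field(default_factory=list)


def classify(root: Node, text: str) -> list[str]:
    """Start at the root and move to the most similar child at each level."""
    vec = term_vector(text)
    path, node = [], root
    while node.children:
        node = max(node.children, key=lambda c: cosine(vec, c.profile))
        path.append(node.label)
    return path


# Toy hierarchy (hypothetical categories and vocabulary).
football = Node("football", term_vector("goal striker penalty league goal"))
tennis = Node("tennis", term_vector("serve racket set match ace"))
sports = Node("sports", football.profile + tennis.profile, [football, tennis])
hardware = Node("hardware", term_vector("cpu memory chip board"))
software = Node("software", term_vector("compiler bug code release"))
tech = Node("tech", hardware.profile + software.profile, [hardware, software])
root = Node("root", children=[sports, tech])

print(classify(root, "late penalty goal decides the league"))
# → ['sports', 'football']
```

Because the comparison uses surface word overlap only, two texts on the same subject with no shared vocabulary score zero, which is the weakness of the vector space model that the abstract points out.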
Using Latent Semantic Indexing and the Hidden Markov Model, this paper analyzes the hidden semantic information in texts in order to classify them hierarchically. The contents of this paper include:
(1) Introduction to the basic methods and key techniques of hierarchical text classification. Based on a review of related research at home and abroad, the paper identifies shortcomings of the technology in practical application, namely the undervaluing of semantic relationships between texts and the strong impact of noise on classification results.
(2) Proposal of a hierarchical text classification method that takes hidden semantic relationships between texts into account. The proposed method assigns a theme to each category, since terms containing words relevant to the theme outweigh others. Gibbs sampling, a Markov chain Monte Carlo (MCMC) algorithm, is used to obtain a sequence of observations approximately drawn from a specified multivariate probability distribution over themes. Texts are classified by means of these probabilistic topics, and the topic category labels are used to construct the latent semantic classification model. The Latent Semantic Indexing approach put forward in this paper explores the effect of topic labels in hierarchical text classification, and the experimental results suggest that it substantially increases classification precision.
(3) A text classification procedure based on an improved Hidden Markov Model. In hierarchical text classification, theme categories are differentiated according to predefined hierarchical relationships. The method divides the task into sub-problems, builds a corresponding classifier for each theme category, and finally resolves these sub-problems, level by level, via the sub-classifiers.
In this tree structure, a text to be classified is compared only with the texts under nodes of the same level and the same branch, within a given topic category. On this basis, the paper proposes the construction of a Hidden Markov Model-based sub-classifier and its classification procedure.
(4) Application of hierarchical text classification theory to the evaluation of information related to cybercrime, and proposal of a prototype of this application. In addition, the paper offers a theoretical framework for investigating computer crimes based on the semantic web, and defines rules for the actual construction of the prototype in the future.
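The abstract does not describe how the improved Hidden Markov Model of item (3) is built, so the following is only a generic sketch of how an HMM-based sub-classifier could choose among sibling categories: one discrete HMM per category, scored with the standard scaled forward algorithm, with the highest-likelihood category winning. The `HMM` class, the `classify_among_siblings` function, the `floor` probability for unseen words, and all hand-set parameters are assumptions for illustration; in practice the parameters would be estimated from training texts.

```python
import math
from dataclasses import dataclass


@dataclass
class HMM:
    start: list[float]            # start[i] = P(state i at the first word)
    trans: list[list[float]]      # trans[i][j] = P(next state j | state i)
    emit: list[dict[str, float]]  # emit[i][w] = P(word w | state i)
    floor: float = 1e-6           # assumed score for unseen words (not normalized)

    def log_likelihood(self, words: list[str]) -> float:
        """Scaled forward algorithm: returns log P(words | model)."""
        n = len(self.start)
        log_p, alpha = 0.0, []
        for t, w in enumerate(words):
            if t == 0:
                alpha = [self.start[j] * self.emit[j].get(w, self.floor)
                         for j in range(n)]
            else:
                alpha = [sum(alpha[i] * self.trans[i][j] for i in range(n))
                         * self.emit[j].get(w, self.floor)
                         for j in range(n)]
            scale = sum(alpha)               # P(word t | words before t)
            log_p += math.log(scale)
            alpha = [a / scale for a in alpha]
        return log_p


def classify_among_siblings(models: dict[str, HMM], text: str) -> str:
    """Pick the sibling category whose HMM gives the text the highest likelihood."""
    words = text.lower().split()
    return max(models, key=lambda label: models[label].log_likelihood(words))


# Hypothetical sibling categories under one parent node, with hand-set parameters.
siblings = {
    "fraud": HMM(
        start=[0.6, 0.4],
        trans=[[0.7, 0.3], [0.4, 0.6]],
        emit=[{"fake": 0.4, "invoice": 0.3, "payment": 0.3},
              {"bank": 0.5, "account": 0.5}],
    ),
    "intrusion": HMM(
        start=[0.5, 0.5],
        trans=[[0.6, 0.4], [0.5, 0.5]],
        emit=[{"malware": 0.4, "exploit": 0.3, "port": 0.3},
              {"server": 0.5, "password": 0.5}],
    ),
}

print(classify_among_siblings(siblings, "fake invoice bank account"))
# → fraud
```

Applying one such set of sibling models at each level of the tree restricts every comparison to nodes on the same level and branch, as described in item (3).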
Keywords/Search Tags: hierarchical text classification, feature extraction, Latent Semantic Indexing, probability theme, Hidden Markov Model