
Research and Application of Multi-hierarchy Text Classification Technology

Posted on: 2012-04-09
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Yuan
GTID: 2178330332486037
Subject: Computer application technology
Abstract/Summary:
As the volume of information in every industry grows exponentially, large amounts of electronic data must be classified by content so that they can be organized and managed. This places higher demands on automatic text classification, which has gradually moved from the experimental stage to practical application. Current research and applications concentrate largely on flat text classification, in which predefined categories are treated in isolation and no structure defines the relationships among them. In practice, however, multi-layer (hierarchical) text classification reflects the relationships among texts more faithfully and locates each text more precisely. It also decomposes one large classification problem into several smaller ones, which reduces time and space complexity, so that even a relatively simple classification algorithm can still obtain good results.
The public security system handles numerous and varied cases with clear relationships among them, and the same case can be assigned to multiple categories depending on the angle from which it is viewed. For example, a traffic accident can be classified as an ordinary traffic accident case or as a case of intentional assault; classifying a case therefore requires considering both the motive and the outcome. This thesis therefore focuses on multi-layer text classification and designs a multi-layer text classifier based on the vector space model (VSM) to automatically classify massive numbers of police cases.
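The decomposition of one large classification problem into smaller ones can be illustrated with a minimal top-down sketch. Everything concrete here is an assumption made for illustration: the category tree `TREE`, the centroid vectors `CENTROIDS`, and the example terms are hypothetical, and cosine similarity stands in for the thesis's association degree, which the abstract does not define.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical two-level hierarchy: parent -> list of child categories.
TREE = {
    "root": ["traffic", "assault"],
    "traffic": ["traffic/ordinary", "traffic/hit_and_run"],
    "assault": [],
}

# Hypothetical category centroids (term -> weight). Plain 1.0 weights are
# placeholders; the thesis uses its own DF/MI-based weighting.
CENTROIDS = {
    "traffic": {"vehicle": 1.0, "road": 1.0, "collision": 1.0},
    "assault": {"injury": 1.0, "intent": 1.0, "weapon": 1.0},
    "traffic/ordinary": {"collision": 1.0, "road": 1.0, "insurance": 1.0},
    "traffic/hit_and_run": {"collision": 1.0, "fled": 1.0, "vehicle": 1.0},
}

def classify(doc_vec, tree, centroids, node="root"):
    """Descend the tree, choosing the most similar child at each level.

    Each step only compares the document with the children of the current
    node, so every decision is a small flat classification problem.
    """
    path = []
    while tree.get(node):
        node = max(tree[node], key=lambda c: cosine(doc_vec, centroids[c]))
        path.append(node)
    return path

EXAMPLE_DOC = {"vehicle": 1.0, "collision": 1.0, "fled": 1.0}
```

Calling `classify(EXAMPLE_DOC, TREE, CENTROIDS)` returns the path from the top-level category down to the leaf, here `['traffic', 'traffic/hit_and_run']`. A greedy descent like this cannot recover from a wrong choice at an upper level, which is one reason error analysis matters for hierarchical classifiers.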
The specific research work is as follows:
(1) We analyze the characteristics of case information and combine several techniques, including word segmentation, feature extraction, dimension reduction, and text representation, to transform case text into data that can be used directly for text classification.
(2) To capture the characteristics of each category for the classification task, we propose a feature selection method for multi-layer text classification based on multi-feature extraction.
(3) Following the idea of the center-vector text classification algorithm, we represent both texts and categories (sets of texts of one kind) in the VSM and propose a classification algorithm that performs multi-layer classification by calculating the association degree between a text and a category. The thesis gives the method for calculating the association degree, analyzes the VSM weight formula in depth, and proposes a new weight formula that combines document frequency (DF) and mutual information (MI). Compared with the TF-IDF weight formula, the new formula performs better in text classification.
(4) We analyze the shortcomings of applying flat text classification performance measures directly to hierarchical text classifiers and propose a performance evaluation based on the distribution and distance of misclassification errors. Combining this evaluation method with flat classification performance measures not only assesses the performance of multi-layer text classification more accurately but can also guide classifier training to further improve classification performance.
(5) Case information texts overlap heavily, which weakens classification for certain categories. To address this, we propose a confusion class identification technique based on a clustering algorithm. Eliminating the features shared by the confusable categories then improves classifier performance.
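One way to picture the distance-based evaluation in item (4) is to measure how far apart the true and predicted categories sit in the hierarchy. The sketch below is an assumption, not the thesis's definition: the abstract does not specify its distance measure, so the edge count between two nodes is used here, and the slash-separated category labels are hypothetical.

```python
from collections import Counter

def tree_distance(true_label, predicted_label):
    """Number of tree edges between two categories.

    Labels are slash-separated paths from the root, e.g. "traffic/ordinary".
    A distance of 0 means the prediction is correct.
    """
    a = true_label.split("/")
    b = predicted_label.split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Edges up from the true node to the shared ancestor, then down again.
    return (len(a) - common) + (len(b) - common)

def error_distance_distribution(pairs):
    """Count (true, predicted) pairs by tree distance."""
    return Counter(tree_distance(t, p) for t, p in pairs)
```

A flat accuracy score treats every error alike, whereas this distribution separates near misses (a sibling category, distance 2) from errors that land in a different top-level branch (distance 4 in a two-level tree).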
Keywords/Search Tags: Multi-hierarchical Classification, Classification Algorithm, Multi-feature Selection, Classifier Performance Evaluation, Confusion Class Discrimination