Research On Text Categorization Method Oriented To Content Security

Posted on: 2008-11-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B F Zhang
GTID: 1118360242999237
Subject: Computer Science and Technology
Abstract/Summary:
With the development of Internet application technology, problems caused by the abuse of information technology in politics, the economy, the military, society, culture and other areas have drawn increasing attention. Content security has become one of the basic issues in information security. Text categorization is a powerful means and a key technique for information organization, management, recognition and filtering, and the needs of Internet content security pose new challenges for it.

To ensure the security of information, abnormal content must be monitored efficiently and responded to in time, so fast, real-time inspection of passing text is necessary. Because Internet content is varied and changes frequently, it is difficult, perhaps impossible, to provide enough labeled samples of interest for training a classifier. This is the bottleneck in constructing a classification system. Semi-supervised learning, which trains with a few labeled samples and many unlabeled ones, has therefore become a research hotspot. The variety of content and the overlap between topics also mean that monitors from different domains attend to similar or even identical content. Multi-label learning, which addresses the problem of an instance belonging to more than one class, has emerged as a new research area.

Motivated by the requirements of Internet content security, this dissertation studies three problems: efficient training and prediction for text categorization, semi-supervised learning to alleviate the labeling bottleneck, and multi-label text categorization. The main work and contributions of this dissertation are as follows:

1. Efficient multi-class SVM learning method. A multi-class method that cascades Rocchio with SVM is proposed. The Rocchio classifier filters out most of the irrelevant classes and greatly reduces the number of judgments the SVM has to make. In the experiments, the cascade reduced the testing time of both the one-vs-one and the one-vs-rest methods by an order of magnitude. A concise class-incremental multi-class SVM method, CI-SVM, is also presented. In the experiments, this method reduced training time and also improved testing efficiency significantly.

2. Enhancement of the naive Bayes classifier under a class hierarchy. The performance of naive Bayes text categorization depends heavily on the global class distribution of the subjectively selected samples. It can be enhanced by exploiting the hierarchical structure and introducing conditional probabilities. The enhanced method makes each decision on the local data belonging to the child classes of an internal class, which lessens the influence of the global data distribution and partially overcomes the problem of data skewness. Experiments showed that the enhanced method notably improved the effectiveness of hierarchical categorization with naive Bayes.

3. Semi-supervised learning method based on the integration of self-training and EM. A method is proposed that integrates the training process of EM, which adjusts the label status of samples conservatively, with self-training, which labels samples directly. Two semi-supervised learning methods, ESTM and SEMT, are provided. ESTM decisively labels some samples using the intermediate results of the EM iterations, and SEMT replaces the supervised naive Bayes classifier with the semi-supervised EM method. Experiments demonstrated that ESTM and SEMT combine the advantages of self-training and EM and gain considerably more improvement from unlabeled samples.

4. Feature set splitting for co-training in text categorization. This dissertation gives a quantitative definition of the conditional independence of feature subsets given the class and suggests a strategy for splitting the feature set locally in this sense. It is also proven that this independence is preserved when two groups of feature sets are merged. Two feature set splitting methods that work under the independence condition are proposed, based respectively on locally adaptive clustering and on relevancy graph partitioning. Applications to two data sets show that, with the feature divisions produced by these methods, the combined effectiveness of the co-trained naive Bayes classifiers is improved by using the unlabeled samples. As a result, the applicability of the co-training method is extended.

5. Multi-label learning method based on the label status vector (LSV). A two-stage learning framework based on the label status vector is proposed. It re-mines the multi-label information carried by the relationships among label status values in the label status vector space (LSVS) produced by ranking methods. Under this framework, the dissertation presents the bag of labels (BOL) model in the kNN LSVS, proposes a Bayes method for that model, and improves the ML-kNN method. In the naive Bayes LSVS, linear least squares fit (LLSF) is provided for multi-label training and prediction. It is also proven that the squared loss of LLSF upper-bounds the Hamming training loss. Applications to 11 multi-label problems showed that the two-stage framework and the learning methods above were effective in multi-label classification.
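
The cascade idea in contribution 1 can be illustrated with a minimal sketch. It is not the dissertation's implementation: the toy term vectors, the class names, the `keep` parameter and the hand-set linear scorers (stand-ins for trained one-vs-rest SVM decision functions) are all assumptions made for illustration. The first stage ranks classes by cosine similarity to Rocchio centroids; the second stage is evaluated only for the classes that survive the filter.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rocchio_centroids(samples):
    """Build one unit-length centroid per class from (vector, label) pairs."""
    sums = {}
    for vec, label in samples:
        unit = normalize(vec)
        acc = sums.setdefault(label, [0.0] * len(unit))
        for i, x in enumerate(unit):
            acc[i] += x
    return {label: normalize(acc) for label, acc in sums.items()}

def cascade_predict(x, centroids, second_stage, keep=2):
    """Stage 1: keep the `keep` classes whose centroids are closest to x.
    Stage 2: run the expensive per-class scorer only on those candidates."""
    unit = normalize(x)
    ranked = sorted(centroids, key=lambda c: dot(unit, centroids[c]), reverse=True)
    candidates = ranked[:keep]
    scores = {c: second_stage[c](unit) for c in candidates}
    return max(scores, key=scores.get), candidates

# Toy training set: 4-dimensional term-frequency vectors (hypothetical data).
train = [
    ([3, 0, 0, 0], "sports"),  ([2, 1, 0, 0], "sports"),
    ([0, 3, 0, 0], "finance"), ([0, 2, 1, 0], "finance"),
    ([0, 0, 3, 0], "health"),  ([0, 0, 2, 1], "health"),
    ([0, 0, 0, 3], "travel"),  ([1, 0, 0, 2], "travel"),
]

calls = {"n": 0}  # counts second-stage evaluations

def make_scorer(weights):
    """Hypothetical stand-in for a trained one-vs-rest SVM decision function."""
    def score(unit_vec):
        calls["n"] += 1
        return dot(weights, unit_vec)
    return score

second_stage = {
    "sports":  make_scorer([1.0, -0.5, -0.5, -0.5]),
    "finance": make_scorer([-0.5, 1.0, -0.5, -0.5]),
    "health":  make_scorer([-0.5, -0.5, 1.0, -0.5]),
    "travel":  make_scorer([-0.5, -0.5, -0.5, 1.0]),
}

centroids = rocchio_centroids(train)
label, candidates = cascade_predict([2, 1, 0, 0], centroids, second_stage, keep=2)
print(label, candidates, calls["n"])
```

In this toy run the second stage is evaluated for two of the four classes. How many candidates to keep trades filtering speed against the risk of discarding the correct class, and the value used here is arbitrary.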
Keywords/Search Tags: information security, content security, support vector machine, hierarchical categorization, semi-supervised learning, co-training, multi-label learning, feature set splitting, label status vector space