
On Text Classification And Its Applications

Posted on: 2009-05-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X L Hao
Full Text: PDF
GTID: 1118360272489292
Subject: Computer software and theory
Abstract/Summary:
The Internet carries a wide variety of information, some of which, such as terrorist content, threatens national security. Traditional techniques that block information by IP address or topic are out of date; the state of the art is to monitor the content of the information itself. Because text is the main representation of information, many monitoring techniques depend on text understanding, and text classification and clustering are key techniques. The explosive growth of textual information poses new challenges and requires text understanding to be faster, more efficient, and more accurate. In this dissertation, three challenges in text categorization are explored: class imbalance, feature selection, and the annotation bottleneck. Several methods and techniques are presented to improve the speed and accuracy of classification. Topic detection and tracking, an important application of text classification and clustering, is also discussed. Our main contributions are as follows.

1. A strategy for dealing with class imbalance in kNN classification

Class imbalance is one of the problems that plague the data mining community. The performance of kNN, a widely used algorithm in text categorization, deteriorates when the training data are distributed unevenly among classes. When used in a text content security project, kNN classified almost all test samples of minority classes into majority classes. To overcome this defect, the critical point (CP) of the training set is proposed.
The traditional decision functions of kNN are revised using LA or UA, approximate values of CP. This is the so-called adaptive kNN with weight adjustment. Experiments on biased data sets show that adaptive kNN with weight adjustment outperforms both traditional kNN and random resampling.

2. Selection of training samples

The selection of training samples is vital when building a classifier. Atypical samples not only increase training time but also introduce noise into the training set. As an instance-based algorithm, the kNN classifier has large computational and storage requirements. Moreover, an imbalanced distribution of training data leads to poor kNN performance. To deal with these defects, the MultiEdit and Condensing algorithms are first modified, and then a sampling method based on feature selection and Condensing is proposed. First, several traditional feature selection methods are combined to form features for each class. Second, redundant cases are removed by combining the class features contained in each case with the Condensing algorithm. Extensive experiments show that the size of the training set decreases sharply, which reduces space and time costs and improves classification quality.

3. Semi-supervised text categorization

Semi-supervised categorization is a special kind of categorization. Traditional classifiers are trained only with labelled data, but labelling data is expensive and time-consuming: it is tedious work that requires experienced annotators, plenty of time, and special equipment. This is the so-called annotation bottleneck. Unlabelled data, by contrast, are easy to obtain and can be used in diverse ways. Semi-supervised learning algorithms build good classifiers from labelled data together with large amounts of unlabelled data, and thereby address the annotation bottleneck.
Because semi-supervised learning needs less manual work, it is important both in theory and in practice. After examining existing semi-supervised learning methods, we propose two-phase co-training based on kNN and SVM. Experiments show that the proposed method is effective.

We also discuss a practical application of text classification and clustering technology: topic detection and tracking oriented to BBS. From the point of view of text mining, topic detection is similar to text clustering, and topic tracking is similar to text categorization. Topic detection and tracking (TDT) aims to organize and present multi-language news from various news agencies by topic. This technique is essential in applications such as automatically monitoring information sources (for instance, radio and TV) and recognizing unexpected events, new events, and new information about existing events. It can be widely used in information security and in the analysis of the securities business. In addition, TDT can be used to retrieve all the news a user is interested in and to trace the evolution of a specific topic. Based on a survey of TDT, we develop a TDT system oriented to BBS. We apply the above results in a prototype system for text content security.
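
To make the class-imbalance problem in kNN concrete, the following minimal sketch compares plain similarity-weighted kNN voting with a reweighted vote. It is not the CP-based adaptive kNN described in the abstract, whose LA and UA approximations are not specified here; the inverse-class-size weight is a generic stand-in chosen only for illustration. The function names, the toy data, and the labels "sports" and "threat" are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, query, k=3, reweight=False):
    """Classify `query` by a similarity-weighted vote of its k nearest neighbours.

    train: list of (term-weight dict, label) pairs.
    reweight: if True, divide each vote by the size of the neighbour's class.
              This is an assumed, generic correction, not the CP-based one.
    """
    class_sizes = Counter(label for _, label in train)
    # Rank all training documents by similarity to the query, keep the top k.
    nearest = sorted(
        ((cosine(vec, query), label) for vec, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for sim, label in nearest:
        weight = 1.0 / class_sizes[label] if reweight else 1.0
        votes[label] += sim * weight
    return max(votes, key=votes.get)


# Toy, hand-made training set: six majority documents, two minority documents.
TRAIN = [
    ({"attack": 1, "bomb": 1}, "threat"),
    ({"bomb": 1, "plot": 1}, "threat"),
    ({"match": 1}, "sports"),
    ({"match": 2}, "sports"),
    ({"match": 1, "team": 1}, "sports"),
    ({"match": 1, "goal": 1}, "sports"),
    ({"team": 1, "goal": 1}, "sports"),
    ({"goal": 1}, "sports"),
]
QUERY = {"attack": 2, "match": 1}

if __name__ == "__main__":
    print("plain kNN:     ", knn_predict(TRAIN, QUERY))
    print("reweighted kNN:", knn_predict(TRAIN, QUERY, reweight=True))
```

On this toy set the single closest document belongs to the minority class, but two majority documents among the three nearest neighbours outvote it under the plain rule. Scaling each vote by the inverse class size reverses the decision, which is the kind of effect a weight adjustment is meant to achieve.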
Keywords/Search Tags: Text Classification, Topic Detection and Tracking (TDT), Class Imbalance, Information Filtering, Sample Selection, Semi-supervised Learning (SSL), BBS