
Research on Text Classification Based on Support Vector Machine

Posted on: 2010-10-26
Degree: Master
Type: Thesis
Country: China
Candidate: B Fan
Full Text: PDF
GTID: 2178360275980496
Subject: Computer application technology
Abstract/Summary:
The Internet, as an open information space, has developed rapidly in recent years and has become an effective platform for transmitting and processing information. However, because the volume of information on the Internet keeps growing rapidly, it is difficult for people to find the information they need quickly and efficiently. To help users locate the information they need and make effective use of it, that information must be classified, organized and managed. Text accounts for a large proportion of web information, so research on automatic text classification is especially important.

Statistical learning theory studies the rules of machine learning specifically for the small-sample case. The support vector machine is a machine learning method built on this theory. It overcomes many shortcomings of both neural network classification and traditional statistical classification, and offers better generalization performance.

This thesis follows the process of automatic text classification as its main thread. Building on an in-depth study of text representation, feature extraction, feature reconstruction and classification algorithms, it presents a web page classification algorithm based on the least squares support vector machine and latent semantic analysis.

First, we study feature extraction for web pages. Web page data are semi-structured, unlike plain text data. In web page representation, the weight of each feature is influenced by two factors: the frequency with which the feature appears in the HTML document, and the position of the feature within that document. Building on existing text feature extraction algorithms, we improve web page feature extraction and weighting to account for these particular properties of web page features.

Latent semantic analysis uses singular value decomposition to obtain the latent semantic structure of the term-document matrix, which to a certain extent resolves the problems of polysemy and synonymy. The least squares support vector machine learns efficiently on large datasets, especially when labeled samples are costly to obtain. Adopting the novel web page feature weighting algorithm and using a summarization algorithm to remove web page noise improves the accuracy of web page classification.

Finally, we collect a Chinese corpus of 12,684 documents from the Internet, of which 9,000 are used for training and 3,684 for testing. The experiments verify the algorithm, which achieves better classification results, and show that the improvements to the algorithm are effective.
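The pipeline described in the abstract (position-aware term weighting, latent semantic analysis by singular value decomposition, then a least squares support vector machine) can be illustrated with a minimal sketch. This is not the thesis's implementation. The tag weights, the toy English vocabulary and documents, the linear kernel, the regularization value `gamma`, and all function names are assumptions made for illustration, and the summarization step for noise removal is not covered.

```python
import re
import numpy as np

# Assumed position weights: a term in <title> counts more than one in the body.
TAG_WEIGHTS = {"title": 3.0, "body": 1.0}
VOCAB = {"goal": 0, "match": 1, "stock": 2, "bank": 3}  # toy vocabulary

def weighted_term_counts(sections, vocab=VOCAB):
    """Combine frequency and position: each occurrence adds its tag's weight."""
    vec = np.zeros(len(vocab))
    for tag, text in sections.items():
        weight = TAG_WEIGHTS.get(tag, 1.0)
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in vocab:
                vec[vocab[token]] += weight
    return vec

def lsa_fit(X, k):
    """Truncated SVD of the documents-by-terms matrix X.
    Returns the terms-by-k projection and the projected documents."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Vk = Vt[:k].T
    return Vk, X @ Vk

def lssvm_train(Z, y, gamma=10.0):
    """Linear-kernel LS-SVM classifier. Training solves one linear system:
        [0   y^T              ] [b    ]   [0]
        [y   Omega + I / gamma] [alpha] = [1]
    with Omega_ij = y_i * y_j * <z_i, z_j>."""
    n = len(y)
    omega = np.outer(y, y) * (Z @ Z.T)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    solution = np.linalg.solve(A, rhs)
    return solution[0], solution[1:]  # bias b, multipliers alpha

def lssvm_predict(Z_train, y, alpha, b, Z_new):
    """Decision rule: sign(sum_i alpha_i * y_i * <z, z_i> + b)."""
    return np.sign((Z_new @ Z_train.T) @ (alpha * y) + b)

# Toy pages: two about sports (+1) and two about finance (-1).
pages = [
    {"title": "match goal", "body": "goal goal match"},
    {"title": "goal", "body": "match match"},
    {"title": "stock bank", "body": "bank stock stock"},
    {"title": "bank", "body": "stock bank"},
]
y = np.array([1.0, 1.0, -1.0, -1.0])

X = np.vstack([weighted_term_counts(p) for p in pages])
Vk, Z = lsa_fit(X, k=2)
b, alpha = lssvm_train(Z, y)

# Unseen pages are projected with the same Vk before classification.
new_pages = [
    {"title": "goal", "body": "match"},
    {"title": "stock bank", "body": ""},
]
Z_new = np.vstack([weighted_term_counts(p) for p in new_pages]) @ Vk
pred = lssvm_predict(Z, y, alpha, b, Z_new)
```

Because the LS-SVM replaces the inequality constraints of the standard SVM with equalities, training reduces to a single call to `np.linalg.solve` rather than a quadratic program. A real system for a Chinese corpus would also need word segmentation in place of the regular-expression tokenizer used here.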
Keywords/Search Tags: text classification, support vector machine, feature selection, feature reconstruction, transductive learning algorithm