
Research On Text Classification Of Web Text Mining

Posted on: 2008-09-22
Degree: Master
Type: Thesis
Country: China
Candidate: J B Tang
GTID: 2178360215979844
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, the era of information explosion has arrived. Because the Internet is open, dynamic, and heterogeneous, it is increasingly difficult for users to find what they need on the WWW quickly and accurately, and how to obtain valuable information has become a research hotspot. Web text mining is one approach to this problem: it applies data mining methods to discover latent, valuable knowledge in large amounts of semi-structured, heterogeneous data.
Text classification is an important web text mining technique, and with the rapid growth of Chinese web content, Chinese text classification has gradually become a hot topic in web mining. The essential technologies of text mining include word segmentation, text representation, preprocessing, weight computation, feature selection, and classification algorithms. The quality of feature selection greatly affects both the accuracy and the efficiency of text categorization. TFIDF is a principal feature selection method in text classification, but traditional TFIDF does not consider how feature words are distributed among classes. This paper analyzes the TFIDF feature selection algorithm and proposes a new TFIDF feature selection method that incorporates the concept of information entropy. Experimental results show that the method improves the accuracy of text categorization, with satisfactory precision and recall.
Many techniques have been applied to text categorization, such as nearest neighbor, Bayesian classifiers, decision trees, support vector machines, the vector space model, and neural networks. However, these common methods have some shortcomings.
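The abstract does not give the formula for its entropy-based TFIDF, so the following is only a minimal sketch of one plausible reading, not the thesis's method. It assumes that each term's TFIDF weight is scaled by a factor of 1 - H(t)/log(K), where H(t) is the entropy of the term's occurrences over the K classes. The function names, the smoothed IDF form, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def class_entropy_factor(docs, labels):
    """Return {term: 1 - H(term) / log(K)} (assumed formula, not from the thesis).

    H(term) is the entropy of the term's occurrence counts over the K classes.
    A term confined to one class gets 1.0; a term spread evenly gets 0.0.
    """
    n_classes = len(set(labels))
    per_class = defaultdict(Counter)  # term -> class label -> occurrence count
    for tokens, label in zip(docs, labels):
        for term in tokens:
            per_class[term][label] += 1
    # Guard against a single-class corpus, where log(K) would be zero.
    max_entropy = math.log(n_classes) if n_classes > 1 else 1.0
    factors = {}
    for term, counts in per_class.items():
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        factors[term] = 1.0 - entropy / max_entropy
    return factors


def entropy_tfidf(docs, labels):
    """Weight each term in each document by tf * idf * class-entropy factor."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    factors = class_entropy_factor(docs, labels)
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({
            # Smoothed IDF, log(N / df + 1), chosen here so it is never zero.
            term: (count / len(tokens))
            * math.log(n_docs / doc_freq[term] + 1.0)
            * factors[term]
            for term, count in tf.items()
        })
    return weighted


# Hypothetical toy corpus: "the" appears equally in both classes.
docs = [
    ["goal", "match", "the"],
    ["match", "team", "the"],
    ["stock", "market", "the"],
    ["market", "bank", "the"],
]
labels = ["sports", "sports", "finance", "finance"]
weights = entropy_tfidf(docs, labels)
```

In this toy corpus, "the" is spread evenly across both classes and so receives a weight of zero, while "goal" occurs only in the sports class and keeps its full TFIDF weight. Plain TFIDF would not make that distinction on class distribution.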
Many current text classification methods handle only web pages with a single topic, whereas many web pages cover several topics and should be assigned to several classes. This paper proposes a multi-topic web page classification method based on the vector space model: by comparing the page's similarity to each class with a dynamic threshold, it assigns a multi-topic web page to several different classes. Experimental results show that the method is valid.
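The multi-topic idea can be illustrated with a small vector space sketch. The abstract does not say how the dynamic threshold is computed, so the rule used here is an assumption for illustration only: the threshold is a fixed fraction (the hypothetical parameter `ratio`) of the page's best class similarity. The class vectors and function names are likewise made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_topics(page_vec, class_vecs, ratio=0.8):
    """Return every class whose similarity reaches ratio * best similarity.

    The threshold is "dynamic" only in the sense that it depends on the page's
    own best score; this rule is an assumption, not taken from the thesis.
    """
    sims = {label: cosine(page_vec, vec) for label, vec in class_vecs.items()}
    best = max(sims.values(), default=0.0)
    if best <= 0.0:
        return []  # the page shares no terms with any class
    threshold = ratio * best
    return sorted(label for label, s in sims.items() if s >= threshold)


# Hypothetical class vectors and pages.
class_vecs = {
    "sports": {"goal": 1.0, "team": 1.0},
    "finance": {"stock": 1.0, "bank": 1.0},
    "travel": {"hotel": 1.0, "flight": 1.0},
}
mixed_page = {"goal": 1.0, "team": 1.0, "stock": 1.0, "bank": 1.0}
single_page = {"hotel": 1.0}

mixed_topics = assign_topics(mixed_page, class_vecs)
single_topics = assign_topics(single_page, class_vecs)
```

Here the mixed page is equally similar to the sports and finance vectors, so both labels pass the threshold, while the single-topic page is assigned only to travel. A fixed global threshold would instead need tuning so that it neither rejects pages whose best similarity is low nor admits weak secondary classes.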
Keywords/Search Tags:Text Classification, Information Entropy, Multi-topic Text Classification, TFIDF, Feature Selection