Research On Key Technologies For Automatic Chinese Web Page Categorization

Posted on: 2007-05-13
Degree: Master
Type: Thesis
Country: China
Candidate: J Q Zou
GTID: 2178360182473216
Subject: Computer applications
Abstract/Summary:
With the rapid development and growing popularity of the World Wide Web, the amount of online electronic information is increasing exponentially, and people have moved from an age of information scarcity into one in which digitized information is extremely abundant. Faced with this vast amount of online information, it is hard to find genuinely useful information quickly and effectively. How to handle and organize it has therefore gradually become an important research subject. Traditionally, web documents have been classified manually, which is time-consuming and labor-intensive. Automatic text categorization has been proposed and studied to bring order to online information. Combined with information retrieval, search engine, and information filtering technologies, it has become one of the important tools for acquiring information on the Internet.

This paper briefly introduces automatic text categorization and, on that basis, studies several key technologies for Chinese web page categorization. First, it presents a formalized description of the filter-based feature selection approach; then, based on an analysis of the characteristics of several common feature selection approaches, it proposes a new feature selection method that uses multiple criteria. Second, because the vector space model cannot effectively express the structure of documents, it proposes a new document representation based on a graph model, together with a similarity measure for it, and applies it to Chinese web page categorization. Empirical results show that the graph model is feasible. Finally, real-world applications often require classifying web documents in the presence of noisy data, which support vector machines do not handle well on their own, so this paper presents a new noise-tolerant support vector machine and applies it to Chinese web page categorization.
Keywords/Search Tags: Automatic Chinese Web Page Categorization, Feature Selection, Graph Model, Noise-tolerance, Support Vector Machine