Chinese WEB Document Automatic Categorization

Posted on: 2006-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: B Chen
Full Text: PDF
GTID: 2168360152470633
Subject: Computer application technology
Abstract/Summary:
With information available in abundance on the Internet, the Internet is becoming a vital source of knowledge. However, there is so much information that finding the valuable parts efficiently is difficult. For this reason, organizing the information on the Internet is very important. Our research focuses on automatic text categorization of Chinese web documents, a core technology for Internet search engines. We developed software for automatic categorization of Chinese web documents, which serves as a base for future research.

This thesis first discusses the main methods of text categorization and then identifies the primary difficulties that arise from the characteristics of Chinese text. We designed Chinese web text categorization software consisting of a web spider module, a Chinese word splitter module, a feature selection module and a machine learning module. The primary functions and algorithms are discussed together with their Java source code. Finally, we conducted an experiment to test the accuracy with which the software categorizes Chinese web documents. The experimental results show that the software achieves high accuracy.

This thesis is composed of the following seven chapters.

Chapter 1: This chapter introduces text categorization technology. The particular features of Chinese web text are discussed, and the main research task is then proposed.

Chapter 2: This chapter describes the overall design of the automatic Chinese web text classifier. The primary function of each module is discussed, along with the new methods we propose.

Chapter 3: This chapter presents the principle of the web spider and its implementation in Java. The HTML parser, the most important technology in the web spider, is also discussed.

Chapter 4: This chapter describes the Chinese text splitter. Based on an analysis of various Chinese text splitting algorithms, we discuss how to improve the maximum-matching algorithm. A Chinese dictionary based on a hash table is also discussed.

Chapter 5: This chapter compares various feature selection algorithms and summarizes their advantages and disadvantages. We propose a new algorithm named DFTF (Document Frequency and Term Frequency) and give its implementation in Java source code.

Chapter 6: This chapter discusses the Naive Bayes machine learning method, in particular the algorithm for categorizing Chinese web documents with Naive Bayes. We then present how to implement such a classifier.

Chapter 7: This chapter presents how to evaluate the quality of the Chinese web document classifier. The experimental results show that the classifier achieves high categorization quality. We also summarize the strengths and shortcomings of this project and discuss how to improve the classifier in future research.
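
The Chapter 4 summary mentions a maximum-matching text splitter backed by a hash-table dictionary. The abstract does not give the thesis's own code or its improvements, so the following is only a minimal sketch of the plain forward maximum-matching technique. The class name `MaxMatchSegmenter`, the method `segment`, and the tiny sample dictionary are hypothetical, and the sketch assumes every character fits in a single Java `char`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: plain forward maximum matching, not the thesis's improved variant.
class MaxMatchSegmenter {
    private final Set<String> dictionary;   // hash-table dictionary (HashSet lookup)
    private final int maxWordLength;        // longest dictionary entry, in chars

    MaxMatchSegmenter(Set<String> words) {
        this.dictionary = new HashSet<>(words);
        int longest = 1;
        for (String w : words) {
            longest = Math.max(longest, w.length());
        }
        this.maxWordLength = longest;
    }

    // Splits text left to right, always taking the longest dictionary word
    // that starts at the current position; unknown characters become
    // single-character tokens.
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is in the dictionary or one char long.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample dictionary chosen for illustration only.
        Set<String> dict = new HashSet<>(Arrays.asList("中文", "网页", "自动", "分类", "文本"));
        MaxMatchSegmenter segmenter = new MaxMatchSegmenter(dict);
        System.out.println(segmenter.segment("中文网页自动分类")); // prints [中文, 网页, 自动, 分类]
    }
}
```

Each candidate lookup is a constant-time hash lookup on average, which is the usual reason for pairing maximum matching with a hash-based dictionary. The greedy left-to-right choice can mis-split ambiguous strings, which is presumably the kind of weakness an improved variant would address.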
Keywords/Search Tags: text categorization, Chinese text splitter, Naive Bayes machine learning