Chinese Pages Hierarchical Classification

Posted on: 2008-09-09
Degree: Master
Type: Thesis
Country: China
Candidate: F Gu
Full Text: PDF
GTID: 2208360215484758
Subject: Computer application technology
Abstract/Summary:
As the internet spreads rapidly in China, the number of Chinese webpages is growing at a tremendous rate, bringing us huge amounts of information. The urgent problem is how to manage and use that information at such a scale. Webpage classification has been one of the hot research topics in IT in recent years. Applying Chinese language processing and text classification techniques to the categorization of Chinese webpages, while taking the structural characteristics of those pages into account, is a meaningful and challenging research subject.

Most traditional Chinese text classification techniques preprocess text with a Chinese word segmentation system, which is poor at recognizing personal names, place names, and mixed Chinese-English words. While discussing the key techniques of text classification, this thesis presents a sequential data mining method as a substitute for the existing word segmentation system. We treat a text as a sequence of character strings with the single character as the unit, build a tree storage structure using an improved PAT-tree technique, and then mine high-frequency character strings as candidate terms by incorporating a net frequency computation. The experiments suggest that the sequential mining method is better at identifying personal names, place names, new words, and common verbal and nominal phrases in Chinese webpages.

Facing huge amounts of internet information, we need more categories to manage it, yet a flat (single-level) category set becomes too isolated, disorganized, and large. Multi-level classification, which organizes categories into a managed hierarchy, is clearly more efficient than flat classification for managing text. Compared with flat classification, hierarchical categorization has its own characteristics and technical problems.
Research on hierarchical categorization methods, both in China and abroad, has not yet gone very deep. This thesis proposes some possible solutions to the problems of hierarchical categorization and constructs its own hierarchical categorization model.

Finally, this thesis implements a webpage classification system based on sequential mining, combining it with the technical design proposed above. The experiments suggest that, at the same classification precision, the system using sequential mining classifies much faster than a traditional classification system.
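The abstract does not give the details of the mining step, so the following is only a minimal sketch of the general idea of extracting frequent character strings without a word segmenter. Several things here are assumptions and not the thesis's method: a plain dictionary of character n-gram counts stands in for the improved PAT-tree, "net frequency" is assumed to mean a string's raw count minus the largest count of any one-character extension of it, and the names `count_substrings`, `net_frequency`, `mine_candidates` and the thresholds are hypothetical.

```python
from collections import Counter


def count_substrings(text, max_len):
    """Count every character n-gram of length 1..max_len in text.

    A real system would likely use a PAT-tree over suffixes; a Counter
    is used here only to keep the sketch short.
    """
    counts = Counter()
    for start in range(len(text)):
        for length in range(1, max_len + 1):
            if start + length > len(text):
                break
            counts[text[start:start + length]] += 1
    return counts


def net_frequency(candidate, counts, alphabet):
    """Assumed definition: raw count minus the largest count of any
    string formed by adding one character to the left or right."""
    best_extension = 0
    for ch in alphabet:
        best_extension = max(
            best_extension,
            counts.get(ch + candidate, 0),
            counts.get(candidate + ch, 0),
        )
    return counts[candidate] - best_extension


def mine_candidates(text, min_len=2, max_len=4, min_net=2):
    """Return {string: net frequency} for strings that pass min_net."""
    # Count one character beyond max_len so extensions of the longest
    # candidates are also available.
    counts = count_substrings(text, max_len + 1)
    alphabet = set(text)
    result = {}
    for s in counts:
        if min_len <= len(s) <= max_len:
            net = net_frequency(s, counts, alphabet)
            if net >= min_net:
                result[s] = net
    return result


if __name__ == "__main__":
    sample = "我去北京。他在北京！你爱北京？"
    print(mine_candidates(sample))
```

On this toy input the place name 北京 occurs three times while each longer string containing it occurs once, so it is the only string that passes the threshold. Subtracting the extension count is what keeps fragments of a longer frequent string from being reported alongside it.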
Keywords/Search Tags: webpage classification, sequential mining, PAT-tree, net frequency, hierarchical categorization