The Research Of Chinese WEB Acquisition And Automatic Categorization

Posted on: 2008-08-14
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Wu
Full Text: PDF
GTID: 2178360215974393
Subject: Computer application technology
Abstract/Summary:
The Internet, which holds more and more information as it develops rapidly, is becoming an important source of knowledge in our work and life. However, because of its openness and heterogeneity, it is very difficult for users to obtain the information they need. How to handle and organize the vast amount of online information has therefore become an important research field. Traditionally, web pages were classified manually: after being read by a specialist, each page was assigned to one or more categories. With the fast growth of the web, however, manual classification has become impractical. To help users locate information more easily, more scholars have begun to study web classification technology.

This dissertation studies Chinese web acquisition and classification technology. The content is as follows:

1. This dissertation presents a web acquisition technique that integrates the Google Web API into a Java application to search for and acquire web pages, and uses regular expressions to find the other URLs in each page.
2. This dissertation describes the main procedures of web classification in detail. The web preprocessing process, including web page cleaning and Chinese word segmentation, is discussed. The various word segmentation technologies are then analyzed, and ICTCLAS is introduced.
3. Three Chinese text representation models are compared, and the vector space model is adopted in this dissertation. The dissertation then compares various feature selection algorithms. Because words of different lexical categories play different roles in a text, feature selection based on part of speech is proposed to reduce the dimension of the feature vector. This method improves the efficiency of feature selection by eliminating noise before the weights of the feature words are computed.
4. This dissertation then compares several text classification algorithms, introduces KNN in detail, and proposes a corresponding improvement that rebuilds the feature vector of the text to improve efficiency.
5. The Chinese web classifier proposed in this dissertation is evaluated. The experiments indicate that the proposed method improves the recall and efficiency of classification without reducing its precision.
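
The URL-discovery step in item 1 can be illustrated with a small sketch. The abstract does not give the pattern the dissertation actually used, so the class name `LinkExtractorSketch`, the method `extractUrls`, and the regular expression below are all assumptions made for illustration. The search step through the Google Web API is not shown; the sketch only pulls absolute http/https links out of page markup that has already been fetched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and pattern are illustrative, not taken from the dissertation.
class LinkExtractorSketch {

    // Matches href="..." or href='...' and captures an absolute http/https URL.
    // A regular expression is only an approximation of HTML parsing.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*[\"'](https?://[^\"'\\s>]+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the distinct absolute URLs in the order they first appear.
    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            if (!urls.contains(url)) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a.html\">A</a>"
                + "<a HREF='https://example.org/b'>B</a>"
                + "<a href=\"/relative\">relative link, skipped</a>"
                + "<a href=\"http://example.com/a.html\">duplicate</a>";
        for (String url : extractUrls(html)) {
            System.out.println(url);
        }
    }
}
```

Relative links are skipped here for brevity; a crawler would normally resolve them against the page URL before queuing them.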
Keywords/Search Tags: Web acquisition, Chinese word segmentation, feature selection, text classification algorithm
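
The feature selection and classification keywords correspond to a pipeline that can be sketched roughly as follows: drop words whose part of speech is unlikely to carry topic information, build a term-frequency vector in the vector space model, and classify with KNN using cosine similarity. This is plain KNN, not the improved variant the dissertation proposes, whose details the abstract does not give. The class `PosKnnSketch`, the `word/tag` input format, the tag letters, and the English placeholder words standing in for segmenter output are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: all names, tags, and data are illustrative assumptions.
class PosKnnSketch {

    static final class Token {
        final String word, pos;
        Token(String word, String pos) { this.word = word; this.pos = pos; }
    }

    static final class Example {
        final Map<String, Double> vector;
        final String label;
        Example(Map<String, Double> vector, String label) { this.vector = vector; this.label = label; }
    }

    // Parses whitespace-separated "word/tag" items (assumed segmenter output format).
    static List<Token> parse(String tagged) {
        List<Token> out = new ArrayList<>();
        for (String item : tagged.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            if (slash > 0) out.add(new Token(item.substring(0, slash), item.substring(slash + 1)));
        }
        return out;
    }

    // Part-of-speech filter followed by raw term-frequency counting.
    static Map<String, Double> toVector(List<Token> tokens, Set<String> keptPos) {
        Map<String, Double> tf = new HashMap<>();
        for (Token t : tokens) {
            if (keptPos.contains(t.pos)) tf.merge(t.word, 1.0, Double::sum);
        }
        return tf;
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    // Majority vote among the k most similar training examples.
    static String classify(Map<String, Double> query, List<Example> training, int k) {
        List<Example> sorted = new ArrayList<>(training);
        sorted.sort((x, y) -> Double.compare(cosine(query, y.vector), cosine(query, x.vector)));
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (Example e : sorted.subList(0, Math.min(k, sorted.size()))) {
            int count = votes.merge(e.label, 1, Integer::sum);
            if (best == null || count > votes.get(best)) best = e.label;
        }
        return best;
    }

    public static void main(String[] args) {
        Set<String> kept = new HashSet<>(Arrays.asList("n", "v")); // keep nouns and verbs only
        List<Example> training = new ArrayList<>();
        training.add(new Example(toVector(parse("team/n win/v match/n"), kept), "sports"));
        training.add(new Example(toVector(parse("team/n score/n match/n"), kept), "sports"));
        training.add(new Example(toVector(parse("stock/n market/n rise/v"), kept), "finance"));
        training.add(new Example(toVector(parse("bank/n stock/n rate/n"), kept), "finance"));
        Map<String, Double> query = toVector(parse("the/u team/n win/v again/d"), kept);
        System.out.println(classify(query, training, 3));
    }
}
```

Filtering by part of speech before counting shrinks the vocabulary, so every later similarity computation touches fewer dimensions, which is the efficiency argument the abstract makes.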