
Research And Application Of Web Text Classification

Posted on: 2007-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Ke
Full Text: PDF
GTID: 2178360182980901
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Web now holds abundant, heterogeneous, semi-structured and dynamic information resources, more than 80 percent of which exist in the form of Web text. How to find and extract valuable information and knowledge patterns from these vast resources has become an urgent problem in the information processing domain. Web text classification, which originates from Automatic Text Classification (ATC) and is a key component of Web text mining, can address this problem effectively. It can classify search results, which improves search efficiency for Web users, helps them locate the knowledge they are looking for, and extracts valuable knowledge.

Based on an analysis of the current state of research and the open problems in Web mining and Web text mining, this thesis studies the essential technologies of Web text classification, the common text classification methods, and a hybrid Web text classification method based on rough sets and KNN. The main research work is as follows.

(1) Introduce the basic theory and relevant background of Web mining and Web text mining, and analyze the research background, current state and open problems of Web text mining and Web text classification.

(2) Analyze in detail the essential technologies in the Web text classification process, such as preprocessing, word segmentation, text representation, weight computation, feature selection and extraction, and dimensionality reduction. Five factors that influence classification performance and several common methods for evaluating classifiers are also discussed.

(3) Discuss several general text classification methods, including KNN, the vector distance method based on the vector space model (VSM), Bayesian classification, support vector machines and decision trees, and analyze and compare their advantages and disadvantages.

(4) Propose a hybrid Web text classification model based on rough sets and KNN. Rough set attribute reduction is used to reduce the dimension of the document vectors during classification, by means of a simplified attribute reduction algorithm based on the discernibility matrix. Mutual information is used for feature selection. A series of experiments has been carried out, and the results show that the hybrid algorithm is feasible compared with the traditional KNN method.
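The pipeline in item (4) can be illustrated with a minimal Python sketch. It is not the algorithm from the thesis, whose details the abstract does not give. The sketch rests on several assumptions: documents are sets of tokens (binary term presence), feature scores are the expected mutual information between a term and a class, the "simplified" reduction is replaced by a greedy Johnson-style heuristic over the discernibility matrix, and KNN uses Hamming distance over the reduced attributes. All function names and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(docs, labels, term, cls):
    """Mutual information between 'term occurs' and 'document is in cls'."""
    n = len(docs)
    mi = 0.0
    for has_term in (True, False):
        for in_cls in (True, False):
            joint = sum((term in d) == has_term and (y == cls) == in_cls
                        for d, y in zip(docs, labels))
            if joint == 0:
                continue  # an empty cell contributes nothing
            p_term = sum((term in d) == has_term for d in docs) / n
            p_cls = sum((y == cls) == in_cls for y in labels) / n
            mi += (joint / n) * log((joint / n) / (p_term * p_cls))
    return mi


def select_features(docs, labels, k):
    """Keep the k terms with the highest mutual information for any class."""
    vocab = sorted(set().union(*docs))
    classes = set(labels)
    score = {t: max(mutual_information(docs, labels, t, c) for c in classes)
             for t in vocab}
    return sorted(vocab, key=lambda t: -score[t])[:k]


def reduce_attributes(docs, labels, features):
    """Greedy reduct from the discernibility matrix (assumed heuristic)."""
    # One matrix entry per pair of documents with different classes:
    # the set of features on which the two documents differ.
    entries = []
    for (d1, y1), (d2, y2) in combinations(list(zip(docs, labels)), 2):
        if y1 != y2:
            diff = {f for f in features if (f in d1) != (f in d2)}
            if diff:
                entries.append(diff)
    reduct = []
    while entries:
        counts = Counter(f for e in entries for f in e)
        best = max(features, key=lambda f: counts[f])  # covers most entries
        reduct.append(best)
        entries = [e for e in entries if best not in e]
    return reduct


def knn_classify(doc, docs, labels, features, k=3):
    """Majority vote among the k training documents nearest to doc."""
    def distance(a, b):
        # Hamming distance over the retained attributes only.
        return sum((f in a) != (f in b) for f in features)
    ranked = sorted(zip(docs, labels), key=lambda pair: distance(doc, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus: each document is a set of tokens.
train_docs = [
    {"goal", "match", "team"},
    {"team", "coach", "score"},
    {"match", "score", "league"},
    {"stock", "market", "price"},
    {"market", "bank", "rate"},
    {"price", "rate", "stock"},
]
train_labels = ["sports"] * 3 + ["finance"] * 3

features = select_features(train_docs, train_labels, k=6)
reduct = reduce_attributes(train_docs, train_labels, features)
print(features, reduct)
print(knn_classify({"team", "score", "match"}, train_docs, train_labels, reduct))
```

The reduct keeps only enough terms to tell apart every pair of differently labelled training documents that the selected features could already tell apart, so KNN compares documents over fewer dimensions. On a corpus this small the reduct is tiny and the classifier is fragile; the sketch is meant to show the data flow, not to reproduce the reported experiments.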
Keywords/Search Tags: Web Text Mining, Web Text Classification, Rough Set, K Nearest Neighbor, Attribute Reduction