A Research On Automatic WEB Documents Extraction And Classification

Posted on: 2007-08-29; Degree: Master; Type: Thesis
Country: China; Candidate: Z Q Wang; Full Text: PDF
GTID: 2178360242475527; Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of the Internet, the Web now holds abundant, heterogeneous, semi-structured, and dynamic information resources. More than 80 percent of this information exists as Web text. How to find and extract valuable information and knowledge patterns from these vast resources has become an urgent problem in information processing. Web text classification, which originates from Automatic Text Classification (ATC) and is a key component of Web text mining, addresses this problem effectively. It can classify search results, which makes searching more efficient for Web users, improves their ability to locate the knowledge they are looking for, and helps extract valuable knowledge. This thesis first introduces the main methods of text classification and analyzes the characteristics of Web documents. It then proposes that Web text classification be studied at two technical levels, information extraction and text classification, and studies vision-based information extraction (IE) and multi-level text classification based on SVM. We designed Chinese Web text categorization software comprising a Web spider module, a Chinese word segmentation module, a feature selection module, and a machine learning module. Finally, we conducted an experiment to test the accuracy of these methods using the classification system and SQL Server 2005 Text Mining. The experimental results show that the software achieves high accuracy.
Keywords/Search Tags: Spider, Information Extraction, Text Classification, SVM, Classification search engines
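
The abstract names a pipeline of word segmentation, feature selection, and SVM learning but gives no implementation details. The following is only a minimal sketch of how such stages could fit together, not the thesis's actual software. Everything in it is an assumption made for illustration: character bigrams stand in for a real Chinese word segmenter, chi-square is used as one common feature selection criterion, the linear SVM is trained by plain subgradient descent on the hinge loss, the classifier is a single binary level rather than the multi-level scheme, the Web spider is omitted, and the function names and toy sentences are hypothetical.

```python
import math
from collections import Counter

def segment(text):
    """Stand-in for Chinese word segmentation: overlapping character bigrams."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score against a +1/-1 label."""
    n = len(docs)
    pos = sum(1 for y in labels if y == 1)
    neg = n - pos
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            (df_pos if y == 1 else df_neg)[t] += 1
    scores = {}
    for t in set(df_pos) | set(df_neg):
        a, b = df_pos[t], df_neg[t]   # documents containing t: positive, negative
        c, d = pos - a, neg - b       # documents lacking t: positive, negative
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[t] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return {t: i for i, t in enumerate(ranked[:k])}

def vectorize(tokens, vocab):
    """L2-normalized term-frequency vector over the selected vocabulary."""
    vec = [0.0] * len(vocab)
    for t in tokens:
        if t in vocab:
            vec[vocab[t]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=100):
    """Subgradient descent on the L2-regularized hinge loss (no bias term)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            w = [wj * (1.0 - eta * lam) for wj in w]      # regularization step
            if margin < 1.0:                              # hinge loss is active
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w

def predict(text, vocab, w):
    x = vectorize(segment(text), vocab)
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Hypothetical toy corpus: +1 = sports, -1 = finance.
TRAIN = [
    ("足球比赛今晚开始", 1), ("篮球比赛结果公布", 1), ("足球联赛冠军诞生", 1),
    ("股票市场今日上涨", -1), ("银行利率调整公布", -1), ("股票基金市场下跌", -1),
]
docs = [segment(text) for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
vocab = select_features(docs, labels, k=10)
weights = train_linear_svm([vectorize(d, vocab) for d in docs], labels)

print(predict("足球比赛冠军", vocab, weights))
print(predict("股票市场利率", vocab, weights))
```

On this toy data the first query is labeled as sports and the second as finance. A multi-level classifier in the spirit of the abstract could apply the same training step once per node of a category tree, but how the thesis actually organizes its levels is not stated here.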