A Research On Automatic WEB Documents Extraction And Classification

Posted on: 2007-08-29

Degree: Master

Type: Thesis

Country: China

Candidate: Z Q Wang

Full Text: PDF

GTID: 2178360242475527

Subject: Management Science and Engineering

Abstract/Summary:

With the rapid development of the Internet, the Web now holds abundant, heterogeneous, semi-structured, and dynamic information resources, more than 80 percent of which exist as Web text. How to find and acquire valuable information and knowledge patterns in these vast resources has become an urgent problem in information processing. Web text classification, which originates from automatic text classification (ATC) and is a key component of Web text mining, addresses this problem effectively. It can classify search results, which improves search efficiency for Web users, helps them locate target knowledge, and supports the extraction of valuable knowledge. This thesis first introduces the main text classification methods and analyzes the characteristics of Web documents. It then proposes that Web text classification must be studied at two technical levels, information extraction and text classification, and examines vision-based information extraction (IE) and multi-level text classification based on support vector machines (SVM). We designed Chinese Web text categorization software comprising a Web spider module, a Chinese word segmentation module, a feature selection module, and a machine learning module. Finally, we conducted an experiment to test the accuracy of these methods using the classification system and SQL Server 2005 text mining. The experimental results show that the software achieves high accuracy.
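
The segmentation, feature selection, and machine learning stages named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the thesis's implementation: character bigrams stand in for a real Chinese word segmenter, document frequency stands in for the unspecified feature selection criterion, a binary linear SVM trained by subgradient descent stands in for the multi-level SVM, and the four training strings are toy data. The spider and the vision-based extraction stage are omitted.

```python
from collections import Counter

def tokenize(text):
    """Stand-in for Chinese word segmentation: overlapping character bigrams."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

def select_features(docs, k):
    """Keep the k tokens with the highest document frequency (assumed criterion)."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    # Sort by descending frequency, then by token, so the result is deterministic.
    ranked = sorted(df, key=lambda tok: (-df[tok], tok))
    return {tok: i for i, tok in enumerate(ranked[:k])}

def vectorize(doc, vocab):
    """Map a document to a term-count vector over the selected features."""
    vec = [0.0] * len(vocab)
    for tok in tokenize(doc):
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=100):
    """Binary linear SVM: subgradient descent on the L2-regularized hinge loss.

    Labels in y must be +1 or -1. The bias b is not regularized.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Regularization step: shrink the weights slightly.
            w = [wj * (1.0 - eta * lam) for wj in w]
            # Hinge-loss step: only examples inside the margin contribute.
            if margin < 1.0:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

def predict(doc, vocab, w, b):
    """Return +1 or -1 for a raw document string."""
    x = vectorize(doc, vocab)
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical toy data: +1 = finance, -1 = sports.
TRAIN_DOCS = ["股票上涨", "股票下跌", "足球比赛", "足球进球"]
TRAIN_LABELS = [1, 1, -1, -1]

vocab = select_features(TRAIN_DOCS, k=2)
X = [vectorize(d, vocab) for d in TRAIN_DOCS]
w, b = train_linear_svm(X, TRAIN_LABELS)
print(predict("股票大涨", vocab, w, b), predict("足球决赛", vocab, w, b))
```

A multi-level classifier of the kind the abstract mentions could, under the same assumptions, be built by training one such binary model per node of a category hierarchy and routing each document from the top level downward.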

Keywords/Search Tags:

Spider, Information Extraction, Text Classification, SVM, Classification search engines

Related items

1	The Study And Implementation Of Web Information Extraction Mechanism Based On Classification Semantics
2	News Selection And Classification Based On Triple-play Service
3	Information Retrieval Oriented Text Classification Technology Research
4	Semi-supervised Web-page Classification And Its Application In Directory-style Search Engines
5	Design And Implementation Of Classification And Aggregation System Of Scientific And Technological Information
6	User Web Information Collection And Analysis System Based On The Smart Router
7	Research And Implementation Of Focus Crawling Spider Based On A. T. C And Optimized Hyperlink Chosen Strategy
8	Mongolia Web Spider, Text Encoding Recognition And Conversion Research
9	Text Classification Based On Natural Dimension Of Webpage
10	Web Information Retrieval System Based On Classification Semantics