Study On Web Content Extraction And Semantic Recognition

Posted on:2007-12-21

Degree:Master

Type:Thesis

Country:China

Candidate:Y M Liu

GTID:2178360182483039

Subject:Communication and Information System

Abstract/Summary:

With the development of the Internet, the web has become a powerful medium, but it is a double-edged sword: while it can be used to publicize information and guide opinion constructively, it is hard to prevent hostile forces from exploiting it. Research on web information security technology has therefore become an important branch of information security. Text classification is the most effective way to enforce web information security, and extracting topical content accurately and efficiently is a key technique for improving the accuracy of a web classification system. This thesis studies web page layout analysis, topic content extraction, and text classification of the extracted text.

First, the characteristics of web pages and the HTML language are introduced, along with the characteristics and applications of the DOM tree. Second, related work is reviewed, and several existing methods for web page layout analysis and topic content extraction are compared. On this basis, two methods are proposed: one based on statistics and one based on a coordinate tree.

The statistics-based method is simple and effective. It exploits inherent characteristics of web design and HTML source code, and it achieves high accuracy. Experiments show that the method is feasible and precise. However, it works well only for topic text and cannot extract multimedia information along with related links and related pictures. A novel method based on a coordinate tree is therefore proposed. Because the DOM tree lacks position information, coordinate information is added to it to create a coordinate tree, and a graph model reflecting spatial relations is put forward. By transforming HTML documents into coordinate trees, web pages are analyzed and their content is extracted on the basis of position features and spatial relations.

Experimental results on a set of 5,000 web pages from 120 different sites show that the approach achieves an accuracy of 93.78%, and that it also has better precision and recall when extracting related links and related pictures. Finally, text classification is studied, and a text classification algorithm based on an SVM-decision tree is presented. The test results on the extracted text are satisfactory.
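The abstract does not say which statistics the first method collects, so the following is only a rough sketch of the general idea of statistics-based topic text extraction, not the thesis's algorithm. It assumes a simple heuristic: for each block element, count plain-text characters against link-text characters, then keep the block where plain text dominates most. The names `BlockStats`, `extract_topic_text`, `BLOCK_TAGS`, and `SAMPLE_PAGE`, and the scoring formula, are all hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as candidate content blocks.
BLOCK_TAGS = {"div", "td", "article", "section"}


class BlockStats(HTMLParser):
    """Collect, per block element, how much plain text and link text it holds."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # one dict of counters per block element seen
        self.stack = []       # indices into self.blocks for currently open blocks
        self.link_depth = 0   # > 0 while inside an <a> element
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            self.blocks.append({"text": [], "chars": 0, "link_chars": 0})
            self.stack.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth or not self.stack:
            return
        block = self.blocks[self.stack[-1]]  # credit the innermost open block only
        block["chars"] += len(text)
        if self.link_depth:
            block["link_chars"] += len(text)
        else:
            block["text"].append(text)


def extract_topic_text(html):
    """Return the plain text of the block with the best text-versus-link score."""
    parser = BlockStats()
    parser.feed(html)
    parser.close()

    def score(block):
        # Non-link characters minus link characters (hypothetical scoring rule).
        return block["chars"] - 2 * block["link_chars"]

    best = max(parser.blocks, key=score, default=None)
    if best is None or score(best) <= 0:
        return ""
    return " ".join(best["text"])


SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/a">Home</a> <a href="/b">News</a> <a href="/c">Contact us</a></div>
<div id="main"><p>Topic content extraction keeps the main article text.</p>
<p>It drops navigation bars and advertisement links.</p></div>
<div id="footer"><a href="/x">About</a> Copyright</div>
<script>var x = "not content";</script>
</body></html>
"""

if __name__ == "__main__":
    print(extract_topic_text(SAMPLE_PAGE))
```

A heuristic of this kind uses only the HTML source, which is consistent with the limitation the abstract notes: it has no position information, so it cannot tell which links or pictures sit next to the main text on the rendered page.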

Keywords/Search Tags:

Page layout analysis, Content extraction, DOM, Coordinate tree, Heuristic rules, Tokens statistics, Text classification

Related items

1	A Framework Of Web Page Analysis And Content Extraction Based On Coordinate Tree
2	Study On Web Page Rationality Of Universities’ Websites In Hebei Province
3	The Research And Application Of Segmentation Method Between Image And Text In Layout Analysis
4	The Research And Implementation Of Web Text Classification That Use Table Information
5	Research On Layout Analysis And Text Line Extraction Of Document Image
6	The Designation And Implementation Of Business Insight System Base On Web Content
7	Information Filtering Technologies Based On Heuristic Rules And Text Classification
8	Research On Content Extraction In HTML Web Pages Based Multi-Features
9	Research On Internet Public Opinion Information Extraction And Classification
10	Analysis Of Deep Web Page's Structure And Its Rich-Content Extraction