
Study On Web Content Extraction And Semantic Recognition

Posted on: 2007-12-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Liu
Full Text: PDF
GTID: 2178360182483039
Subject: Communication and Information System
Abstract/Summary:
With the development of the Internet, the web, as a powerful network medium, has become a double-edged sword. While it can be used to publicize and guide actively, it is hard to prevent it from being exploited by hostile forces. The study of information security technology for the web has therefore become an important research branch of information security. Text classification is the most effective way to carry out web information security, and extracting topical content accurately and efficiently is a key technique for improving the accuracy of a web classification system. This thesis studies web page layout analysis, topic content extraction, and text classification of the extracted text.

First, the characteristics of web pages and the HTML language are introduced, along with the characteristics and applications of the DOM tree. Second, related work is reviewed, and existing methods for web page layout analysis and topic content extraction are compared. On this basis, two methods are proposed: one based on statistics and one based on a coordinate tree. The statistics-based method is simple and effective; it exploits the inherent characteristics of web design and HTML source code and achieves high accuracy. Experiments show that the method is feasible and has good precision. However, it works only for topic text and cannot extract multimedia information along with related links and related pictures. A novel method based on a coordinate tree is therefore proposed. Because the DOM tree lacks position information, coordinate information is added to create a coordinate tree, and a graph model reflecting spatial relations is put forward. By transforming HTML documents into coordinate trees, web pages are analyzed and content is extracted based on position features and spatial relations.

Experimental results on a set of 5,000 web pages from 120 different sites show that the approach achieves an accuracy of 93.78%, and that it also has better precision and recall for extracting related links and related pictures. Finally, text classification is studied, and a text classification algorithm based on an SVM-decision tree is presented. Test results on the extracted text are satisfactory.
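
To make the idea of statistics-based topic content extraction more concrete, the following is a minimal sketch in Python. It is not the thesis's algorithm, whose details are not given in this abstract. It assumes a common heuristic instead: the main content block is the one with the most text that is not inside hyperlinks. The names `BlockStats`, `extract_main_text`, `BLOCK_TAGS`, and `SAMPLE_PAGE`, as well as the scoring formula, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate content blocks.
BLOCK_TAGS = {"div", "td", "article", "section"}
# Text inside these tags is never visible content.
SKIP_TAGS = {"script", "style"}


class BlockStats(HTMLParser):
    """Collects text and link-text statistics for each block element."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # one dict of statistics per block seen
        self.stack = []     # indexes of currently open blocks
        self.in_link = 0    # depth of open <a> tags
        self.skip = 0       # depth of open <script>/<style> tags

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip += 1
        elif tag == "a":
            self.in_link += 1
        elif tag in BLOCK_TAGS:
            self.blocks.append({"text": [], "link_chars": 0})
            self.stack.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip:
            self.skip -= 1
        elif tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag in BLOCK_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip or not self.stack:
            return
        # Attribute the text to the innermost open block only.
        block = self.blocks[self.stack[-1]]
        block["text"].append(text)
        if self.in_link:
            block["link_chars"] += len(text)


def extract_main_text(html):
    """Return the text of the block with the most non-link text."""
    parser = BlockStats()
    parser.feed(html)
    parser.close()
    best_text, best_score = "", 0.0
    for block in parser.blocks:
        total = sum(len(t) for t in block["text"])
        if total == 0:
            continue
        link_density = block["link_chars"] / total
        # Hypothetical score: long blocks with few links rank highest.
        score = total * (1.0 - link_density)
        if score > best_score:
            best_text, best_score = " ".join(block["text"]), score
    return best_text


SAMPLE_PAGE = """<html><body>
<div id="nav"><a href="/a">Home</a> <a href="/b">News</a></div>
<div id="main"><p>Topic content extraction keeps the article body.</p>
<p>It drops navigation bars and advertisements.</p></div>
<div id="footer"><a href="/x">About</a> Copyright</div>
<script>var x = "<div>not content</div>";</script>
</body></html>"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE_PAGE))
```

A heuristic of this kind uses only text statistics from the HTML source, which is consistent with the limitation noted above: it has no position information and so cannot relate pictures or link groups to the main text by their place on the page.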
Keywords/Search Tags: Page layout analysis, Content extraction, DOM, Coordinate tree, Heuristic rules, Token statistics, Text classification