Research on Web Information Extraction Based on Page Structure Clustering

Posted on: 2014-02-03
Degree: Master
Type: Thesis
Country: China
Candidate: H W Liao
GTID: 2248330398974098
Subject: Signal and Information Processing
Abstract/Summary:
The Web has become the world's largest and most comprehensive information repository, and obtaining valuable information from it automatically and rapidly is increasingly important. Most Web pages are written in HTML, so the rendered pages are largely structured or semi-structured. Many sites generate their pages dynamically from data, and pages produced from the same template present topical information in a similar layout. Exploiting these features, this thesis proposes a Web information extraction method based on clustering pages by structure, and designs a prototype system based on this method. The system groups Web pages by structural similarity, generates extraction rules for similar pages easily and quickly, and extracts page information accurately using the generated rules. The system is divided into three modules: (1) a Web page download module, which implements an efficient Web crawler for collecting pages; (2) a rule learning module, which performs Web page clustering; (3) an information extraction module, which performs Web information extraction.

This thesis first studies the structure of Web pages and represents each page as a tree using the DOM model. Page structure similarity algorithms are analyzed on this representation; an improved algorithm is proposed and compared with the tree edit distance algorithm and the tree path matching algorithm. Hierarchical clustering driven by the similarity algorithm is then used to find similar pages. Next, Web crawler technology and Web page preprocessing technology, including the Web DOM model, page cleaning, and graphical display of page structure, are studied. Finally, the thesis studies the representation of extraction rules.

Experiments on multiple Web sites show that the method extracts data records from similar Web pages with high accuracy.
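
The step of scoring structural similarity and then clustering pages can be illustrated with a minimal Python sketch. It is not the thesis's improved algorithm, which the abstract does not specify. As a stand-in it uses a simple form of tree path matching (Jaccard similarity over the sets of root-to-node tag paths) followed by average-linkage hierarchical clustering. The names `tag_paths`, `path_similarity`, and `cluster_pages`, the threshold of 0.6, and the sample pages are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths of an HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # e.g. {"html", "html/body", "html/body/div"}

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths found in an HTML string."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def path_similarity(paths_a, paths_b):
    """Jaccard similarity of two path sets, in [0, 1]."""
    if not paths_a and not paths_b:
        return 1.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)


def cluster_pages(pages, threshold=0.6):
    """Average-linkage agglomerative clustering; returns lists of page indices.

    The threshold is a hypothetical cut-off: merging stops once the most
    similar pair of clusters scores below it.
    """
    path_sets = [tag_paths(page) for page in pages]
    clusters = [[i] for i in range(len(pages))]

    def linkage(c1, c2):
        total = sum(path_similarity(path_sets[i], path_sets[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best_score, a, b = max(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best_score < threshold:
            break
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


# Two pages built from the same (made-up) template and one unrelated page.
pages = [
    "<html><body><div><h1>A</h1><p>x</p></div></body></html>",
    "<html><body><div><h1>B</h1><p>y</p><p>z</p></div></body></html>",
    "<html><body><table><tr><td>1</td></tr></table></body></html>",
]
print(cluster_pages(pages))  # [[0, 1], [2]]
```

Because repeated sibling tags collapse into a single path, the two template pages score 1.0 even though one has an extra paragraph, while the table page shares only the `html` and `html/body` paths with them and scores 0.25. A set-based measure like this ignores node order and repetition counts, which a tree edit distance would take into account at higher computational cost.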
Keywords/Search Tags: information extraction, DOM (Document Object Model), Web structure similarity, Web page clustering