
Web Information Extraction Based on Principal Part Extraction

Posted on: 2014-02-01  Degree: Master  Type: Thesis
Country: China  Candidate: J Yu  Full Text: PDF
GTID: 2248330395483815  Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the amount of data online has exploded in recent years, making the Web an important channel for disseminating and sharing information around the world. However, Web pages are inherently semi-structured, and much of the information on a page has nothing to do with the page's subject, so people cannot obtain the information they need quickly and accurately. Research on how to extract the needed information from the Web has therefore become increasingly important.

Scholars in China and abroad have done a great deal of research in this field. However, after analyzing both the existing Web information extraction methods and the features of current Web pages, we found that existing extraction technologies are insufficiently automated and produce inaccurate extraction results. To address these two deficiencies, this thesis presents a Web information extraction method based on principal part extraction. The method consists of four modules: page preprocessing, page principal part extraction, extraction rule generation, and information extraction. The page preprocessing module uses JTidy to normalize the HTML tags and to remove some of the page content that is unrelated to the subject of the page. The page principal part extraction module parses the page with HTMLParser to obtain the page structure tree, and then recognizes the principal part by analyzing the characteristics of that tree with the proposed MMTD-based algorithm. The extraction rule generation module employs XPATH and XSLT to generate the extraction rules for the principal parts of Web pages. The information extraction module obtains the required data by applying the extraction rules to the page to be extracted, and then stores the data in a database so that people can look it up and use it conveniently. Throughout this process, Web information extraction builds on the extraction of the page's principal part, which is where the proposed method gets its name.

The proposed method is an automated information extraction method in which the entire extraction process needs almost no human participation. Compared with existing methods, it has a higher degree of automation. In addition, the use of the powerful and flexible XPATH and XSLT greatly simplifies rule generation, and it also improves the versatility and accuracy of the method.

A prototype Web information extraction system based on principal part extraction is designed and implemented according to the above. The system performs Web information extraction by combining the functional modules, and provides a visual interface so that users can operate it easily. Finally, experiments are conducted on some mainstream websites, and the experimental results demonstrate that the proposed method is effective and correct.
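
To make the flow from principal part to extracted record more concrete, the following is a minimal Python sketch, not the thesis's implementation. It assumes the page has already been tidied into well-formed markup (the step the abstract assigns to JTidy), replaces the MMTD-based algorithm, whose details the abstract does not give, with a plainly simpler stand-in heuristic (choose the `<div>` containing the most text), and writes the extraction rules in the limited XPath subset that Python's `xml.etree.ElementTree` supports. The sample page, the rule names, and the helper functions are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed page (the thesis uses JTidy for this cleanup).
PAGE = """<html><body>
<div id="nav"><a href="/">Home</a><a href="/news">News</a></div>
<div id="content">
  <h1>Sample headline</h1>
  <p>First paragraph of the article body.</p>
  <p>Second paragraph of the article body.</p>
</div>
<div id="footer">Copyright notice</div>
</body></html>"""


def text_length(node):
    """Count the characters of text under a node, ignoring surrounding whitespace."""
    return sum(len(chunk.strip()) for chunk in node.itertext())


def find_principal_part(root):
    """Stand-in heuristic (NOT the MMTD-based algorithm): the <div> with the most text."""
    return max(root.iter("div"), key=text_length)


def apply_rules(part, rules):
    """Apply field -> path rules to the principal part and collect element text."""
    return {field: [el.text for el in part.findall(path)]
            for field, path in rules.items()}


# Hypothetical extraction rules, relative to the principal part.
RULES = {"title": "./h1", "paragraphs": "./p"}

root = ET.fromstring(PAGE)
principal = find_principal_part(root)
record = apply_rules(principal, RULES)
print(principal.get("id"), record)
```

In this sketch the rules are written by hand; in the method described above they would be generated automatically and expressed in full XPATH and XSLT, and the resulting records would be written to a database rather than printed.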
Keywords/Search Tags: Web information extraction, JTidy, MMTD, XSLT, HTMLParser