
Research On Technique Of Self-adaptive Web Data Extraction

Posted on: 2017-07-30
Degree: Master
Type: Thesis
Country: China
Candidate: X L Chen
Full Text: PDF
GTID: 2348330482999732
Subject: Computer application technology
Abstract/Summary:
Since the 1990s, Internet technology has developed rapidly, and the volume of information available on the Internet has grown explosively. Today, the Internet has become a huge and open knowledge base. However, the format of this information is very complex. In addition to its main body, a web page usually includes a navigation bar, advertising links, related readings and other noise. This noise greatly reduces the efficiency and accuracy of search engines. How to adaptively extract information from the Internet has therefore become an important research topic.

Web pages are generally semi-structured, and their content lacks strict grammatical rules, so traditional natural language processing techniques do not perform well on Web page information extraction. On the other hand, Web pages rendered by browsers carry a large number of visual and statistical features that can be used for Web data extraction. The research in this thesis is aimed mainly at the needs of public opinion analysis and studies techniques for extracting public opinion data from the Web. The specific work is as follows:

1. A content extraction method based on the visual features of Web pages is proposed for the problem of Web content extraction. The rules of the VIPS algorithm are improved according to the characteristics of HTML5 and of web page layout, and the improved VIPS algorithm is then applied to divide a web page into independent semantic blocks. Rules based on the visual and statistical characteristics of Web pages are proposed for extracting public opinion data; these rules remove the blocks that contain no content information and keep the blocks that do. Finally, the data in the remaining blocks is assembled into the content of the Web page.

2. An adaptive Web data extraction method is designed. The method applies XPath expressions to extract entity data from Web pages, and uses templates to record the characteristics of the data. If the page structure changes so that the original XPath expressions fail to extract the data, the method retrieves the data according to the records in the templates. To increase efficiency, the algorithm searches from a leaf node toward the root of the page. Once the retrieval succeeds, the data is obtained and the XPath expression is updated. The algorithm thus adapts itself to changes in web page structure and reduces manual intervention.

3. A Web content extraction system and an entity data extraction system are designed and implemented, based on a study of the DOM tree and XPath technology.

Experiments on data sets collected from 10 news and forum websites show that the content extraction method reaches a higher accuracy than the traditional algorithms, and that the data extraction method also reaches a high accuracy when extracting data from changed Web pages.
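
The adaptive extraction idea in item 2 could be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation. It assumes the page is well-formed XML so that Python's standard xml.etree.ElementTree can parse it, that a template records only a tag name and a class attribute, and that the stored path has the form "./a/b[2]". The names extract, matches, path_of and the state dictionary are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    # ElementTree nodes do not know their parent, so record it once.
    return {child: parent for parent in root.iter() for child in parent}


def path_of(node, root, parents):
    # Build a positional path from root to node, e.g. "./body[1]/div[3]".
    steps = []
    while node is not root:
        parent = parents[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "./" + "/".join(reversed(steps)) if steps else "."


def matches(node, template):
    # Assumed template: the recorded characteristics are tag and class only.
    return node.tag == template["tag"] and node.get("class") == template["class"]


def extract(root, state):
    """Return the target text, repairing state["xpath"] if the page changed."""
    node = root.find(state["xpath"])
    if node is not None and matches(node, state["template"]):
        return node.text  # the stored expression still works

    # Fallback: follow the old path as deep as it still resolves ...
    parents = build_parent_map(root)
    anchor = root
    for step in state["xpath"].split("/")[1:]:
        nxt = anchor.find("./" + step)
        if nxt is None:
            break
        anchor = nxt

    # ... then widen the search one level at a time, from that node to the root.
    while True:
        for candidate in anchor.iter():
            if matches(candidate, state["template"]):
                state["xpath"] = path_of(candidate, root, parents)  # update
                return candidate.text
        if anchor is root:
            return None  # nothing on the page fits the template
        anchor = parents[anchor]
```

Starting the search near the old location means a small layout change is usually repaired without scanning the whole page, while a larger change still ends in a full search from the root. Real HTML is often not well-formed XML, so a practical version would need an HTML-tolerant parser and probably richer template features than tag and class.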
Keywords/Search Tags: VIPS Algorithm, Visual Features, Content Extraction, Data Extraction