
Key Techniques On Deep Web Data Extraction

Posted on: 2013-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Cai
Full Text: PDF
GTID: 2248330374481408
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet and related technologies, the Web has become the largest source of information. Obtaining and integrating Web data effectively gives strong support to data analysis and mining, and is therefore of considerable value. Integrating this information by hand alone, however, is unrealistic, so automated Web data integration has become a research focus. Automated Web data extraction is one of the key issues in automated Web data integration.

Web data extraction comprises Web data acquisition and Web data semantic annotation: identifying and obtaining the information a user needs from a Web page, then processing it into structured data that has clear semantics and can be used by a computer.

According to the depth at which information is accessed, the Web can be divided into the Surface Web and the Deep Web. The volume of information stored in the Deep Web now far exceeds that of the Surface Web, and its quality is also higher, so the Deep Web is the more valuable of the two. This thesis studies data extraction in the Deep Web, in the context of Deep Web data integration systems. Addressing several key issues in Deep Web data integration, its innovative work covers the following two aspects:

(1) Inducing the pattern of a Web page based on string pattern matching

Each Web site has its own topics and its own formats for arranging page structure and presenting information, so there is a strong need for value-added services that extract information from multiple sources. Data extraction from HTML is usually performed by software modules called wrappers. Existing methods for inducing the pattern of a Web page have difficulty handling pages with complex structure, which reduces their accuracy and reliability. This thesis proposes a novel and effective method that generates the pattern of a Web site automatically. An algorithm based on string pattern matching automatically discovers the nested and repeated structures in a Web page, and a regular expression is then generated as the pattern of the Web site.

(2) Semantic annotation of Web data based on visual features of the page

Existing methods for semantic annotation of Web data mostly focus on the data itself, considering the visual features and mode features of the data and the logical relationships between data items. In many cases, analyzing only these features of the data makes it difficult to distinguish adjacent data items that have similar semantic features and mode features. This thesis presents a method based on conditional random fields constrained by the visual features of the page. The method introduces the visual features of the Web page as constraints on the conditional random field annotation model in order to improve annotation performance. First, while extracting data from similar Web pages, the visual features of the data on each page are easily obtained and are assembled into a sequence of visual features. Then, by analyzing these sequences over the sample pages, the visual features common to each semantic data item are obtained and used to build the page-level visual constraints for that item. Introducing the visual features of the data on the page makes it possible to annotate accurately two adjacent data items whose semantic features or mode features are very similar, and it effectively improves the performance of Web data semantic annotation.
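The idea behind contribution (1), finding repeated and nested structure in a page and emitting a regular expression, can be illustrated with a minimal Python sketch. This is not the thesis's algorithm, whose details the abstract does not give. It assumes a much simpler scheme: the page is reduced to a sequence of tag tokens, adjacent identical runs (tandem repeats) are collapsed from the shortest unit upward, and the result is printed as a regular expression over the token string. The names `tokenize`, `collapse`, `to_regex`, and `induce_pattern` are hypothetical.

```python
import re

def tokenize(html):
    """Reduce a page to tag tokens; attributes are dropped, text becomes '#text'."""
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", html):
        if tag:
            m = re.match(r"<\s*(/?)\s*([A-Za-z0-9]+)", tag)
            if m:  # comments and doctypes do not match and are skipped
                tokens.append("<" + m.group(1) + m.group(2).lower() + ">")
        elif text.strip():
            tokens.append("#text")
    return tokens

def collapse(seq):
    """Replace each run of adjacent identical units with ('+', unit).

    Shorter units are tried first and the scan restarts after every change,
    so inner repeats are folded before the outer repeats that contain them.
    """
    seq = list(seq)
    changed = True
    while changed:
        changed = False
        for width in range(1, len(seq) // 2 + 1):
            i = 0
            while i + 2 * width <= len(seq):
                unit = seq[i:i + width]
                j = i + width
                while seq[j:j + width] == unit:
                    j += width
                if j > i + width:          # at least two copies of the unit
                    seq[i:j] = [("+", tuple(unit))]
                    changed = True
                else:
                    i += 1
            if changed:
                break
    return seq

def to_regex(seq):
    """Render a collapsed sequence as a regular expression over token strings."""
    parts = []
    for el in seq:
        if isinstance(el, tuple):          # ('+', unit): one or more repetitions
            parts.append("(?:" + to_regex(el[1]) + ")+")
        else:
            parts.append(re.escape(el))
    return "".join(parts)

def induce_pattern(html):
    return to_regex(collapse(tokenize(html)))

sample = "<ul><li>a</li><li>b</li><li>c</li></ul>"
pattern = induce_pattern(sample)
other = "".join(tokenize("<ul class='x'><li>z</li></ul>"))
print(bool(re.fullmatch(pattern, other)))
```

The pattern learned from a three-item list also matches a one-item list from the same template, which is the property a wrapper needs. A realistic system would additionally have to tolerate optional elements and noise between records, which this sketch does not attempt.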
Keywords/Search Tags: Web Data Integration, Web Data Extraction, Pattern of Web Page, Web Data Semantic Annotation