Research On Web Information Extraction Technology Based On Deep Web

Posted on: 2011-11-06
Degree: Master
Type: Thesis
Country: China
Candidate: P Z Wang
Full Text: PDF
GTID: 2178330332971237
Subject: Software engineering
Abstract/Summary:
In the information age, there are ever more ways to obtain information. As an information carrier, the Internet holds an irreplaceable position in both capacity and transmission efficiency. As the volume of information on the Internet grows, however, finding what one needs becomes increasingly difficult. Search engines have improved the situation, but obtaining specialized information is still not easy. At present, most of the information on websites is stored in backend databases and can be reached only by querying those databases. Web information extraction has therefore become a hot research field. Web information extraction divides the Web into two parts: the Surface Web and the Deep Web. The Surface Web consists of HTML pages that can be reached by following links. Deep Web pages, by contrast, are generated dynamically by querying a backend database through specific database access technology. To a certain extent, Deep Web content is generated from templates, so it exhibits a regular data structure. Extraction of structured or semi-structured information is also one of the key technologies of vertical search engines, and vertical search is itself based on the Deep Web. At present, such data are extracted by wrappers; building a wrapper requires analyzing the web pages and generating extraction rules. Redundant information outside the main data of a page, if it enters rule generation, affects not only the wrapper but also the efficiency of extraction and the accuracy of the results. This thesis proposes a division of HTML page data: a page is divided into a main data area and a non-main data area, and the HTML DOM tree structure is then used to identify the data region and extract its data.
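As an illustration of the DOM tree with node levels that the approach relies on, the following is a minimal sketch using only the Python standard library. The names `Node`, `DomBuilder`, `build_dom`, `leaf_paths` and the `VOID_TAGS` set are hypothetical helpers introduced here; the abstract does not say how the thesis builds its tree.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (an illustrative, incomplete set).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of the DOM tree, annotated with its level (depth)."""

    def __init__(self, tag, depth, parent=None):
        self.tag = tag
        self.depth = depth
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Builds a simple tree of Node objects while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", 0)
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current.depth + 1, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the matching open element, tolerating sloppy markup.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()


def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def leaf_paths(node, path=()):
    """Yield (tag path, level, text) for every leaf node of the tree."""
    path = path + (node.tag,)
    if not node.children:
        yield path, node.depth, node.text
    for child in node.children:
        yield from leaf_paths(child, path)


page = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"
leaves = list(leaf_paths(build_dom(page)))
```

Leaves that share the same tag path and level, such as the two `li` nodes here, are natural candidates for repeated records inside a data area.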
To recognize the data area, the nodes of the HTML DOM tree are divided by level and the similarity characteristics of leaf nodes are compared. To extract data blocks, the relationships between nodes are examined: if a group of nodes satisfies the node-similarity condition, the location of the data block is known. Finally, data items are identified with a tree edit distance algorithm on the HTML tree: the best match is selected, and the information items are then extracted. In short, the approach works on the DOM tree of the entire page, using the level of each node and comparing nodes with similar characteristics. Extraction rules are then derived from the characteristics of Deep Web data, giving a method for extracting tree-structured information. Experimental results show that, to a certain extent, this method improves the efficiency of data extraction as well as extraction accuracy and recall.
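The idea of locating a data block through node similarity and tree edit distance can be sketched as follows. This is only an illustration under stated assumptions: it uses a simplified top-down variant of tree edit distance in which subtrees are inserted or deleted whole, which is not necessarily the algorithm used in the thesis. The tuple representation `(tag, children)`, the function names, the normalization in `similarity`, and the `0.8` threshold are all hypothetical.

```python
def size(tree):
    """Number of nodes in a tree given as (tag, [children])."""
    tag, children = tree
    return 1 + sum(size(child) for child in children)


def tree_distance(a, b):
    """Simplified top-down tree edit distance.

    Relabeling a node costs 1; inserting or deleting a child costs the
    size of its whole subtree; child lists are aligned in order with the
    usual sequence edit-distance dynamic program.
    """
    (tag_a, kids_a), (tag_b, kids_b) = a, b
    relabel = 0 if tag_a == tag_b else 1
    m, n = len(kids_a), len(kids_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(kids_a[i - 1]),  # delete subtree
                dp[i][j - 1] + size(kids_b[j - 1]),  # insert subtree
                dp[i - 1][j - 1] + tree_distance(kids_a[i - 1], kids_b[j - 1]),
            )
    return relabel + dp[m][n]


def similarity(a, b):
    """Map the distance to a score in (0, 1]; 1.0 means identical trees."""
    return 1 - tree_distance(a, b) / (size(a) + size(b))


def find_data_region(parent, threshold=0.8):
    """Return the longest run of adjacent children that resemble each other."""
    _, kids = parent
    best, start = (0, 0), 0
    for i in range(1, len(kids) + 1):
        # Close the current run at the end or where similarity drops.
        if i == len(kids) or similarity(kids[i - 1], kids[i]) < threshold:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = i
    return kids[best[0]:best[1]]


item = ("li", [("a", []), ("span", [])])
item_with_extra = ("li", [("a", []), ("span", []), ("b", [])])
banner = ("div", [("img", [])])
page = ("ul", [banner, item, item, item_with_extra])
region = find_data_region(page)
```

In this toy page the banner is excluded and the three `li` subtrees form the region, even though one of them carries an extra child. A real wrapper would also need to choose which parent node to examine and how to set the threshold.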
Keywords/Search Tags: Deep Web, Information extraction, Tree Node, Data region, DOM