
Research on Web Information Extraction Based on XML

Posted on: 2014-10-18
Degree: Master
Type: Thesis
Country: China
Candidate: B X Zheng
Full Text: PDF
GTID: 2268330425483337
Subject: Computer application technology
Abstract/Summary:
The Internet has made it possible for people to share information widely, but it has also made it increasingly difficult to find the data they want among the vast amount available, and the shortcomings of search engines have become apparent. In this context, attention has turned to Web information extraction technology. At present, most pages on the Internet are written in HTML. As the earliest language for writing Web pages, HTML inevitably has many shortcomings, such as non-standard code, redundant structure, and page presentation mixed together with the data the page contains. These shortcomings make a Web information extraction system built directly on HTML difficult to implement, and its generality and accuracy are hard to make satisfactory. XML is a simple, platform-independent, and widely adopted standard. Its most important advantage is that data is separated from the user interface. An information extraction system implemented with XML also tends to be more stable and extensible, and its extraction results tend to be more accurate. This thesis therefore studies Web information extraction based on XML: it discusses the related technologies, establishes an application model for Web information extraction, and implements semi-automatic extraction of Web information. The main work is as follows:

(1) To acquire similar pages, URL comparison is combined with an algorithm based on optimal free matching of subtrees. This addresses the poor extraction efficiency caused by sample pages that differ from one another. The combined method considers similarity not only from the external features of a page but also from its internal structure, so the resulting score measures the similarity between pages well.

(2) The DOM tree model of the target page is obtained, and an XPath generation algorithm is applied to the DOM tree to produce XPath expressions for the information points the user is interested in.

(3) XSLT's strength in transforming XML documents is exploited: the generated XPath expressions are combined with XSLT to form patterns, which gives the extraction module unified rules and makes efficient extraction of key information convenient.

Experiments show that the proposed XML-based Web information extraction system can extract the key information from the sample pages and achieves relatively high recall and precision.
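
Step (2) can be illustrated with a minimal sketch. It is not the thesis's XPath generation algorithm, which the abstract does not specify. It assumes the page has already been cleaned into well-formed XML, uses Python's standard xml.etree.ElementTree in place of a full DOM, and produces a purely positional path. The two page snippets and the names build_xpath and extract are hypothetical, introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page on which the user marks the information point.
SAMPLE_PAGE = """<html><body>
<div><h1>Shop</h1></div>
<div><span>Book A</span><span>12.50</span></div>
</body></html>"""

# Hypothetical structurally similar page to extract from.
SIMILAR_PAGE = """<html><body>
<div><h1>Shop</h1></div>
<div><span>Book B</span><span>8.00</span></div>
</body></html>"""


def build_xpath(root, target):
    """Return a positional path from root down to target."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        # XPath positions are 1-based and count siblings with the same tag.
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "./" + "/".join(reversed(steps)) if steps else "."


def extract(page_source, xpath):
    """Apply a generated path to another page and return the node text."""
    node = ET.fromstring(page_source).find(xpath)
    return node.text if node is not None else None


sample_root = ET.fromstring(SAMPLE_PAGE)
# Stand-in for the user's selection: the element whose text is "12.50".
price_node = next(e for e in sample_root.iter() if e.text == "12.50")
price_xpath = build_xpath(sample_root, price_node)

print(price_xpath)                         # ./body[1]/div[2]/span[2]
print(extract(SIMILAR_PAGE, price_xpath))  # 8.00
```

A positional path like this only works when the target page shares the sample page's structure, which is why the similar-page selection in step (1) matters. Python's standard library has no XSLT processor, so the XSLT-based rules of step (3) are not shown.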
Keywords/Search Tags:Web information extraction, XML, DOM, XPath, XSLT