
Post-Processing Of Deep Web Querying Result

Posted on: 2012-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: G C Mao
GTID: 2178330335951065
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the World Wide Web, more and more information is published online, and many different kinds of page structure have appeared. The web can be divided into two parts: the Surface Web and the Deep Web. Surface Web content can be reached through traditional search engines. Deep Web content is different: most traditional search engines cannot find it, because it is hidden behind search forms and stored in large searchable databases. To reach this information, users must submit keywords through a query interface, and only then are the results displayed. A survey in July 2000 estimated that there were 43,000-96,000 Deep Web sites and that the Deep Web held 500 times as much content as the Surface Web. A subsequent survey in April 2004 estimated that there were 307,000 Deep Web sites, meaning the number of sites had grown 3-7 times in less than four years.

Because the Deep Web contains so much data that people cannot find through ordinary search, it is necessary to extract Deep Web information. However, result pages come in many different structures, and they also carry a great deal of content that users do not care about, such as advertisements. This content not only takes up a large part of each page but also slows page loading and confuses users. This thesis addresses that problem: we extract information from the online bookselling domain and remove the noisy content so that users can obtain the information they want more conveniently. Researchers in China and abroad have proposed many extraction techniques, for example information extraction based on natural language processing, on the DOM, and on XML. All of these are affected by page structure, and when page structures are complex, accurate extraction becomes very difficult. This thesis presents an extraction method that relies on an ontology. An ontology explicitly describes the concepts in a domain and the relations between them, so that these concepts and relations have explicit, unambiguous definitions within that domain. The ontology itself does not depend on page structure, and if the domain ontology is sufficiently comprehensive, the extraction can be completed accurately.

The work consists of two main parts: construction of the ontology and extraction of the query results. Most web pages today are written in HTML, which uses tags to define the layout of a page. This thesis exploits that property. To construct the domain ontology, we make full use of HTML tags and the structural characteristics of the book domain, match information between the query interfaces and the query result pages of online bookselling sites, use the Result Set Extraction Model (RSEM) to obtain knowledge of the book domain, and then build the ontology with the ontology construction tool Protege. In the extraction part, we first use HTML Parser to analyze the web page, and at the same time delete the content that users do not want, such as advertisement and navigation regions. This yields an HTML tag tree. We match this tree against the OWL ontology as parsed by Jena, identify the information on the page, and extract it. Finally, we put the results in the correct order and store them in a database, which completes the extraction.

To show that the method works, we designed and ran experiments on several sites, for example Dangdang and the China Book web site. We also compared the method with RSEM. The results show that our method can extract the information accurately.
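
The extraction steps described above (parse the page, prune advertisement and navigation regions, then match what remains against domain concepts) can be illustrated with a minimal sketch. It is not the thesis implementation, which uses the Java tools HTML Parser and Jena with an OWL ontology built in Protege. Here Python's standard `html.parser` stands in for the parser, and a small dictionary stands in for the ontology. The names `BOOK_ONTOLOGY`, `NOISE_MARKERS`, `BookExtractor`, `extract_books`, and the sample page are all assumptions made for illustration, as is the choice to treat each `<li>` as one result record with "label: value" text.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the book-domain ontology: maps label text seen on
# result pages to a concept name. A real ontology would also hold relations.
BOOK_ONTOLOGY = {
    "title": "Title", "book name": "Title",
    "author": "Author", "writer": "Author",
    "price": "Price", "publisher": "Publisher", "isbn": "ISBN",
}

# Hypothetical class/id tokens that mark regions users do not want.
NOISE_MARKERS = {"ad", "advert", "nav", "banner"}
# Tags without a closing tag; ignored so they do not disturb depth counting.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BookExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a pruned (noise) subtree
        self.current = None   # record being filled, or None outside <li>
        self.records = []     # one dict per result block

    def _is_noise(self, attrs):
        for name, value in attrs:
            if name in ("class", "id") and value:
                tokens = value.lower().replace("-", " ").replace("_", " ").split()
                if any(token in NOISE_MARKERS for token in tokens):
                    return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or self._is_noise(attrs):
            self.skip_depth += 1   # entering, or nesting deeper in, a pruned subtree
            return
        if tag == "li":
            self.current = {}

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if tag == "li":
            if self.current:
                self.records.append(self.current)
            self.current = None

    def handle_data(self, data):
        if self.skip_depth or self.current is None:
            return
        label, sep, value = data.partition(":")
        concept = BOOK_ONTOLOGY.get(label.strip().lower())
        if sep and concept and value.strip():
            self.current[concept] = value.strip()


def extract_books(html):
    parser = BookExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


SAMPLE_PAGE = """
<div id="nav"><a href="/">Home</a></div>
<ul>
<li><span>Book name: Deep Web Mining</span><span>Writer: G. Example</span><span>Price: 35.00</span></li>
<li><div class="ad-banner">Price: 0.99 today only</div><span>Title: Ontology Basics</span></li>
</ul>
"""

if __name__ == "__main__":
    for record in extract_books(SAMPLE_PAGE):
        print(record)
```

Because the labels "Book name" and "Title" map to the same concept, the two records are normalized to the same field names even though the pages word them differently, which is the sense in which concept matching is less tied to page layout than purely structural rules. The sketch assumes well-balanced tags; a real page would need a more tolerant tree builder.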
Keywords/Search Tags: Deep Web, Ontology, Information extraction, HTML Parser, RSEM