Deep Web Mining Based On User Browsing Behavior

Posted on: 2013-01-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J T Jiang
Full Text: PDF
GTID: 1228330377951755
Subject: Signal and Information Processing
Abstract/Summary:
With the rapid development of the Internet, the World Wide Web has come to contain a tremendous amount of valuable information, and that information is still growing quickly. Information on the Web is mainly published via static Web pages, each of which contains a number of outgoing URLs pointing to other static pages. Traditional search engines rely on these outgoing URLs to collect, index and present pages. However, besides static Web pages, a large proportion of the information on the Web is stored in online Web databases. Such information does not exist in static pages; instead, it is retrieved on demand and displayed as dynamic Web pages in response to queries submitted by users. Because no static URLs point to these dynamic pages, traditional search engines have difficulty discovering them, and the information is therefore "hidden" from users. The collection of such "hidden" information is called the Deep Web (also known as the Invisible Web or Hidden Web), and correspondingly the collection of static Web pages is called the Surface Web.

The Deep Web now holds much more information than the Surface Web; in particular, its high-quality information is more than 2000 times that of the Surface Web. However, exploiting the high-quality information in the Deep Web effectively and completely remains a huge challenge, and the most important problems are Deep Web discovery and Deep Web crawling. Existing research on these two problems is hard to apply at large scale because each approach has its own limitations: for example, some methods require human interaction and others depend on specific topics.
This dissertation addresses Deep Web mining, focusing on the problems of Deep Web discovery and Deep Web crawling, in order to make it convenient for users to exploit Deep Web information and to encourage the development of the Deep Web. After carefully investigating user browsing behavior and summarizing the specific user browsing path in the Deep Web, we proposed automatic, topic-independent and efficient methods for Deep Web discovery and Deep Web crawling, which make Deep Web mining feasible in large-scale applications. The main contents and contributions of this dissertation are as follows:

1. An in-depth investigation of user browsing behavior in the Deep Web
We first investigated user browsing behavior in the Deep Web and the Surface Web in depth, transformed it into a visualized graph (the browsing map), and carefully compared the behavior observed in the two. Then, based on the function, layout and URL rules of pages in the Deep Web, we proposed a model user browsing path: Form Page → List Page → Object Page. This browsing path captures the specific characteristics of user browsing behavior in the Deep Web. To the best of our knowledge, this is the first time such a concept has been proposed.

2. An efficient method for Deep Web discovery
Based on the specific user browsing path in the Deep Web, we proposed an efficient method to discover Deep Web sites from browse logs. The method first groups form pages, list pages and object pages through URL clustering, and rebuilds the browsing map from the jumps between pages. It then tries to detect the specific user browsing path in the browsing map. If such a path is detected and satisfies certain requirements, the site is considered a Deep Web site. The method is very efficient and topic independent because it clusters URLs instead of fetching pages and clustering their content. In addition, discovering Deep Web sites from browse logs further reduces the cost, and increases both the precision of Deep Web discovery and the probability of discovering high-quality Deep Web sites.

3. An efficient method for Deep Web crawling
Based on the specific user browsing path in the Deep Web, we proposed an efficient method to crawl Deep Web sites. Observing that users visit a large number of object pages while browsing, we simulate user browsing to collect as many object pages as possible. Starting from the form page, the method first collects a number of list pages. It then uses an HTML DOM tree alignment technique and the layout of object URLs to detect object URLs in the collected list pages. Next, it exploits the characteristics of page-flipping URLs to detect such URLs in both list pages and object pages. After collecting enough URLs, the method learns URL rules from the detected URLs and uses the learned rules to crawl the target Deep Web sites, which increases crawling efficiency.
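The discovery step described above (URL clustering, rebuilding the browsing map, and detecting the Form Page → List Page → Object Page path) can be illustrated with a minimal Python sketch. It is not the dissertation's actual algorithm: the abstract does not specify the clustering rule or the "requirements" a path must satisfy. The digit-masking rule in `url_pattern`, the growing-cluster-size test, the `min_objects` threshold and the `books.example` log entries are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def url_pattern(url):
    """Generalize a URL into a cluster key (assumed rule, not from the source):
    digit runs in the path become {n}; query values are dropped and only the
    sorted parameter names are kept."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "{n}", parts.path)
    keys = sorted({k for k, _ in parse_qsl(parts.query, keep_blank_values=True)})
    query = "?" + "&".join(keys) if keys else ""
    return parts.netloc + path + query


def build_browsing_map(log):
    """Build a browsing map from (referrer_url, visited_url) pairs of one site."""
    members = defaultdict(set)  # pattern -> distinct URLs in that cluster
    edges = defaultdict(int)    # (source pattern, target pattern) -> jump count
    for referrer, url in log:
        src, dst = url_pattern(referrer), url_pattern(url)
        members[src].add(referrer)
        members[dst].add(url)
        if src != dst:
            edges[(src, dst)] += 1
    return members, edges


def find_deep_web_path(members, edges, min_objects=3):
    """Return (form, list, object) patterns for the first chain a -> b -> c whose
    cluster sizes strictly grow along the chain, or None if no chain qualifies.
    The size test is a hypothetical stand-in for the unspecified requirements."""
    for a, b in edges:
        for b2, c in edges:
            if b2 != b or c == a:
                continue
            sizes = (len(members[a]), len(members[b]), len(members[c]))
            if sizes[0] < sizes[1] < sizes[2] and sizes[2] >= min_objects:
                return a, b, c
    return None


# Hypothetical browse log for a made-up site.
sample_log = [
    ("http://books.example/search", "http://books.example/results?q=web"),
    ("http://books.example/search", "http://books.example/results?q=mining"),
    ("http://books.example/results?q=web", "http://books.example/item/1"),
    ("http://books.example/results?q=web", "http://books.example/item/2"),
    ("http://books.example/results?q=mining", "http://books.example/item/3"),
]
members, edges = build_browsing_map(sample_log)
path = find_deep_web_path(members, edges)
print(path)
```

Because the sketch works only on URL strings from the log, it never fetches a page, which mirrors the efficiency and topic-independence argument made for the discovery method.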
Keywords/Search Tags: Deep Web, Deep Web Mining, User Browsing Behavior, Browsing Path, Deep Web Discovery, Deep Web Crawling