Research On Selection Of Initial-URLs Based On User Ontology

Posted on: 2010-04-14
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Wang
GTID: 2178360275999910
Subject: Computer application technology
Abstract/Summary:
In recent years, as technology has developed, people have gained access to abundant resources on the Internet. However, the Internet's huge volume of data, its complexity, its highly dynamic nature, and the diversity of its users make these resources difficult to exploit. Topic-focused search engines were proposed in response and have attracted considerable research interest. The crawler of such an engine, called a "topic crawler", identifies the topic automatically, visits related web pages quickly, and downloads web pages selectively. It offers the same information to different users. However, because of the complex structure of the web and the limits of crawler efficiency, several issues remain important: how to identify topics more efficiently, how to download more relevant web pages in fewer iterations, and how to reach authoritative web pages by passing through irrelevant ones. This thesis studies the entry points of topic crawling. It shows, in theory and in practice, that the initial URLs play an important role in guiding the crawl during its early stage, and it proposes a feasible method for selecting them.

Firstly, the thesis runs experiments on a data set that simulates the web structure. The results confirm that a topic crawler downloads more relevant web pages in fewer iterations when it starts from suitable initial URLs, especially in the early crawling stage. The proposed solution is therefore efficient and valuable.

Secondly, combining semantic information with link structure, the thesis proposes an ontology-based algorithm, OntoSelectSeeds. The algorithm has four characteristics:
① It improves on the HITS algorithm, which considers link structure only. Because HITS ignores the content of web pages when expanding the root set into the base set, it suffers from topic drift. OntoSelectSeeds therefore uses the user ontology to expand the user's topic of interest with weights, and then uses the expanded topic to prune the base set, which makes topic identification more efficient.
② By using the "complete biograph" (complete bipartite graph), the problem of extracting a connected graph from a large graph is transformed into the problem of extracting a complete bipartite graph from a topic area, which is easier.
③ Extracting the complete bipartite graph yields a pair of node sets, named Hset and Aset. With these two sets, the rest of the topic area can be found easily.
④ It removes the two sets from the hub page set and the authority page set respectively, re-ranks the hub and authority page sets, extracts the next complete bipartite graph, and again deletes it from the hub and authority sets. These operations are repeated until enough initial URLs have been collected.

Finally, the thesis evaluates the solution experimentally from two aspects:
① It compares the PageRank of the web pages downloaded when the initial URLs are selected by the proposed algorithm with the PageRank of those downloaded from a randomly selected URL.
② It compares the number of web pages downloaded when the initial URLs are selected by the proposed algorithm with the number downloaded from a randomly selected URL.
The results show that the proposed algorithm is effective and, in particular, that it outperforms random URLs in the early crawling stage.
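
The abstract gives no pseudocode, so the following Python sketch only illustrates the general shape of the four steps and is not the thesis's OntoSelectSeeds implementation. It assumes that the "complete biograph" means a complete bipartite graph between hub pages and authority pages. All names (hits, prune, extract_core, select_seeds) are hypothetical, the relevance dictionary is a stand-in for the ontology-based weighted topic expansion, the greedy core extraction is one possible strategy, and treating the members of Hset and Aset as the seeds is also an assumption.

```python
def hits(graph, iterations=50):
    """Plain HITS on a dict {page: set of pages it links to}."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for u, targets in graph.items():
            for v in targets:
                auth[v] += hub[u]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        # Normalize both score vectors to unit length.
        for scores in (auth, hub):
            norm = sum(x * x for x in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


def prune(graph, relevance, threshold):
    """Drop pages whose (assumed, precomputed) topic relevance is too low."""
    keep = lambda n: relevance.get(n, 0.0) >= threshold
    return {u: {v for v in targets if keep(v)}
            for u, targets in graph.items() if keep(u)}


def extract_core(graph, hub, min_auth=2):
    """Greedily grow (Hset, Aset) so every hub in Hset links to every page in Aset."""
    ranked = sorted((n for n in graph if graph[n]), key=lambda n: (-hub[n], n))
    if not ranked:
        return set(), set()
    hset, aset = {ranked[0]}, set(graph[ranked[0]])
    for h in ranked[1:]:
        common = aset & graph[h]
        if len(common) >= min_auth:  # keep the shared authority set non-trivial
            hset.add(h)
            aset = common
    return hset, aset


def select_seeds(graph, relevance, threshold=0.5, wanted=6, min_auth=2):
    """Prune, rank, extract a core, remove it, and repeat until enough seeds."""
    g = prune(graph, relevance, threshold)
    seeds = []
    while len(seeds) < wanted:
        hub, auth = hits(g)  # re-rank on the remaining graph
        hset, aset = extract_core(g, hub, min_auth)
        if not hset:
            break  # no linked pages left
        seeds += sorted(hset, key=lambda n: (-hub[n], n))
        seeds += sorted(aset, key=lambda n: (-auth[n], n))
        removed = hset | aset
        g = {u: t - removed for u, t in g.items() if u not in removed}
    return seeds[:wanted]


# Toy link graph: two on-topic clusters plus two off-topic pages.
graph = {
    "h1": {"a1", "a2", "x"}, "h2": {"a1", "a2"}, "h3": {"a1", "a2"},
    "h4": {"a3", "a4"}, "h5": {"a3", "a4"}, "spam": {"x", "a1"},
}
relevance = {n: 1.0 for n in ("h1", "h2", "h3", "h4", "h5", "a1", "a2", "a3", "a4")}
relevance.update({"spam": 0.1, "x": 0.2})

print(select_seeds(graph, relevance))  # ['h1', 'h2', 'h3', 'a1', 'a2', 'h4']
```

In this toy graph the off-topic pages are pruned before ranking, the first core is the three hubs h1 to h3 with the authorities a1 and a2, and the second round picks up the remaining cluster. A real implementation would need a relevance function derived from the user ontology and a more careful core-extraction strategy than this greedy pass.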
Keywords/Search Tags:Initial URLs, User Ontology, Complete Biograph, HITS, Topic Area