Research On Selection Of Initial-URLs Based On User Ontology

Posted on:2010-04-14

Degree:Master

Type:Thesis

Country:China

Candidate:Y T Wang

GTID:2178360275999910

Subject:Computer application technology

Abstract/Summary:

In recent years, advances in technology have given people access to abundant resources on the internet. However, the internet's huge data volume, complexity, highly dynamic nature, and diverse user base make those resources difficult to exploit. Topic-focused search engines were proposed in response and have attracted many researchers. The crawler of such an engine, called a "topic crawler", identifies the topic automatically, visits related web pages quickly, and downloads pages selectively. It returns the same information to every user. Because of the complex structure of the web and limits on crawler efficiency, important issues remain: how to identify topics more efficiently, how to download more relevant pages in fewer iterations, and how to reach authoritative pages by passing through irrelevant ones. This thesis studies the entry points of topic crawling. It shows, in theory and in practice, that the initial URLs play an important role in guiding the crawl during its early stage, and it proposes a feasible method for selecting them.

First, the thesis runs experiments on a data set that simulates the web structure. The results confirm that a topic crawler downloads more relevant pages in fewer iterations when it starts from suitable initial URLs, especially in the early crawling stage. The proposed approach is therefore efficient and valuable.

Second, combining semantic information with link structure, the thesis proposes an ontology-based algorithm, OntoSelectSeeds. The algorithm has four characteristics:

① It improves on the HITS algorithm, which considers link structure only. Because HITS ignores the content of web pages when expanding the root set into the base set, it suffers from topic drift. OntoSelectSeeds instead uses the user ontology to produce a weighted expansion of the user's topic of interest, and then uses the expanded topic to prune the base set, which makes topic identification more efficient.

② Using the notion of a "complete biograph" (a complete bipartite graph), it reduces the problem of extracting a connected graph from a large graph to the easier problem of extracting a complete biograph from a topic area.

③ Extracting the complete biograph yields a pair of node sets, named Hset and Aset. With these two sets, the rest of the topic area can be found easily.

④ It removes Hset and Aset from the hub page set and the authority page set respectively, re-ranks the remaining hub and authority pages, extracts the next complete biograph, and removes it from the hub set and authority set in turn. These operations repeat until enough initial URLs have been collected.

Finally, the thesis evaluates the method experimentally from two aspects: ① the PageRank of the web pages downloaded when the initial URLs are selected by the proposed algorithm, compared with starting from a random URL; ② the number of web pages downloaded when the initial URLs are selected by the proposed algorithm, compared with starting from a random URL. The results show that the proposed algorithm is effective and, in particular, that it outperforms random URLs in the early stage of crawling.
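
The four characteristics above can be illustrated with a small Python sketch. It is not the thesis's OntoSelectSeeds implementation, which the abstract does not give in detail. Every name here (prune_base_set, hits, extract_complete_bipartite, select_seeds) is hypothetical. The weighted_terms dictionary is an assumed stand-in for the ontology-expanded topic, the bipartite extraction is a simple greedy heuristic, and treating the Hset pages as the initial URLs is an assumption as well.

```python
def prune_base_set(pages, weighted_terms, threshold):
    """Keep pages whose summed term weights reach the threshold.

    pages: dict url -> page text. weighted_terms: dict term -> weight,
    an assumed stand-in for the ontology-expanded user topic.
    """
    kept = set()
    for url, text in pages.items():
        words = set(text.lower().split())
        score = sum(w for term, w in weighted_terms.items() if term in words)
        if score >= threshold:
            kept.add(url)
    return kept


def _normalize(scores):
    """Scale scores to unit Euclidean length (all zeros stay zeros)."""
    norm = sum(v * v for v in scores.values()) ** 0.5
    return {n: (v / norm if norm else 0.0) for n, v in scores.items()}


def hits(links, nodes, iterations=30):
    """Plain HITS on the subgraph induced by `nodes`.

    links: dict source url -> set of target urls.
    Returns (hub, authority) score dicts.
    """
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src in nodes:
            for dst in links.get(src, ()):
                if dst in auth:
                    auth[dst] += hub[src]
        auth = _normalize(auth)
        # Hub score: sum of authority scores of pages linked to.
        hub = _normalize({
            n: sum(auth[d] for d in links.get(n, ()) if d in auth)
            for n in nodes
        })
    return hub, auth


def extract_complete_bipartite(links, nodes, hub):
    """Greedy heuristic: Aset is the out-links of the best hub;
    Hset is every page that links to all of Aset."""
    ranked = sorted(nodes, key=lambda n: (-hub[n], n))
    hset, aset = [], set()
    for h in ranked:
        targets = set(links.get(h, ())) & nodes
        targets.discard(h)
        if not hset:
            if targets:
                hset, aset = [h], targets
        elif h not in aset and aset <= targets:
            hset.append(h)
    return hset, aset


def select_seeds(links, nodes, wanted):
    """Repeat: rank, extract (Hset, Aset), remove both, until enough seeds."""
    remaining = set(nodes)
    seeds = []
    while len(seeds) < wanted and remaining:
        hub, _auth = hits(links, remaining)
        hset, aset = extract_complete_bipartite(links, remaining, hub)
        if not hset:
            break
        seeds.extend(hset)  # assumption: hub pages serve as initial URLs
        remaining -= set(hset) | aset
    return seeds[:wanted]
```

In use, prune_base_set would first filter the base set by topic, and select_seeds would then run on the link graph restricted to the surviving pages. Re-running HITS after each removal corresponds to the re-ranking step in ④.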

Keywords/Search Tags:

Initial URLs, User Ontology, Complete Biograph, HITS, Topic Area

Related items

1	Research On Selection Of Seed-URLs Based On User-interest Ontology
2	Initial URLs Optimization In Search Engine
3	Construction Of User-Query Semantic Ontology(UQSO) For Personalized Topic Search Engine
4	An Application Of Improved HITS Algorithm In User Influence Valuation System Of SNS Websites
5	Microblog User Influence Based On Improved HITS Algorithm
6	Research On HITS Algorithm In Web Structure Mining
7	Optimization And Implementation Of HITS In Web Structure Mining
8	Research On The Key Technology And Implementation Of The Focused Crawler Based On HITS And Shark-Search
9	Research On Important User Recommendation Methods Based On Personalized Tag And Microblog Topic
10	Topic-Specific Crawling And Search Routing Research Based On Peer-to-Peer Network