
Research On Topical Web Crawling Technique For Topic-Specific Search Engine

Posted on: 2008-04-25    Degree: Doctor    Type: Dissertation
Country: China    Candidate: T Peng    Full Text: PDF
GTID: 1118360212997743    Subject: Computer application technology
Abstract/Summary:
With the rapid expansion and growth of web page information on the WWW, it is becoming harder to retrieve the information and knowledge relevant to a specific domain. Topical web crawling, a technique for retrieving domain-specific information, has therefore received growing attention and development in recent years. It is applied not only in topic-specific search engines but also in other fields such as digital libraries. Accordingly, research on topical web crawling has both academic significance and broad application prospects. The major contents of this dissertation can be summarized as follows:

(1) This dissertation gives a general survey of topical web crawling and related techniques, and analyzes their background and course of development. After introducing and analyzing the development of search engines and of text classification, it presents the advantages of and the need for topic-specific search engines, and also discusses the future of search engines. The basic theory and strategies of topical web crawling and of text classification, which are the groundwork for the later research, are introduced and analyzed as well.

(2) Evaluating the relevance of a target web page from information about the page is an effective topical crawling approach. A knowledgeable person can naturally identify a target page before it is displayed; making a computer imitate this judgment is a challenge. Anchor texts and link information are often too short to be informative, so forecasting target pages from anchor text alone is unreliable. We therefore expand the anchor text to its link context, which is an effective way to enrich it. Several link-context extraction methods already exist; after analyzing and comparing them, this dissertation presents a new link-context extraction method based on the DOM tree. Experimental results show that, with appropriate parameters, topical crawling with this method performs better. To obtain the initial feature set of the topical crawler, we extract the contexts of the backward links of the seed URLs, which ordinarily summarize the content of the seed URLs; this yields a small but accurate topical feature set that guides the crawl. Building on the link-context-based crawling strategy, this dissertation also presents a self-adaptive method that evolves the topical feature set during crawling. Experimental results show that the method strengthens the crawler's adaptability while avoiding topic drift.

(3) Particle Swarm Optimization (PSO) is a computational intelligence method motivated by the social behavior of organisms: a population-based, iterative optimization strategy that is effective, understandable, and easy to implement, and is applied in many fields. Based on the original PSO algorithm, this dissertation presents an improved algorithm named BWPSO. Experimental results show that BWPSO requires fewer iterations than standard PSO to reach the same results, so BWPSO is applicable and more efficient for solving optimization problems.
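As context for the comparison with BWPSO, the following is a minimal sketch in Python of the standard PSO loop that BWPSO builds on. The abstract does not specify BWPSO's modifications, so only the baseline update is shown, and every parameter value here is an illustrative assumption.

    import random

    def pso(f, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5,
            lo=-5.0, hi=5.0):
        """Minimize f over [lo, hi]^dim with the standard PSO update."""
        pos = [[random.uniform(lo, hi) for _ in range(dim)]
               for _ in range(n_particles)]
        vel = [[0.0] * dim for _ in range(n_particles)]
        pbest = [p[:] for p in pos]            # each particle's best position
        pbest_val = [f(p) for p in pos]
        g = min(range(n_particles), key=lambda i: pbest_val[i])
        gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm's best position
        for _ in range(iters):
            for i in range(n_particles):
                for d in range(dim):
                    r1, r2 = random.random(), random.random()
                    # velocity: inertia plus pulls toward the personal
                    # best and the global best
                    vel[i][d] = (w * vel[i][d]
                                 + c1 * r1 * (pbest[i][d] - pos[i][d])
                                 + c2 * r2 * (gbest[d] - pos[i][d]))
                    pos[i][d] += vel[i][d]
                val = f(pos[i])
                if val < pbest_val[i]:
                    pbest[i], pbest_val[i] = pos[i][:], val
                    if val < gbest_val:
                        gbest, gbest_val = pos[i][:], val
        return gbest, gbest_val

    # usage: minimize the sphere function in 3 dimensions
    best, best_val = pso(lambda x: sum(xi * xi for xi in x), dim=3)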
(4) Text classification, which is used to guide the topical crawler, is a key technique in web information retrieval. In this dissertation we assign different weights to different parts of the same web page according to its structure, and we search for and exploit the rules that web pages have in common to compute the feature weights of a page. Because web pages are diverse and the training set is seldom representative enough, we build a set of classifiers by iteratively applying the SVM algorithm to the training set; and because PSO is an iterative optimization strategy, we use BWPSO to synthesize the classifiers produced by the training iterations into a final classifier. Experimental results show that the combination of classifier synthesis, page-structure partitioning, and common rules for computing feature weights greatly improves classification performance.

(5) Since web pages are frequently added, deleted, and modified, research on incremental topical crawling is of great importance. In this dissertation, incremental topical crawling comprises two parts: incremental learning in the algorithm and incremental updating of web pages. Incremental learning provides the algorithm's self-adaptability and self-improvement: in topical crawling it is impossible to predefine the entire sample set relevant to a specific topic, so the learner must use strategies that select favorable training data to improve itself during training. For incremental learning, this dissertation applies an improved 1-DNF algorithm to extract reliable negative data from the training set of the PU (positive and unlabeled) classification problem, and then constructs the classifier by applying SVM iteratively. In SVM classification the decisive data are the support vectors (SVs), and partitioning the support vector set is equivalent to partitioning the whole data set. Accordingly, during training, each iteration's SV set, its misclassified samples, and the unclear samples among its correctly classified results form the training set of the next iteration. This saves training time and space without harming classification accuracy. The research on incremental page updating concentrates on deducing and identifying the pages that have changed (been added or modified) since the last crawl. The key problem is to discover the rules by which web pages change; following these rules, the crawler crawls only the pages that have changed, saving network bandwidth while keeping the collection up to date. After defining incremental page updating, this dissertation presents the rules used to determine whether a page has changed. We use a DOM-tree-based method to estimate the content of a page, and we analyze the randomness of page changes; through sampling estimation based on experiments, we evaluate the change frequency of web pages with the relevant mathematical models. Finally, we present a crawling architecture and algorithm that use incremental page updating. Experimental results show that, by self-adjusting its parameters, this method can learn the rules by which web pages change.
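A minimal sketch of the SV-set retraining step described above, assuming scikit-learn's SVC for the SVM. The function name, the margin threshold used to pick "unclear" samples, and the linear kernel are illustrative assumptions rather than the dissertation's implementation, and the 1-DNF extraction of negative data is not shown.

    import numpy as np
    from sklearn.svm import SVC

    def incremental_round(clf, X_old, y_old, X_new, y_new, margin=1.0):
        """One incremental round: keep only the informative part of the old
        training set (support vectors, misclassified samples, and 'unclear'
        samples inside the margin), add the new batch, and retrain."""
        keep = np.zeros(len(X_old), dtype=bool)
        keep[clf.support_] = True                       # support vectors
        keep |= clf.predict(X_old) != y_old             # misclassified samples
        keep |= np.abs(clf.decision_function(X_old)) < margin  # near boundary
        X = np.vstack([X_old[keep], X_new])
        y = np.concatenate([y_old[keep], y_new])
        return SVC(kernel="linear").fit(X, y), X, y

    # usage, assuming clf was previously fit on (X_old, y_old):
    # clf, X_old, y_old = incremental_round(clf, X_old, y_old, X_new, y_new)

Because only the SV set and the borderline samples are carried forward, the training set stays small across iterations, which is where the claimed savings in time and space come from.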
(6) Owing to the complexity of the web environment and the topic multiplicity of web page contents, it is quite difficult to collect all the pages relevant to a specific topic. An irrelevant page may link to a relevant one, so the crawler needs to traverse irrelevant pages to reach more relevant ones; this procedure is called tunneling. This dissertation partitions tunneling into grey tunneling and black tunneling. Grey tunneling addresses the problem that topic multiplicity weakens the measured relevance of a highly relevant page: to avoid penalizing a page that is irrelevant as a whole but relevant in part, we divide a multi-topic page into several blocks and process the blocks individually, so the crawler can traverse such pages, expand its reach, and gather more relevant pages. For black tunneling we present a probing method: all irrelevant pages are stored temporarily and all of their out-links are extracted for crawling; during crawling, each irrelevant page is assigned a depth value, derived from the relevance of its parent page, that determines whether the page is kept, thereby broadening the scope of the topical crawler. Experimental results show that both tunneling methods achieve the expected effect.

(7) Like a general search engine, a topical search engine consists of three parts: collecting topic-relevant information from the Internet, building the index, and providing the retrieval service. Based on the topical crawling techniques above, this dissertation constructs a topical search engine, LookClearTSSE (LookClear Topic-Specific Search Engine). Its topical crawler, LciSpider (LookClear Intelligent Spider), uses several crawling strategies introduced in the dissertation to collect topic-relevant information: during crawling, the weights of the uncrawled URLs extracted from the downloaded pages are computed to determine their priorities, and LciSpider implements strategies such as Breadth-First, Best-First, link context, and content block partition. The crawled pages are first pre-processed to obtain the set of original pages; a forward index is built, the inverted index is then derived from it, an incremental index structure is constructed, and an information retrieval interface is provided. The dissertation presents a block-based linked structure for storing the index, which makes the time cost of an index update depend only on the number and sizes of the added pages. This space-for-time approach outperforms contiguous index storage in update rate, and it provides much higher query efficiency than a naive linked-list structure while still supporting real-time updates.
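To make the Best-First ordering and the black-tunneling depth value concrete, here is a minimal sketch of a crawl frontier in Python. The class name, the example URL, the relevance threshold, and the depth budget are illustrative assumptions, not values taken from LciSpider.

    import heapq

    class Frontier:
        """Best-First crawl frontier: URLs are popped in order of estimated
        relevance; an irrelevant page may still be traversed (tunneling)
        while its inherited depth budget remains non-negative."""

        def __init__(self):
            self._heap, self._seen = [], set()

        def push(self, url, relevance, depth_budget):
            if url not in self._seen and depth_budget >= 0:
                self._seen.add(url)
                # negate relevance because heapq is a min-heap
                heapq.heappush(self._heap, (-relevance, url, depth_budget))

        def pop(self):
            neg_rel, url, depth_budget = heapq.heappop(self._heap)
            return url, -neg_rel, depth_budget

    # usage: children of a relevant page get a fresh budget; children of an
    # irrelevant page inherit a decremented budget from their parent
    frontier = Frontier()
    frontier.push("http://example.org/", relevance=0.9, depth_budget=3)
    url, rel, budget = frontier.pop()
    child_budget = 3 if rel >= 0.5 else budget - 1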
Keywords/Search Tags: Topic-Specific