
Experimental Studies of an Extended Context Graph Focused Crawler

Posted on: 2006-10-20  Degree: Master  Type: Thesis
Country: China  Candidate: D S Li  Full Text: PDF
GTID: 2208360155468169  Subject: Computer application technology
Abstract/Summary:
The birth of the Internet has changed the way information is published and organized. On the Internet, a user logs on to a Web site, browses web pages, and downloads the information he or she is interested in. However, because there is no unified standard for organizing information, and because information grows rapidly and dynamically, it is difficult to acquire precise on-topic information; users find it hard to locate what they need.

Portal sites such as Yahoo are the most common information sources providing category services, but their categories are too coarse to meet the requirements of experts and scholars: the information available is either superficial or irrelevant.

Search engines have solved the information-location problem to some extent. The first generation of search engines, such as AltaVista, provides full-text indexing. Their ranking strategy is based on the cosine similarity between a query vector and a document vector. This strategy is local, and it is hard to collect on-topic information because too many results are returned with considerable randomness.

The latest search engine, Google, has alleviated the problem to some extent with its global PageRank algorithm. But one of Google's targets is coverage, which conflicts with freshness, its second target: new PageRank values wait as long as three months before they enter the ranking system. Thus it is impossible to obtain on-topic, up-to-date information with generic search engines.

The focused crawler complements these deficiencies of general search engines. Using machine learning, it automatically collects information from the Internet as specified by users. It features fast response, high information quality, and an intelligent, automatic working mode, and it suits scientists and engineers who collect and query on-topic information in a specific technical field, especially within a specific engineering field during a specific R&D process.

The context graph (CG) crawler is a focused crawler developed in recent years. By analyzing the contents of web pages and their links, the crawler advances along the most promising path leading to target documents, finding more target documents at a low cost in crawling irrelevant pages. This feature is very useful for on-topic information collection and on-topic research.

One shortcoming of CG focused crawlers is low information availability. To address it, an Extended Context Graph (ECG) is proposed, which uses the links in seed documents to construct an additional right-hand layer for collecting Web pages similar to those linked by the seeds. The ECG crawler builds its corpus through a random crawling process, and its feature terms are selected with the TF-IDF formula. The Naive Bayes (NB) classifier for each layer is then trained with that layer's documents. The implemented ECG crawler prototype obtains seeds from a pre-implemented meta-search module. The trained NB classifiers are then used to predict how many steps away from a target document a downloaded page is likely to be. To evaluate the feasibility of the ECG crawler, experiments were conducted with a bad start seed and with good start seeds. The tests show that an ECG crawler has higher information availability and a faster retrieved-URL elimination rate, while a CG crawler has a higher harvest rate.
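As an illustration of the training step described above, the following sketch (in Python with scikit-learn, which the thesis does not mention) selects TF-IDF feature terms and trains a Naive Bayes model to assign downloaded pages to context-graph layers. The tiny corpus, the layer labels, and the use of a single multiclass classifier in place of one classifier per layer are all simplifying assumptions, not the author's implementation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical corpus: pages grouped by their layer in the (extended) context graph.
# Layer 0 = target documents, layer 1 = pages one link away, ..., -1 = the "other" layer.
corpus = [
    "deep learning survey paper abstract and references ...",
    "conference proceedings page linking to accepted papers ...",
    "university department course listing ...",
    "sports news front page ...",
]
layers = [0, 1, 2, -1]

vectorizer = TfidfVectorizer(max_features=5000)   # TF-IDF feature-term selection
X = vectorizer.fit_transform(corpus)

classifier = MultinomialNB()
classifier.fit(X, layers)

# At crawl time, predict how many steps away from a target document a new page is
# likely to be; the prediction drives the ordering of its out-links in the frontier.
new_page = ["call for papers, international conference on web mining ..."]
predicted_layer = classifier.predict(vectorizer.transform(new_page))[0]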
With an ECG crawler, both focused information and fairly relevant information are obtained, whereas with a CG crawler some useful Web pages are classified into the "other" layer merely to keep the crawl going. Further tests showed that, by constructing the ECG properly, the ECG crawler can crawl Web sites with similar structures to collect papers presented at international conferences and courseware posted on the Internet by foreign universities.

This machine-learning-based Web crawler prototype has several advantages over traditional information retrieval tools. First, it is an automatic information search tool that runs on the Internet: users deliver the documents they have collected to the software as training documents and then start it running. Second, documents are searched and matched based on their content. Third, once it has collected a preset number of documents meeting the precision requirement, it returns a result list and then downloads the documents fully, at the server levels specified by the user.
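The crawl workflow outlined above could, under the same assumptions as the previous sketch, be driven by a priority queue in which the out-links of pages predicted to lie close to a target document are expanded first. The helpers fetch_page() and extract_links() below are minimal hypothetical placeholders, not components of the thesis prototype.

import heapq
import re
import urllib.request

def fetch_page(url):
    # Hypothetical helper: download a page and return its text (no error handling).
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def extract_links(html):
    # Hypothetical helper: naive absolute-href extraction; a real crawler would
    # also resolve relative links and respect robots.txt.
    return re.findall(r'href="(http[^"]+)"', html)

def focused_crawl(seed_urls, vectorizer, classifier, max_targets=100):
    # Frontier ordered by predicted distance to a target document (lower = crawl sooner).
    frontier = [(0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen, targets = set(seed_urls), []

    while frontier and len(targets) < max_targets:
        _, url = heapq.heappop(frontier)
        text = fetch_page(url)
        layer = classifier.predict(vectorizer.transform([text]))[0]
        if layer == 0:                      # predicted to be a target document
            targets.append(url)
        for link in extract_links(text):
            if link not in seen:
                seen.add(link)
                # Out-links of pages predicted closer to a target get higher priority;
                # the "other" layer (-1) is pushed to the back of the queue.
                heapq.heappush(frontier, (layer if layer >= 0 else 99, link))
    return targets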
Keywords/Search Tags: Focused crawling, Extended context graph, Machine learning, Naive Bayesian classifier, On-topic information service