
Utility-driven Topic Web Mining Algorithm Research

Posted on: 2008-03-13
Degree: Master
Type: Thesis
Country: China
Candidate: G Q Du
Full Text: PDF
GTID: 2178360215471645
Subject: Computer software and theory
Abstract/Summary:
With the emergence and rapid development of the Internet, the web has become the world's largest information repository. Faced with such a huge pool of pages, however, users no longer find it simple to reach the information they need. General search engines meet people's needs to some extent, but because of their universal nature they still fail to serve users with different backgrounds, different purposes, and different search periods. The query result a user obtains from the web is often a long list containing much repeated and unrelated information, and finding the wanted information in it is very difficult. Therefore, to satisfy advanced or specialized information retrieval requests, we need a comprehensive set of web pages oriented to a specific subject (or specific domain); the study of topic web mining emerged for this purpose.
The limitation of the general search engine is that it attempts to index the entire web and to cover query requests on every theme; facing so huge a web, its ambition clearly exceeds its capacity. The topical crawler is the core technology of topic web mining. It covers only the parts of the web related to a specific theme. Its principle is to find as many relevant pages and as few irrelevant pages as possible, so that it can search deeper, the search cycle can be shorter, and users can access information more rapidly and more accurately.
The main research work of this paper is technical analysis and research centered on the utility of the topical crawler. How the topical crawler should visit the web and improve its efficiency during a subject search is one of the hotspots in topic web mining research. The dynamic, heterogeneous, and complex nature of the web requires the topical crawler to extract web information quickly and efficiently, and to guarantee the timeliness and validity of that information.
The main work is as follows:
(1) This chapter first introduces the basic architecture, working mechanism, and current status of the general search engine; then analyzes the research background, the task, and the current state of research on topic web mining; discusses the essential technology and the main implementation points of the topical crawler; and finally analyzes the relationship between the general search engine and topic web mining.
(2) According to the different ways of evaluating hyperlinks, we classify and systematically analyze topical crawlers, compare their features, advantages, and disadvantages, and sum up three key factors for enhancing a topical crawler's search efficiency. Because the real-time and professional demands of topic web mining are much higher than those of general search engines, this paper proposes an incremental information extraction algorithm based on indexing web pages, which can discover newly added web pages efficiently and rapidly.
(3) Considering that algorithms based on the hyperlink structure and algorithms based on the vector space model each have their own limitations and are complementary, we improve the traditional hyperlink structure algorithm and propose a hyperlink structure algorithm based on the vector space model. On the one hand, the algorithm obtains the out values and the entry values of a web page through hyperlink analysis; on the other hand, it judges relevance objectively and accurately by matching the anchor text and the hyperlink context at the same time. It achieves better performance.
(4) Considering that the efficiency of topical crawlers is currently not high, we propose a set of design options for a topical crawler based on relevance to the subject and an efficient crawling strategy, show that the design is feasible, and analyze and verify its implementation in detail. The experimental results indicate that although the topical crawler is more time-consuming than the general crawler, it also brings a positive effect: it reduces the crawling workload to some extent. A page is not processed further once it enters the discard-queue, whereas the general crawler processes all pages without selection, so the accuracy and precision of the topical crawler are better than those of the general crawler.
Topic web mining can achieve higher recall and higher precision, and can satisfy the information retrieval needs of advanced users and professionals. At present, topical crawler technology has become a new research direction that combines collecting technology with filtering methods; it will also be a research hotspot in the field of information retrieval, and it provides a new solution for the use of web information.
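The abstract does not give the algorithms themselves, so the following is only a minimal illustrative sketch of two ideas it mentions: scoring a hyperlink by matching its anchor text and surrounding context against the topic in a vector space model, and sending low-scoring links to a discard-queue so they are never processed. The function names, the in-memory `pages` graph, the plain term-frequency weighting, and the `threshold` value are all assumptions made for illustration; the out values and entry values from hyperlink analysis are not modeled here.

```python
import heapq
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (a very simple vector space model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def crawl(pages, seed, topic, threshold=0.2):
    """Best-first crawl over an in-memory link graph (no network access).

    pages: url -> list of (target_url, anchor_text, context_text)
    Returns (visited, discarded): URLs in visit order, and the discard-queue.
    """
    topic_vec = term_vector(topic)
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so scores are negated
    seen = {seed}
    visited, discarded = [], []
    while frontier:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for target, anchor, context in pages.get(url, []):
            if target in seen:
                continue
            seen.add(target)
            # Match anchor text and link context against the topic together.
            score = cosine(topic_vec, term_vector(anchor + " " + context))
            if score >= threshold:
                heapq.heappush(frontier, (-score, target))
            else:
                discarded.append(target)  # never fetched or processed later
    return visited, discarded


# Hypothetical toy link graph, used only to exercise the sketch.
demo_pages = {
    "seed": [
        ("a", "web crawler design", "topical crawler for search engine"),
        ("b", "cooking recipes", "pasta and sauce"),
    ],
    "a": [
        ("c", "hyperlink analysis", "crawler uses hyperlink structure"),
        ("b", "cooking recipes", "pasta and sauce"),
    ],
}
demo_topic = "topical crawler hyperlink search engine"
demo_visited, demo_discarded = crawl(demo_pages, "seed", demo_topic)
print(demo_visited, demo_discarded)
```

In this toy run the off-topic link is placed in the discard-queue when first seen and is skipped when it reappears, which is the workload reduction the abstract describes. A general crawler would correspond to setting `threshold=0.0`, so that every link is followed.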
Keywords/Search Tags: topic web mining, topical crawler, hyperlink, search engine, vector space model