
The Study And Implementation Of Topic Search And Web Mining

Posted on: 2010-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Liu
Full Text: PDF
GTID: 2178330332488563
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the mass of Web resources has become an important source of knowledge and information. However, because these resources are expanding so quickly, it is difficult for users to find information quickly and accurately, and search engines have become the most widely used information retrieval tools. At present, the service provided by most search engines does not fully satisfy users' needs. Because Web resources are semi-structured, dispersed, real-time and heterogeneous, mining them to extract genuinely useful information and to provide customized services to users has become an important research subject.

This study focuses on topic search and Web mining. Based on the design and implementation of a prototype system, BlueSpider, it discusses the key technologies of search engines and Web mining. The main research tasks are as follows:

Focused Web crawler: Existing crawler search algorithms are analyzed, the available search strategy is improved, and a search algorithm based on a non-greedy policy is proposed.

Web information extraction: A method for obtaining Web information by traversing the HTML document tree is presented, which allows the information in a Web page to be obtained rapidly, flexibly and effectively.

Web document analysis: Methods are proposed that address the semi-structured nature and non-uniform character encoding of Web documents, including transcoding and Chinese word segmentation. In addition, the method for calculating the weight of feature items is improved.

The study also gives the methods that topic search needs for determining the degree of correlation between Web pages or URLs and the subject, and proposes a novel clustering algorithm for analyzing Web pages.

Based on these research results, the design and implementation of the prototype system are described in detail.
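
The abstract does not give the details of the extraction method, so the following is only a minimal sketch of the general idea of obtaining page content by traversing an HTML document tree. It is not BlueSpider's implementation. It assumes Python's standard-library `html.parser`; the names `Node`, `TreeBuilder`, `walk` and `extract` are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text content is not visible page text.
SKIPPED_TAGS = {"script", "style"}


class Node:
    """One element of the document tree (hypothetical helper type)."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # each child is a Node or a str (text)


class TreeBuilder(HTMLParser):
    """Builds a simple tree from the parser's start/end/data events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self._stack[-1].children.append(data.strip())


def walk(node, texts, links):
    """Depth-first traversal collecting visible text and hyperlink targets."""
    if node.tag in SKIPPED_TAGS:
        return
    if node.tag == "a" and node.attrs.get("href"):
        links.append(node.attrs["href"])
    for child in node.children:
        if isinstance(child, str):
            texts.append(child)
        else:
            walk(child, texts, links)


def extract(html):
    """Return (visible_text, links) for one HTML page."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    texts, links = [], []
    walk(builder.root, texts, links)
    return " ".join(texts), links
```

In a focused crawler, the extracted text would typically feed relevance scoring against the topic, while the extracted links would become candidate URLs for the crawl queue.
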
Keywords/Search Tags: Web mining, topic search, Web crawler, Chinese segmentation