Topical Web Crawler and Its Design

Posted on: 2009-07-23 | Degree: Master | Type: Thesis
Country: China | Candidate: L F Zhu | Full Text: PDF
GTID: 2208360245979180 | Subject: Control theory and control engineering
Abstract/Summary:
With the rapid growth of network resources, finding accurate and relevant information quickly is becoming increasingly difficult. Search engines emerged in response and are the most convenient and efficient way to find information. A general-purpose search engine, however, retrieves information from the whole web, and because of the web's huge size and the need for fast responses, its results are often unsatisfactory. A topical search engine aims to further improve the relevance of retrieved information. The object of study in this thesis is the focused crawler. The thesis first outlines the development of search engines and the current state of web crawler research, then analyzes the architecture of a topical search engine, which is divided into five components: data storage, download module, page preprocessing, page classification and link analysis, and describes the function of each component. The concrete work is as follows:
(1) Search strategy: a link scoring method is designed that combines content analysis and link analysis, using heuristic information such as the URL string, anchor text, parent pages and sibling pages.
(2) Page preprocessing, which includes word segmentation, HTML parsing and page noise elimination: building on the pruning of some nodes of the page tree, a style-tree-based noise elimination method is designed to improve page denoising.
(3) Page classification, which has two stages: feature extraction and term weight calculation. In the feature extraction stage, a new feature extraction method is designed that combines DF (document frequency), enhanced CHI (chi-square) and MI (mutual information); it effectively reduces dimensionality and improves classification quality. In the term weighting stage, a new feature scoring method is designed that combines information gain, classical TF-IDF and the weight of important tags; it is better suited to page classification.
(4) Finally, a simple focused crawler system is designed on the VC6.0 platform with SQL Server 2000, and it runs well.
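The abstract does not give the formula behind the link scoring method in item (1), so the following is only a minimal sketch of the general idea: a weighted combination of evidence from the URL string, the anchor text, the parent page and sibling pages. The weights, the `term_overlap` helper and the function names are all hypothetical and are not taken from the thesis.

```python
import re

# Hypothetical weights; the thesis abstract does not state the actual values.
WEIGHTS = {"url": 0.2, "anchor": 0.4, "parent": 0.3, "sibling": 0.1}


def term_overlap(text, topic_terms):
    """Return the fraction of topic terms that occur in the text (0.0 to 1.0)."""
    if not topic_terms:
        return 0.0
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(tokens & topic_terms) / len(topic_terms)


def score_link(url, anchor_text, parent_relevance, sibling_relevances, topic_terms):
    """Combine URL, anchor text, parent and sibling evidence into one score.

    parent_relevance and sibling_relevances are assumed to be topic-relevance
    values in [0, 1] that a page classifier has already produced.
    """
    if sibling_relevances:
        sibling_avg = sum(sibling_relevances) / len(sibling_relevances)
    else:
        sibling_avg = 0.0
    return (
        WEIGHTS["url"] * term_overlap(url, topic_terms)
        + WEIGHTS["anchor"] * term_overlap(anchor_text, topic_terms)
        + WEIGHTS["parent"] * parent_relevance
        + WEIGHTS["sibling"] * sibling_avg
    )


if __name__ == "__main__":
    topic = {"crawler", "search"}
    on_topic = score_link(
        "http://example.com/search/crawler.html", "focused crawler", 0.8, [0.5, 0.7], topic
    )
    off_topic = score_link(
        "http://example.com/sports/news.html", "football", 0.1, [], topic
    )
    print(on_topic, off_topic)
```

In a crawler built along these lines, such scores would typically order the URL frontier so that links judged more likely to lead to on-topic pages are downloaded first.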
Keywords/Search Tags: spider, focused crawler, web noise elimination