
Research On Thematic Extraction And Relevant Degree Algorithm Of Vertical Search Engine

Posted on: 2008-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Duan
GTID: 2178360278455756
Subject: Computer application technology
Abstract/Summary:
Search engines are the most important tools people use to find information on the World Wide Web, and they are key to research on and use of the Internet. However, as the volume and variety of information on the Web grow explosively, retrieving relevant information becomes increasingly difficult. Because of the complexity of web pages, general-purpose search engines find it harder and harder to meet users' demands. Specialized, topic-oriented vertical search engine technology has therefore become a focus of research.

This dissertation focuses on technology for accessing and retrieving topic-specific Chinese web information, and designs and implements a Computer Technical Literature Searcher (CTLS).

The dissertation surveys the state of search engine research and development, analyzes the main problems of current specialized search engines, and discusses the principal shortcomings of their search strategies. To address ambiguity in Chinese word segmentation, it proposes a preprocessing method for segmenting Chinese sentences and implements an improved MM algorithm based on that preprocessing. As a result, the segmentation system performs better than the plain MM algorithm in the mechanical segmentation phase.

To obtain a good strategy for selecting search paths for the web spider of a vertical search engine, the dissertation puts forward a non-greedy V-Page-Rank search algorithm. The algorithm dynamically adjusts the spider's download direction and gives high priority to pages that are likely to contain topic-relevant content, thereby achieving the specialization and customization of the search engine.
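The idea of downloading likely on-topic pages with high priority can be illustrated with a small crawl frontier. The sketch below is not the V-Page-Rank algorithm, whose scoring is not given in this abstract; it is plain best-first ordering and does not model the non-greedy aspect. The class name `CrawlFrontier`, the example URLs, and the relevance scores are all hypothetical.

```python
import heapq
import itertools


class CrawlFrontier:
    """Best-first URL frontier: the highest-scoring URL is popped first."""

    def __init__(self):
        self._heap = []                  # entries: (-score, order, url)
        self._order = itertools.count()  # tie-breaker that keeps insertion order
        self._seen = set()               # URLs that were already queued once

    def push(self, url, score):
        """Queue a URL with an estimated topical relevance score.

        Returns False if the URL was queued before, True otherwise.
        """
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._order), url))
        return True

    def pop(self):
        """Return (url, score) for the most promising URL still waiting."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


# Hypothetical scores; a real spider would estimate them from page content.
frontier = CrawlFrontier()
frontier.push("http://example.com/sports", 0.10)
frontier.push("http://example.com/compilers", 0.90)
frontier.push("http://example.com/databases", 0.75)

crawl_order = [frontier.pop()[0] for _ in range(len(frontier))]
```

In a real spider the score of each newly discovered link would be recomputed as pages are downloaded, which is where a topic-aware ranking such as the one the dissertation describes would plug in.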
Considering the differences between vertical and traditional search engines, the dissertation takes both the content and the structure of web pages into account and proposes IVSM, a self-adaptive classification algorithm based on the vector space model, for relevance filtering in information retrieval.

The dissertation also puts forward a focused crawling algorithm based on web page slicing, which overcomes the difficulty of searching multi-topic web pages and removes noisy text, so that the heuristic information for web crawling can be collected accurately.

A relatively complete design scheme for a vertical search engine is proposed, and a vertical search engine system for the computer domain, called CTLS, is implemented. The dissertation also describes the design of a distributed Robot system for collecting topic-specific resources.

Using CTLS as an illustration, the dissertation sums up the experience gained in researching and developing a computer-oriented vertical search engine, indicates the system's application prospects, and points out directions for future research.
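The vector space model that underlies the relevance filtering mentioned in the abstract can be illustrated with a basic cosine-similarity check. This is a minimal sketch of the standard model only, not of IVSM, whose self-adaptive weighting and use of page structure are not described here. It assumes pages are already segmented into tokens; the function names, the example tokens, and the 0.3 threshold are hypothetical.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Represent a document as raw term frequencies."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated
    return dot / (norm_a * norm_b)


def is_relevant(page_tokens, topic_tokens, threshold=0.3):
    """Keep a page if it is close enough to the topic description."""
    score = cosine_similarity(term_vector(page_tokens), term_vector(topic_tokens))
    return score >= threshold


# Hypothetical, already-segmented token lists.
topic = ["search", "engine", "crawler", "index"]
on_topic_page = ["search", "engine", "index", "ranking"]
off_topic_page = ["football", "match", "score"]

on_topic_score = cosine_similarity(term_vector(on_topic_page), term_vector(topic))
```

A production filter would normally weight terms (for example with TF-IDF) instead of using raw counts; raw counts are used here only to keep the arithmetic easy to follow.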
Keywords/Search Tags:vertical search engine, thematic extraction, relevant degree, IVSM, V-Page-Rank, focused slice crawling