
Research On Topic-Oriented Web Crawler Based On Page Analysis

Posted on: 2011-02-27 | Degree: Master | Type: Thesis
Country: China | Candidate: H Y Zhang
GTID: 2178360305981696 | Subject: Computer Science and Technology
Abstract/Summary:
With the rapid growth of Web resources, it has become increasingly difficult to quickly and accurately find comprehensive information relevant to the topic of a user's query. Demands on search quality and speed keep rising, yet a traditional general-purpose crawler that traverses the entire Web covers so broad a range of subjects that it cannot guarantee the timeliness or relevance of the information it collects, and so cannot meet users' precise search requirements in specific domains. This thesis therefore studies the topic-oriented Web crawler, which is designed to keep results timely and relevant to the subject.
After surveying the types, working principles, and development of existing Web crawlers, the thesis compares the structure and working principles of the traditional Web crawler and the topic-oriented Web crawler, and shows the advantages of the latter over the inherent defects of the former. Building on an analysis of the conventional Vector Space Model (VSM) and existing algorithms for computing the relevance between page content and a subject, the thesis introduces "HowNet" semantic relevance and semantic analysis theory, and then presents an improved VSM based on semantic analysis and the characteristic structure of Web pages.
The thesis focuses on combining and improving semantic analysis and the VSM. It integrates word sense disambiguation, relevance computation, and extraction of the page's sememe set into the VSM. It also analyses the semi-structured Web page and points out that feature items in different positions of a page differ in their ability to express the page content. On this basis it presents an improved, Web-page-based VSM that uses position values and a multilayer VSM to partition the page into N parts and compute the relevance of each part weighted by its position value.
The new model with semantic analysis is better suited to computing Web page relevance in a topic-oriented Web crawler, and improves the crawler's accuracy, utilization, and efficiency.
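The idea of partitioning a page into N parts and weighting each part's relevance by its position can be sketched roughly as below. This is only an illustration under stated assumptions, not the thesis's implementation: the part names (`title`, `headings`, `body`), the weight values in `POSITION_WEIGHTS`, the plain term-frequency vectors, and the function names are all hypothetical, and the HowNet-based semantic steps (word sense disambiguation, sememe set extraction) are omitted entirely.

```python
import math
from collections import Counter

# Hypothetical position weights; the abstract gives no actual values.
# They sum to 1 here so the final score stays within [0, 1].
POSITION_WEIGHTS = {"title": 0.5, "headings": 0.3, "body": 0.2}


def term_vector(text):
    """Build a bag-of-words term-frequency vector (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty part or empty topic contributes nothing
    return dot / (norm_u * norm_v)


def page_relevance(parts, topic, weights=POSITION_WEIGHTS):
    """Weighted sum of per-part cosine similarities to the topic description.

    `parts` maps a part name to the text found in that position of the page;
    parts missing from the page are treated as empty.
    """
    topic_vec = term_vector(topic)
    return sum(
        weight * cosine(term_vector(parts.get(name, "")), topic_vec)
        for name, weight in weights.items()
    )


if __name__ == "__main__":
    page = {
        "title": "topic oriented web crawler",
        "headings": "vector space model relevance",
        "body": "the crawler scores each page against the topic before following links",
    }
    print(page_relevance(page, "web crawler topic relevance"))
```

A crawler built along these lines would compare the returned score against a threshold to decide whether to keep a page and follow its links; the threshold, like the weights, would need to be tuned and is not specified in the abstract.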
Keywords/Search Tags: Web Crawler, Theme, VSM, Relevance