Research Of Main Technologies Of Vertical Search Engine

Posted on: 2011-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: J P Fang
GTID: 2178360305462014
Subject: Computer system architecture
Abstract/Summary:
With the explosive growth of information on the Internet, general-purpose search engines fail to satisfy the specialized query needs of particular users, and vertical search engines have emerged in response. This thesis investigates several key technologies of vertical search engines: focused crawling, web page purification, and web information extraction. First, a focused crawler based on one-class document classification is designed. One of the one-class document classification algorithms, the prototype algorithm, is improved by using the least squares method to solve for the optimal prototype vector, and the experiments show that the improved prototype algorithm significantly improves classification accuracy. Second, two page purification algorithms based on the DIV_DOM model are proposed: a heuristic page purification algorithm based on the DIV_DOM model, and a page purification algorithm based on the DIV template tree. Both are applied to the focused crawler to evaluate their effect on its performance, and the experiments show that the latter purifies web pages better than the former. Finally, building on the proposed web purification algorithms, a further DOM-based technique for extracting structured information from web pages is proposed, with an inductive learning algorithm used to generate extraction rules automatically; the experiments show that the proposed information extraction algorithm is effective.
Keywords/Search Tags:Vertical Search Engine, Focused Crawler, One-class Document Classification, Web Page Purification, Web Information Extraction
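
The abstract does not give the formulation of the improved prototype algorithm, so the following is only a minimal sketch of the general idea of a one-class prototype classifier for a focused crawler, under stated assumptions. It assumes plain term-frequency vectors, a prototype chosen to minimize the sum of squared distances to the on-topic training vectors (the closed-form least squares solution of that problem is their mean), and a cosine-similarity threshold. The function names, the toy vocabulary, and the threshold of 0.5 are all hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def tf_vector(text, vocab):
    """Term-frequency vector of `text` over a fixed vocabulary (assumed representation)."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocab]


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fit_prototype(vectors):
    """Prototype p minimizing sum_i ||x_i - p||^2, which is the mean of the x_i."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / n for j in range(dim)]


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


def is_on_topic(text, vocab, prototype, threshold=0.5):
    """One-class decision: accept a page if it is close enough to the prototype."""
    return cosine(tf_vector(text, vocab), prototype) >= threshold


# Toy data (hypothetical): only positive, on-topic examples are needed for training.
vocab = ["crawler", "search", "engine", "index", "recipe", "football"]
positives = [
    "focused crawler for vertical search engine",
    "search engine index and crawler",
    "vertical search engine index",
]
prototype = fit_prototype([normalize(tf_vector(d, vocab)) for d in positives])

on_topic = is_on_topic("a crawler feeds the search engine index", vocab, prototype)
off_topic = is_on_topic("football recipe", vocab, prototype)
```

In a focused crawler, a check like `is_on_topic` would decide whether a fetched page is kept and whether its outgoing links are followed. Only on-topic examples are required for training, which is the appeal of the one-class setting when off-topic pages are too varied to sample.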