Research On Key Techniques Of Vertical Search Engine Based On Lucene

Posted on: 2012-03-30
Degree: Master
Type: Thesis
Country: China
Candidate: D J Deng
Full Text: PDF
GTID: 2178330335452620
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid growth of web information, general search engines, whose search scope is the entire web, update their information slowly and return varied and excessive results. Such results cannot satisfy users with specific needs, who require accurate and in-depth results. Vertical search engines have emerged to meet these domain-specific demands: they focus on obtaining the information of a specified domain and provide retrieval services for that information.

This paper first introduces the structure of vertical search engines, describing their working process and characteristics, and surveys the current state of research on related technologies. It then analyzes and discusses several key technologies of vertical search engines: the working process and search strategies of topical crawlers, the algorithm for extracting topical information from web pages, and the ranking algorithm for web pages. For topical information extraction, HTML web pages are divided into several blocks according to their structural features. The number of words in each block is counted to locate the regions where text is most concentrated, and these regions are taken as the main text of the page, from which the topical information is extracted. Experiments and data analysis show that the proposed extraction algorithm achieves better accuracy.

This paper analyzes the existing Weighted Term Frequency Position algorithm, the HITS algorithm and the PageRank algorithm, and derives an improved PageRank algorithm. The improved algorithm uses cosine similarity to measure the similarity of linked web pages and adds a time factor that reflects the age of web pages.
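The abstract does not give the formula of the improved PageRank algorithm, so the following Python sketch is only one plausible reading of the description above. It assumes that the weight of a link is the cosine similarity of the two pages' term vectors multiplied by an exponential freshness factor for the target page, and that these weights replace the uniform split of the original PageRank. The function names, the `decay` parameter and the sample pages are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def improved_pagerank(links, terms, age_days, damping=0.85, decay=0.01,
                      iterations=50):
    """Similarity- and time-weighted PageRank (assumed form, not the thesis formula).

    links:    dict page -> list of pages it links to (every target is also a key)
    terms:    dict page -> Counter of term frequencies
    age_days: dict page -> age of the page in days
    """
    pages = list(links)
    n = len(pages)

    # Assumed edge weight: topical similarity times freshness of the target.
    weights = {}
    for u in pages:
        raw = {
            v: cosine_similarity(terms[u], terms[v]) * math.exp(-decay * age_days[v])
            for v in links[u]
        }
        total = sum(raw.values())
        # Normalise per source page; a page whose weights are all zero is
        # treated like a page without out-links (dangling).
        weights[u] = {v: w / total for v, w in raw.items()} if total > 0 else {}

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by dangling pages is redistributed uniformly.
        dangling = sum(rank[u] for u in pages if not weights[u])
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for u in pages:
            for v, w in weights[u].items():
                new_rank[v] += damping * rank[u] * w
        rank = new_rank
    return rank


# Hypothetical three-page graph: "hub" links to one on-topic and one off-topic page.
links = {"hub": ["on_topic", "off_topic"], "on_topic": ["hub"], "off_topic": ["hub"]}
terms = {
    "hub": Counter("lucene index search engine".split()),
    "on_topic": Counter("lucene search ranking".split()),
    "off_topic": Counter("football search scores".split()),
}
age_days = {"hub": 0, "on_topic": 0, "off_topic": 0}
rank = improved_pagerank(links, terms, age_days)
print(sorted(rank, key=rank.get, reverse=True))
```

In this sketch the off-topic page receives a smaller share of the hub's rank because it shares fewer terms with the hub, which is one way to limit topic drift. Raising `age_days` for a page lowers the weight of links pointing to it, which is one way to counter the advantage of old pages.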
The improved PageRank algorithm uses not only the link structure of web pages but also the similarity of linked pages, which avoids two disadvantages of the original PageRank algorithm: topic drift and the bias toward old pages. Experiments show that the improved PageRank algorithm greatly improves ranking quality.

Finally, this paper analyzes and discusses the full-text retrieval toolkit Lucene, including its system architecture, indexing mechanism, searching mechanism and scoring mechanism. On this basis, it designs and implements a small vertical search engine prototype, built with the Lucene toolkit, for teaching and learning resources on a campus network. The prototype uses Heritrix to collect information and Lucene to implement the index and search modules. To meet the actual requirements of this search engine, the Chinese word segmentation of Lucene is extended with the Paoding word segmentation tool. Office documents (Word, PowerPoint and Excel) are parsed with Apache POI and PDF documents with Xpdf; plain-text (txt) and HTML documents are parsed as well. In addition, the prototype extends the scoring mechanism of Lucene and improves the ranking of web pages by using the improved PageRank algorithm. Tests show that the vertical search engine achieves the expected goals.
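The block-based extraction of topical information summarized in the abstract (split the page into blocks, count the words in each block, keep the densest regions) can be illustrated with a minimal Python sketch. It is not the thesis implementation. The set of block-level tags, the `ratio` threshold and the whitespace word count are assumptions made for illustration; Chinese pages would need a word segmenter instead of `split()`.

```python
from html.parser import HTMLParser

# Assumed block boundaries; the thesis does not list the tags it uses.
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockSplitter(HTMLParser):
    """Splits an HTML page into text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text blocks
        self._current = []    # text fragments of the block being built
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def _flush(self):
        # Collapse whitespace and store the block if it contains any text.
        text = " ".join(" ".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html: str, ratio: float = 0.5) -> str:
    """Keeps the blocks whose word count is at least `ratio` times the largest count."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    if not splitter.blocks:
        return ""
    counts = [len(block.split()) for block in splitter.blocks]
    threshold = ratio * max(counts)
    kept = [b for b, c in zip(splitter.blocks, counts) if c >= threshold]
    return "\n".join(kept)


# Hypothetical page with navigation, two content paragraphs and a footer.
sample_html = """
<html><body>
<div class="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<div class="content">
<p>Lucene is a full-text retrieval toolkit that provides indexing and searching of documents.</p>
<p>A vertical search engine collects pages from one domain and builds an index over them.</p>
</div>
<div class="footer">Copyright notice</div>
<script>var x = "not text";</script>
</body></html>
"""
main_text = extract_main_text(sample_html)
print(main_text)
```

Short blocks such as the navigation bar and the footer fall below the threshold and are dropped, while the two long paragraphs are kept as the main text. A fixed ratio is the simplest possible selection rule; a real extractor would likely also consider link density and the position of blocks.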
Keywords/Search Tags: Vertical Search Engine, Lucene, extracting topical information in web pages, ranking algorithm of web pages