Research On Vertical Search Engine System

Posted on: 2011-01-30
Degree: Master
Type: Thesis
Country: China
Candidate: C M Jin
GTID: 2178360305982266
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid expansion of Internet information resources, search engine technology is booming. As a web information retrieval tool, a search engine helps users find a clear retrieval path through poorly organized information and locate what they want promptly and precisely. General-purpose search engines nevertheless have unavoidable flaws: it is difficult for them to cover every subject, and even when they do, the results are far from perfect and contain a great deal of unwanted information. The vertical search engine is a new service model that addresses these problems by providing useful information within a particular domain. Because it targets a professional field and specialized knowledge, it keeps information timely and improves both the recall ratio and the precision ratio.
The thesis first analyzes the vertical search engine, designs its framework, presents the features of two open-source tools, Heritrix and Lucene, and describes three evaluation criteria for a vertical search engine system: function, performance, and search effectiveness. Search effectiveness is judged by the recall ratio and the precision ratio. The web page acquisition module, the web page preprocessing and indexing module, and the user retrieval module are then discussed in detail. The thesis implements a prototype vertical search engine for notebook PCs. The web page acquisition chapter studies breadth-first and quality-first crawling strategies, discusses the workflow and key components of Heritrix, and extends the relevant Heritrix components to implement customized acquisition logic, completing the acquisition of the raw web pages.
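The recall ratio and precision ratio used above to judge search effectiveness can be illustrated with a minimal sketch. The function name `precision_recall` and the sample document IDs are assumptions made for this example only; the thesis does not specify how the ratios were computed in its evaluation.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs returned by the search engine
    relevant:  document IDs judged relevant for the query (ground truth)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned

    # Precision: share of returned documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four pages returned, three pages actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")
```

A vertical engine restricted to one domain would be expected to return fewer off-topic pages, which raises precision, while focused crawling of domain sites aims to raise recall.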
The web page preprocessing and indexing chapter studies word segmentation, the preprocessing pipeline, and the construction and compression of inverted indexes, with index construction and compression as the key points. Lucene's word segmentation and indexing interfaces are used to build the inverted index, and MySQL is used to store records. The user retrieval chapter studies two relevance-ranking techniques, the vector space model and PageRank, and discusses multi-field search, multi-index search, and the implementation of search filters. The user retrieval interface is built on the Lucene retrieval toolkit and meets the requirements. Finally, the author suggests applying ontology technology to vertical search engines to enable a deeper understanding of context.
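The idea of building and compressing an inverted index can be sketched as follows. This is a generic illustration, not the thesis's implementation and not Lucene's on-disk format: whitespace tokenization stands in for real word segmentation, and gap encoding with variable-byte codes is assumed as the compression scheme because the abstract does not name one. All function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: whitespace splitting replaces proper word segmentation.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def vbyte_encode(numbers):
    """Encode non-negative integers in 7-bit groups, low-order group first.

    The high bit is set only on the last byte of each number.
    """
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Invert vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:  # terminating byte of the current number
            n |= (b & 0x7F) << shift
            numbers.append(n)
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers


def compress_postings(doc_ids):
    """Store the first ID, then gaps between consecutive sorted IDs."""
    if not doc_ids:
        return b""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the document IDs by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


# Hypothetical notebook-PC pages keyed by document ID.
docs = {1: "Lenovo notebook", 5: "notebook review", 300: "notebook price"}
index = build_inverted_index(docs)
packed = compress_postings(index["notebook"])
print(index["notebook"], len(packed), decompress_postings(packed))
```

Gap encoding pays off because sorted posting lists tend to have small differences between neighbouring IDs, and small numbers take fewer bytes under a variable-byte code.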
Keywords/Search Tags: vertical search engine, web page acquirement, inverted index, vector space model