
The Research And Design On Vertical Search Engine Based On Lucene

Posted on: 2010-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: H Xu
GTID: 2178360278481266
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet and the World Wide Web, online resources have become increasingly abundant, and people rely more and more on the Internet for study and research. To help users find useful information in this large volume of content, various Internet-based information retrieval services have emerged and developed rapidly. Today, people search the Internet primarily through Baidu, Google, and other general-purpose search engines. These engines are powerful and meet most users' needs, but they fall short for specialized topics. Vertical search engines emerged specifically to solve this problem.

First, this thesis discusses the significance and architecture of vertical search engines and studies their core technologies, including subject relevance judgment, Chinese word segmentation, and web page ranking.

Second, the Lucene package is studied in depth. The Chinese word segmentation that Lucene uses is compared with KTDictSeg, and KTDictSeg, which performs better, is then used to segment the words of the extracted documents.

Finally, a vertical search engine for Witkey information is designed. The system has three modules: a topical spider module, an information extraction module, and a search and index module. In the topical spider module, the general Shark Search algorithm is adopted to handle unprocessed URLs. In the information extraction module, HtmlParser is used to extract information from the crawled web pages. In the search and index module, an improved solution is designed to address a defect of the Document score method used by Lucene, which cannot reflect the importance of the positions of the web pages. This solution combines the basic Document score method and the positions of the web pages with the characteristics of the documents themselves, and it improves the precision of ranking and searching.
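
To make the topical spider step more concrete, the following is a minimal sketch of how a Shark Search style crawler might prioritize unprocessed URLs. It follows one commonly described formulation of Shark Search (an inherited score decayed from the parent page, blended with a neighborhood score from the anchor text and its surrounding context). It is not taken from the thesis: the function names, the `Frontier` class, the whitespace-style tokenizer, and the parameter values `delta`, `beta`, and `gamma` are all assumptions chosen for illustration, and a real system would use a proper Chinese word segmenter instead of the regular expression below.

```python
import heapq
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; a stand-in for a real word segmenter."""
    return re.findall(r"\w+", text.lower())


def cosine(query, text):
    """Cosine similarity between the term-frequency vectors of two strings."""
    a, b = Counter(tokens(query)), Counter(tokens(text))
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def potential_score(query, parent_text, parent_inherited, anchor_text,
                    anchor_context, delta=0.5, beta=0.8, gamma=0.5):
    """Score an unvisited child URL; returns (potential, inherited).

    delta, beta and gamma are illustrative values, not the thesis's settings.
    """
    # Inherited part: a relevant parent passes on its own similarity,
    # an irrelevant parent passes on what it inherited, decayed by delta.
    parent_sim = cosine(query, parent_text)
    inherited = delta * parent_sim if parent_sim > 0 else delta * parent_inherited

    # Neighborhood part: anchor text first, then the text around the anchor.
    anchor = cosine(query, anchor_text)
    context = 1.0 if anchor > 0 else cosine(query, anchor_context)
    neighborhood = beta * anchor + (1 - beta) * context

    return gamma * inherited + (1 - gamma) * neighborhood, inherited


class Frontier:
    """Priority queue of unprocessed URLs, highest potential score first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:  # never queue the same URL twice
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score
```

In this sketch, a crawler would call `potential_score` for every link found on a fetched page, push the result into the `Frontier`, and always fetch the URL returned by `pop` next, so links whose anchors look on-topic are visited before unrelated ones.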
Keywords/Search Tags:Vertical Search, Topical Spider, Lucene, Information Retrieval, Chinese Word Segmentation