
Research And Implementation Of The Vertical Search Engine On Lucene

Posted on: 2013-05-04
Degree: Master
Type: Thesis
Country: China
Candidate: M Zhu
GTID: 2248330377450028
Subject: Computer application technology
Abstract/Summary:
As the information on the Internet expands and its forms become increasingly diverse, general search engines must collect, index, and query ever more content, and users faced with an enormous number of returned results find it laborious to locate the information they need. The vertical search engine emerged in response: it focuses on a specific domain, offers comprehensive and in-depth data, is updated in a timely manner, and emphasizes specialization and structural analysis. How to return structured information accurately and promptly, and how to implement a vertical search engine for a particular application field, are therefore questions of practical significance.

In search engines, Chinese word segmentation has a great influence on search results, because text must be segmented both during indexing and during retrieval. After studying understanding-based, statistics-based, and dictionary-based segmentation, together with the handling of ambiguous words and unregistered (out-of-vocabulary) words, this thesis designs a segmentation algorithm that combines a dictionary with statistics. The dictionary-based part uses reverse maximum matching. The algorithm's dictionary consists of a core dictionary and a temporary dictionary. The core dictionary uses a double-hash storage structure and a lookup technique that hashes on the first character and then binary-searches over whole words, which improves lookup efficiency while keeping the structure simple and the memory footprint small. The temporary dictionary uses whole-word hashing to simplify construction and maintenance. A good statistical strategy is essential, because it is the key to resolving ambiguity and recognizing unregistered words. This thesis identifies new words, both unregistered and ambiguous ones, with a statistical strategy based on computing word frequency.
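Reverse maximum matching, the dictionary-based method named above, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name, the `max_len` limit, and the tiny sample dictionary are assumptions made for illustration, and a plain Python set stands in for the hash-based core and temporary dictionaries.

```python
def reverse_maximum_match(text, dictionary, max_len=4):
    """Segment text right to left, taking the longest dictionary word each time.

    `dictionary` is any container supporting `in` (a set here, standing in
    for the hash-based dictionaries described in the abstract).
    `max_len` is an assumed upper bound on word length in characters.
    """
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end`, then shrink it from the left.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as the fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                end -= size
                break
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


# Hypothetical sample dictionary, for illustration only.
sample_dictionary = {"研究", "研究生", "生命", "起源"}
print(reverse_maximum_match("研究生命的起源", sample_dictionary))
# → ['研究', '生命', '的', '起源']
```

Scanning from the end avoids the greedy forward match on "研究生" in this sentence, which is one reason the reverse direction is often preferred for Chinese text. Cases that matching alone cannot resolve are what the statistical strategy is meant to handle.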
Experiments show that the segmentation algorithm improves after statistical learning, with accuracy holding at about 98%. Its performance is greatly improved, and it can meet the needs of domain-specific applications provided an appropriate corpus is selected for statistics and learning.

Building on the improved Chinese word segmentation, the thesis analyzes requirements based on the characteristics of mobile phone information and the search needs for mobile phone products. A vertical search engine system for mobile phone information, running on a Tomcat server, is implemented in the Eclipse development environment using the open-source Lucene framework. The system is designed as follows. First, the open-source Heritrix crawler framework is extended with custom classes that crawl mobile phone web pages and collect mobile phone information from e-commerce sites on the Internet. Second, regular expressions and HtmlParser are used to extract web page content, and the Chinese word segmentation algorithm is added to the system to process that information; at the same time, a mobile phone information thesaurus is built, and a mobile phone information database and index structures are established so that the system can accept and answer user queries. Finally, the query results are returned to the user.

System testing shows that the mobile phone vertical search engine designed in this thesis meets user needs and can serve as a reference for applications in other fields.
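The extraction step in the design above, which applies regular expressions to crawled pages, can be illustrated with a minimal sketch. The thesis system is built on Java components (Heritrix, HtmlParser, Lucene); Python is used here only for brevity. The HTML fragment, the class names `name` and `price`, and the function `extract_phone_record` are hypothetical and do not come from the thesis.

```python
import re

# Hypothetical product-page fragment; real e-commerce markup will differ.
SAMPLE_HTML = """
<div class="product">
  <h1 class="name">Example Phone X1</h1>
  <span class="price">¥1,999.00</span>
</div>
"""

# Assumed patterns for the assumed markup above.
NAME_RE = re.compile(r'<h1 class="name">(.*?)</h1>', re.S)
PRICE_RE = re.compile(r'<span class="price">\s*¥?\s*([\d,]+(?:\.\d+)?)\s*</span>')


def extract_phone_record(html):
    """Return a structured record with the product name and numeric price.

    Fields that cannot be found in the page are set to None.
    """
    name = NAME_RE.search(html)
    price = PRICE_RE.search(html)
    return {
        "name": name.group(1).strip() if name else None,
        # Remove thousands separators before converting to a number.
        "price": float(price.group(1).replace(",", "")) if price else None,
    }


print(extract_phone_record(SAMPLE_HTML))
# → {'name': 'Example Phone X1', 'price': 1999.0}
```

Records of this kind are the sort of structured data that could then be segmented, stored, and indexed. Regular expressions are brittle against markup changes, which may be why the design pairs them with an HTML parser.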
Keywords/Search Tags: Vertical search engine, Heritrix, Lucene, HtmlParser, Chinese word segmentation algorithm