Research And Implementation Of The Vertical Search Engine On Lucene

Posted on:2013-05-04

Degree:Master

Type:Thesis

Country:China

Candidate:M Zhu

Full Text:PDF

GTID:2248330377450028

Subject:Computer application technology

Abstract/Summary:

As the volume of information on the Internet grows and its forms diversify, general-purpose search engines must collect, index and query ever more content, and users faced with enormous result sets find it laborious to locate the information they need. Vertical search engines emerged in response: they focus on a specific domain, offer comprehensive and in-depth data, are updated in a timely manner, and emphasize specialization and structured analysis. How to return structured information accurately and promptly, and how to implement a vertical search engine for a particular application domain, are therefore questions of practical significance.

Chinese word segmentation strongly influences search results, because text must be segmented both during indexing and during query retrieval. After studying understanding-based, statistics-based and dictionary-based segmentation, together with the handling of ambiguous words and unregistered (out-of-vocabulary) words, this thesis designs a segmentation algorithm that combines a dictionary with statistics. The dictionary-based component uses reverse maximum matching. The algorithm's dictionary consists of a core dictionary and a temporary dictionary. The core dictionary uses a double-hash storage structure, with hash lookup on the first character followed by whole-word binary search, which improves lookup efficiency while keeping the structure simple and the memory footprint small. The temporary dictionary uses whole-word hashing to simplify construction and maintenance. A good statistical strategy is essential, since it is the key to resolving ambiguity and handling unregistered words. The thesis identifies new words, including unregistered and ambiguous words, with a statistical strategy based on word-frequency counts.
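The dictionary-plus-statistics idea described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: a plain Python set stands in for both the hashed core dictionary and the temporary dictionary, and the names `reverse_max_match`, `learn_new_words`, `max_len` and `min_count`, as well as the rule of promoting frequent runs of unknown single characters, are assumptions made for illustration only.

```python
from collections import Counter


def reverse_max_match(text, dictionary, max_len=4):
    """Segment text right to left, preferring the longest dictionary word."""
    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end`, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()
    return words


def learn_new_words(sentences, dictionary, min_count=2):
    """Collect candidate new words for a temporary dictionary.

    Hypothetical frequency rule: a run of adjacent single characters
    that are not in the dictionary is a candidate, and it is kept
    once it has been seen at least `min_count` times.
    """
    counts = Counter()
    for sentence in sentences:
        run = ""
        # The trailing "" is a sentinel that flushes the last run.
        for token in reverse_max_match(sentence, dictionary) + [""]:
            if len(token) == 1 and token not in dictionary:
                run += token
            else:
                if len(run) > 1:
                    counts[run] += 1
                run = ""
    return {word for word, n in counts.items() if n >= min_count}
```

With the dictionary {"研究", "研究生", "生命", "起源", "的"}, `reverse_max_match("研究生命的起源", ...)` yields 研究 / 生命 / 的 / 起源, whereas a left-to-right longest match would first take 研究生. Scanning from the right is one common way to reduce this kind of ambiguity in Chinese text.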
Experiments show that the segmentation algorithm improves after statistical learning, with accuracy holding at about 98%. Given an appropriate corpus for statistics and learning, its performance is sufficient for domain-specific applications.

Building on the improved Chinese word segmentation, the thesis analyzes the characteristics of mobile phone information and the requirements for searching mobile phone products. A vertical search engine for mobile phone information, running on a Tomcat server, is implemented in the Eclipse development environment using the open-source Lucene framework. The system is designed as follows. First, the open-source Heritrix crawler framework is extended with custom classes that crawl mobile phone web pages and collect mobile phone information from e-commerce sites on the Internet. Second, regular expressions and HtmlParser are used to extract web page content, which is then processed by the system's Chinese word segmentation algorithm; at the same time, a mobile phone information thesaurus is built, and a mobile phone information database and index structures are established so that the system can receive and answer user queries. Finally, the query results are returned to the user.

System testing shows that the mobile phone vertical search engine designed in this thesis meets user needs and can serve as a reference for vertical search engines in other fields.
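The extract, index and query stages of the pipeline can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the thesis's Java system: the HTML layout matched by `FIELD_PATTERN`, the field names `brand`, `model` and `price`, and the in-memory inverted index (standing in for a Lucene index) are all hypothetical, and whitespace tokenization stands in for the Chinese word segmentation step.

```python
import re
from collections import defaultdict

# Hypothetical page layout; real e-commerce pages differ per site.
FIELD_PATTERN = re.compile(
    r'<span class="(?P<field>brand|model|price)">(?P<value>[^<]*)</span>'
)


def extract_phone_record(html):
    """Pull structured fields out of one product page with a regex."""
    return {m.group("field"): m.group("value").strip()
            for m in FIELD_PATTERN.finditer(html)}


def build_index(records):
    """Map each lowercase token to the ids of records containing it."""
    index = defaultdict(set)
    for doc_id, record in enumerate(records):
        for value in record.values():
            for token in value.lower().split():
                index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of records that contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

Extracting each page into a record before indexing is what lets a vertical engine return structured fields rather than raw page text. A production system would replace the dictionary-of-sets index with a real index library and the whitespace split with a proper analyzer.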

Keywords/Search Tags:

Vertical search engine, Heritrix, Lucene, HtmlParser, Chinese word segmentation algorithm

Related items

1	Design And Implementation Of A Job Vertical Search Engine Based On Lucene And Heritrix
2	Vertical Search Engine For Mobile Phone Information
3	Research And Implementation Of Vertical Search Engine On AEP Based On Lucene
4	Research And Implementation Of The Vertical Search Engine System Based On JAVA With LUCENE And HERITRIX
5	The Design And Implement Of A Vertical Search Engine Based On Second-filtering
6	Research Heritrix And Vertical Search Engine Based On Lucene
7	Research And Implementation Of Vertical Search Engine For E-Commerce
8	Research On Key Technology Of Vertical Search Engine
9	Design And Implementation Of Vertical News Search Engine Based On Heritrix
10	The Research And Design Of The Vertical Search Engine For The Family Medicine In Common Use