Search Engine Optimization Based On Lucene

Posted on: 2012-12-11
Degree: Master
Type: Thesis
Country: China
Candidate: Z Wen
Full Text: PDF
GTID: 2178330335950626
Subject: Software engineering
Abstract/Summary:
With the rapid development of network technology, the amount of data on the Internet has grown explosively. We have entered the information age, and finding relevant information in this huge volume of data is becoming increasingly difficult for ordinary users. Search engines are the common way to cope with this problem. A search engine usually follows a fixed strategy: a dedicated program first collects data from the Internet; once a user enters a keyword query, the engine retrieves the matching data and presents the relevant results to the user. Because search engine development spans many fields, such as databases, information retrieval, artificial intelligence, and natural language processing, many commercial companies are unwilling to make their search engines freely available.

Lucene is an open-source full-text search engine toolkit. It is a search engine framework that provides a query processing engine, an indexing engine, and part of a text analysis engine. During my internship, my main task was to design a website search engine, which gave me the opportunity to study full-text search technology, and Lucene in particular. Because the Lucene framework alone could not satisfy the requirements of our website search engine, I redesigned and implemented several modules and added them to Lucene: a Chinese word segmentation module, a Chinese word matching module, and a search result sorting module.

1. In the Chinese word segmentation module, we provide an optimized design of the dictionary mechanism, which uses a simple dictionary to narrow the search range for a string and thereby improves retrieval efficiency.
2. In the Chinese word matching module, we implement a hash-based Chinese word matching algorithm. It first builds the Chinese word segments from the dictionary, then uses prefix matching to find the target word. Compared with the original matching in Lucene, the new algorithm is more efficient.
3. In the search result sorting module, we propose a hybrid method to improve user satisfaction with the sorted results. It combines the position weighting algorithm, the PageRank algorithm, and the direct hit algorithm. The experimental results show that the hybrid method is more practical.
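The abstract does not give the details of the dictionary mechanism or the matching algorithm, so the following is only a minimal Python sketch of one common way to combine the two ideas: dictionary words are hashed by their first character to narrow the search range, and the longest dictionary word that prefixes the remaining text is taken at each position (forward maximum matching). The function names `build_index` and `segment` and the sample dictionary are assumptions made for illustration, not the thesis's actual design.

```python
from collections import defaultdict


def build_index(words):
    """Hash dictionary words by first character (hypothetical layout).

    Returns the buckets and, per first character, the longest word length,
    so a lookup only scans words that can start at the current position.
    """
    buckets = defaultdict(set)
    max_len = defaultdict(int)
    for word in words:
        if word:
            buckets[word[0]].add(word)
            max_len[word[0]] = max(max_len[word[0]], len(word))
    return buckets, max_len


def segment(text, buckets, max_len):
    """Forward maximum matching: take the longest dictionary prefix each time."""
    tokens = []
    i = 0
    while i < len(text):
        first = text[i]
        candidates = buckets.get(first, ())
        # Longest prefix worth trying, bounded by the text that remains.
        longest = min(max_len.get(first, 1), len(text) - i)
        for size in range(longest, 1, -1):
            if text[i:i + size] in candidates:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            # No multi-character word matched: emit the single character.
            tokens.append(first)
            i += 1
    return tokens
```

With the hypothetical dictionary {搜索, 搜索引擎, 引擎, 优化}, `segment("搜索引擎优化", *build_index([...]))` yields `["搜索引擎", "优化"]`: the longest prefix 搜索引擎 wins over the shorter 搜索, and only the bucket for the current first character is consulted.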
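The abstract names the three signals in the hybrid sorting method but not how they are combined. One plausible reading, sketched below in Python, is a weighted sum of a position-weighted term score, a precomputed PageRank value, and a click-popularity score in the spirit of Direct Hit. The field weights, the mixing weights, the `Page` type, and the normalizations are all hypothetical values chosen for illustration, not figures from the thesis.

```python
from dataclasses import dataclass

# Hypothetical weights; the source gives no concrete values.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}
MIX = {"position": 0.5, "pagerank": 0.3, "direct_hit": 0.2}


@dataclass
class Page:
    url: str
    title: str
    body: str
    pagerank: float  # assumed precomputed and scaled to [0, 1]
    clicks: int      # assumed click count for this query (direct hit signal)


def position_score(page, term):
    """Weight term occurrences by the field they appear in, squashed to [0, 1)."""
    raw = sum(w * getattr(page, field).count(term)
              for field, w in FIELD_WEIGHTS.items())
    return raw / (1.0 + raw)


def rank(pages, term):
    """Sort pages by a weighted sum of the three signals, best first."""
    max_clicks = max((p.clicks for p in pages), default=0) or 1

    def score(p):
        return (MIX["position"] * position_score(p, term)
                + MIX["pagerank"] * p.pagerank
                + MIX["direct_hit"] * p.clicks / max_clicks)

    return sorted(pages, key=score, reverse=True)
```

A linear mix keeps each signal's contribution easy to inspect and tune. Under these assumed weights, a page with the query term in its title and many clicks can outrank a page with a higher PageRank but a weaker match.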
Keywords/Search Tags: Search Engine, Lucene, Chinese word segmentation, index, sorting algorithm