Enterprise Search Engine Based On Lucene

Posted on:2010-12-14

Degree:Master

Type:Thesis

Country:China

Candidate:B Wang

Full Text:PDF

GTID:2178360278966276

Subject:Detection Technology and Automation

Abstract/Summary:

As IT adoption advances, more and more enterprises have built their own intranets, in which the volume and variety of data grow very quickly. Consequently, it is increasingly difficult for users to find the information they are really interested in, and almost impossible without an effective search engine.

The in-site search service offered by commercial search engines such as Google is one option. However, this kind of service is designed mainly to satisfy the common demands of most enterprises, so some deficiencies cannot be overcome: (1) Limited coverage: a commercial search engine never traverses a site very deeply, and its spider collects only HTML pages and cannot handle other data formats such as PDF, Word, or even plain text. (2) No real-time updates: a commercial search engine refreshes its index on a certain cycle, so newly added data sometimes cannot be indexed in time. (3) Low accuracy: as noted above, a commercial search engine collects data only through HTML pages, which makes duplication very difficult to avoid.

To provide a higher-quality search service, an enterprise must develop its own search engine, which we call an Enterprise Search Engine (ESE). Based on this demand, this paper analyzes the necessity and feasibility of building an ESE, then provides a solution by redeveloping Lucene, a small, efficient, free, open-source software project, and combining it with other technologies such as text extraction and databases. Finally, it builds an ESE that provides search over three kinds of documents: Word, PDF, and HTML.

First, this paper gives a general introduction to the history, theory, and evaluation indicators of search engines. It then presents in-depth research on the core technologies of search engines, such as Chinese word segmentation, indexing, and searching, with the main focus on Lucene's architecture and the theory of its analyzer, indexer, and searcher. Some peripheral technologies, such as Ajax and the DWR framework, have also been studied. Finally, this paper wraps ICTCLAS and the redeveloped Lucene into an ESE that provides full-text search over PDF, Word, and HTML documents by combining several technologies, including PDFBox, POI, HtmlParser, Ajax, a database, and Hibernate.
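
The analyzer, indexer, and searcher stages named in the abstract can be illustrated with a minimal in-memory sketch. This is not Lucene's API and not the thesis's implementation: the names `analyze` and `InvertedIndex` and the sample documents are hypothetical, the tokenizer handles only ASCII words (Chinese text would need a segmenter such as ICTCLAS in its place), and the search is a plain boolean AND with no ranking.

```python
import re
from collections import defaultdict


def analyze(text):
    """Analyzer stage: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Indexer and searcher stages over an in-memory inverted index."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, text):
        """Indexer stage: record every term of the document in the postings."""
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Searcher stage: return sorted ids of documents containing all query terms."""
        terms = analyze(query)
        if not terms:
            return []
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)


# Hypothetical documents; in the thesis the text would first be extracted
# from PDF, Word, or HTML files before reaching the analyzer.
index = InvertedIndex()
index.add("report.pdf", "Quarterly sales report for the search team")
index.add("notes.doc", "Meeting notes about the sales forecast")
index.add("home.html", "Welcome to the intranet search page")

hits = index.search("sales report")
print(hits)
```

Keeping the analyzer separate from the index mirrors the division described in the abstract: swapping in a different tokenizer changes how text is split without touching the indexing or search logic.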

Keywords/Search Tags:

search engine, Lucene, Chinese word segmentation, ICTCLAS, text-extract, DWR

Related items

1	The Research And Implementation Of Full-Text Search Engine Based On Lucene
2	The Research And Implementation Of Enterprise Search Engine Based On Lucene
3	Research And Design Of Search Within Application System Based On Lucene
4	Research And Implementation Of A Chinese Full-Text Information Retrieval Technology Based-on Lucene Search Engine
5	The Research And Application On Lucene And Chinese Word Segmentation
6	Chinese Natural Language Search Engine Based On Lucene
7	Research And Implementation Of Subject-oriented Mobile Search Engine Based On Lucene
8	The Research And Implementation Of Search Engine Based On LUCENE
9	Research On Vertical Search Engine Based On SSH And Lucene
10	Research And Application Of Full-text Retrieval Technology Based On Lucene