Enterprise Search Engine Based On Lucene

Posted on: 2010-12-14 | Degree: Master | Type: Thesis
Country: China | Candidate: B Wang | Full Text: PDF
GTID: 2178360278966276 | Subject: Detection Technology and Automation
Abstract/Summary:
As IT adoption has advanced, more and more enterprises have built their own intranets, in which both the volume and the variety of data grow very quickly. As a result, it is increasingly difficult for users to find the information they are actually interested in, and nearly impossible without an effective search engine.

The in-site search service offered by commercial search engines such as Google is one option. However, this kind of service is designed mainly to meet the common needs of most enterprises, so it has several deficiencies that cannot be overcome: (1) insufficient coverage: a commercial search engine never traverses a site very deeply, and its spider collects only HTML pages and cannot handle other data formats such as PDF, Word, or even plain text; (2) no real-time updates: a commercial search engine refreshes its index on a fixed cycle, so newly added data sometimes cannot be indexed in time; (3) low accuracy: because a commercial search engine collects data only through HTML pages, duplication is very difficult to avoid.

To provide a higher-quality search service, an enterprise must develop its own search engine, which we call an Enterprise Search Engine (ESE). Based on these requirements, this paper analyzes the necessity and feasibility of building an ESE, then presents a solution built by redeveloping Lucene, a small, efficient, free, open-source software project, together with other technologies such as text extraction and databases. The resulting ESE provides search over three kinds of documents: Word, PDF, and HTML.

The paper first gives a general introduction to the history, theory, and evaluation metrics of search engines. It then examines in depth the core technologies of search engines, such as Chinese word segmentation, indexing, and searching, focusing mainly on Lucene's architecture and the principles behind its analyzer, indexer, and searcher. Some peripheral technologies, such as Ajax and the DWR framework, are also investigated. Finally, the paper wraps ICTCLAS and the redeveloped Lucene into an ESE that provides full-text search over PDF, Word, and HTML documents, using PDFBox, POI, HtmlParser, Ajax, a database, and Hibernate together.
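
The pipeline summarized above (extract text per document type, split it into terms, build an index, then search) can be illustrated with a minimal sketch. This is not the thesis's implementation, which is built on Lucene, ICTCLAS, PDFBox, POI, and HtmlParser. Everything here is an assumption made for illustration: the names `EXTRACTORS`, `tokenize`, and `InvertedIndex` are hypothetical, the standard-library HTML parser stands in for the real extractors, and a simple regular-expression tokenizer stands in for a Chinese word segmenter.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_html(raw):
    """Stand-in for an HTML text extractor (the thesis uses HtmlParser)."""
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)


def extract_plain(raw):
    """Plain text needs no extraction step."""
    return raw


# Hypothetical registry: one extractor per document type. PDF and Word
# extractors (PDFBox and POI in the thesis) would be registered the same way.
EXTRACTORS = {"html": extract_html, "txt": extract_plain}


def tokenize(text):
    """Placeholder analyzer: lowercase, then split on non-word characters.

    A real Chinese analyzer would call a word segmenter here instead.
    """
    return re.findall(r"\w+", text.lower())


class InvertedIndex:
    """Maps each term to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, kind, raw):
        # Extract text according to the document type, then index each term.
        text = EXTRACTORS[kind](raw)
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: a document must contain every query term.
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

In this sketch, supporting a new document format only requires registering another extractor function; the tokenizer and the index are unchanged, which mirrors the separation between text extraction, analysis, and indexing described in the abstract.
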
Keywords/Search Tags:search engine, Lucene, Chinese word segmentation, ICTCLAS, text-extract, DWR