"Luder" Content-Based Document Search Engine

Posted on: 2008-01-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Dou
GTID: 2178360212486062
Subject: Computer application technology

Abstract/Summary:
Advances in Web search technology have made it much faster for users to find useful information on the vast Internet and have greatly promoted the sharing of Internet resources. Meanwhile, local document resources, which are just as important as those on the Web, have been neglected, and because of their particular nature they cannot be shared through the Web. To find a relevant document in the local file system, users have to browse directories back and forth and open files to read them, which greatly reduces search efficiency and leaves local document resources poorly used. This thesis studies and builds a full-text desktop document search engine. It is based on the open-source search framework Lucene and re-encapsulates Lucene's core functions. It can search local documents in multiple formats and gives the user a global view of desktop document resources. A portable user interface built on the SWT GUI library makes interaction with the user convenient. A dictionary-based segmentation module named "MandarinAnalyzer" is built into the system. It supports both Chinese and English segmentation and performs maximum matching of Chinese words up to five characters long, which addresses Lucene's weak support for Chinese. To support most popular document formats, parsers for multiple formats are added to the system to extract the text from documents. The system effectively solves the problem of document search in desktop applications: it supports content-based search and improves both the efficiency and the speed of searching, so that desktop document resources can be used effectively.
Keywords/Search Tags: Search Engine, Desktop Searching, Full-Text Retrieval, Inverted Index, Document Format, Lucene, Chinese Segmenting
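
The dictionary-based segmentation mentioned in the abstract can be illustrated with a minimal sketch of forward maximum matching. This is not the thesis's MandarinAnalyzer, whose source is not shown here. The class name `MaxMatchSegmenter`, the sample dictionary, and the reading of "five" as a cap of five characters per word are all assumptions made for illustration, and the sketch uses only the Java standard library rather than Lucene's analyzer API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of forward maximum matching; not the thesis's code.
final class MaxMatchSegmenter {
    // Assumption: dictionary words are at most five characters long.
    static final int MAX_WORD_LEN = 5;

    private final Set<String> dictionary;

    MaxMatchSegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            // Skip whitespace and punctuation.
            if (!Character.isLetterOrDigit(c)) {
                i++;
                continue;
            }
            // English path: group a run of ASCII letters/digits into one lowercased token.
            if (isAsciiAlnum(c)) {
                int start = i;
                while (i < text.length() && isAsciiAlnum(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i).toLowerCase(Locale.ROOT));
                continue;
            }
            // Chinese path: try the longest candidate first, then shrink.
            int maxLen = Math.min(MAX_WORD_LEN, text.length() - i);
            int matched = 1; // fall back to a single character if nothing matches
            for (int len = maxLen; len > 1; len--) {
                if (dictionary.contains(text.substring(i, i + len))) {
                    matched = len;
                    break;
                }
            }
            tokens.add(text.substring(i, i + matched));
            i += matched;
        }
        return tokens;
    }

    private static boolean isAsciiAlnum(char c) {
        return c < 128 && Character.isLetterOrDigit(c);
    }

    public static void main(String[] args) {
        // Sample dictionary chosen for this example only.
        Set<String> dict = Set.of("全文", "检索", "全文检索", "桌面", "搜索引擎");
        MaxMatchSegmenter segmenter = new MaxMatchSegmenter(dict);
        System.out.println(segmenter.segment("Lucene 全文检索 桌面搜索引擎"));
    }
}
```

Trying the longest candidate first is what makes "全文检索" come out as one token rather than "全文" followed by "检索". Characters with no dictionary match fall back to single-character tokens, so every part of the text still reaches the index.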