Application Research On Full-text Electronic Archives Search Based On Lucene Chinese Segmentation

Posted on: 2016-09-22   Degree: Master   Type: Thesis
Country: China   Candidate: Z N Qu
GTID: 2308330470478585   Subject: Software engineering
Abstract/Summary:
Most electronic archive systems now provide basic information retrieval, but they typically support only keyword matching, and only over information held in the database. They cannot search information stored as files on disk. Some rely on the database's built-in full-text search, but the results rarely satisfy users. This thesis was carried out for the electronic records management system project of Panasonic. The files of this system are stored on Blu-ray discs in formats such as Word, PDF, and TXT. The project requires full-text search, but the existing search engines are not suitable for it. Lucene is an open-source full-text search engine toolkit with a complete indexing engine and search engine, so this thesis uses Lucene to develop a full-text retrieval system dedicated to the project.
The work focuses mainly on the Chinese word segmentation module and the index module. Because Lucene's support for Chinese word segmentation is weak, this thesis makes the following changes: it adopts a forward and reverse word segmentation method; it adds a part-of-speech tagging module to improve the handling of ambiguity and of out-of-vocabulary (unregistered) words; and it adds databases of personal names and place names as a linked thesaurus to further improve segmentation accuracy. Because the full-text search targets computer-related documents, the index module is optimized as follows: (1) Improve the structure of the index dictionary file. All keywords are classified as either computer-specialist vocabulary or non-specialist vocabulary, and the index dictionary file is held in memory during search, which reduces response time. (2) Set values for the index files. Lucene does not provide a function for setting the value of an indexed document, so this thesis assigns different values to the index files in order to improve the search results. (3) Change the indexing method. Lucene's indexing method performs I/O operations and its indexing efficiency is low, so this thesis uses a distributed parallel indexing method that builds the index files in a memory buffer and on disk.
Finally, the thesis presents efficiency and timing tests of the Chinese word segmentation algorithm on the resulting full-text retrieval system. The test results show that the proposed algorithm performs well in both accuracy and efficiency. The index module was also tested, and the results show that the optimized index module searches more efficiently.
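The forward and reverse word segmentation method mentioned above is commonly realised as bidirectional maximum matching. The following is a minimal sketch of that general technique, not the implementation used in the thesis. The function names, the toy dictionary, and the tie-breaking rule (prefer fewer tokens, then fewer single-character tokens) are assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len):
    """Scan left to right, taking the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def count_single_chars(tokens):
    return sum(1 for t in tokens if len(t) == 1)


def bidirectional_match(text, dictionary):
    """Run both scans and pick one result with a simple heuristic (an assumption)."""
    max_len = max((len(w) for w in dictionary), default=1)
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    # Same token count: prefer the result with fewer single-character tokens.
    if count_single_chars(fwd) < count_single_chars(bwd):
        return fwd
    return bwd


# Toy dictionary, purely illustrative.
words = {"研究", "研究生", "生命", "的", "起源"}
print(forward_max_match("研究生命的起源", words, 3))
print(bidirectional_match("研究生命的起源", words))
```

On this input the forward scan greedily takes "研究生" and strands "命" as a single character, while the reverse scan yields "研究 / 生命 / 的 / 起源"; the heuristic selects the latter. Cases where the two scans disagree are the ambiguous ones that a part-of-speech module or a name and place-name thesaurus would be expected to help resolve.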
Keywords/Search Tags: Full-text Search, Lucene, Chinese Segmentation, Index