
Research On Index Management And File Pretreatment Of Distributed Full-text Retrieval System

Posted on: 2016-09-10
Degree: Master
Type: Thesis
Country: China
Candidate: S J Dai
GTID: 2298330470957883
Subject: Control Science and Engineering

Abstract/Summary:
With the advent of the information age, data volumes have grown explosively, and an increasing share of that data is unstructured. Because network data is both massive and unstructured, a traditional centralized index struggles to provide efficient and reliable service, so distributed full-text retrieval technology is needed to process vast amounts of unstructured information.

The main task of a distributed retrieval system is searching text data. Because queries are expressed in natural language, text must first be extracted from the various types of data found on the network and then segmented into words so that a structured index can be built. A distributed index management mechanism is also needed to balance the index load across nodes, synchronize data, execute distributed queries, and ensure information security. This thesis studies three aspects of a distributed full-text retrieval system: text extraction, mixed-word segmentation, and distributed index management.

This thesis designs and implements a real-time text extraction system that supports a variety of file formats. The system comprises real-time file monitoring, file type recognition, encoding recognition and conversion, and text extraction. It uses Inotify to monitor the data source, adds each file on which a write operation occurred to a task queue, identifies the file type, and invokes the extraction program appropriate to that type. The system can extract text from Office documents, PDF documents, archives, email files, web pages, and other documents, convert it to a uniform encoding, and save it as plain text files.

This thesis also designs and implements a tool for segmenting Chinese and English words. It consists of a Chinese word segmenter, an English word segmenter, and a mixed word segmenter, all based on a trie structure.
When processing text, the mixed word segmenter is applied first; if it fails to segment a word, the Chinese or English word segmenter is chosen according to the language of the current character. The Chinese word segmenter uses a half-minus tail matching variant of the forward maximum matching algorithm to capture ambiguous phrases, together with a mechanism designed to resolve those ambiguities. The English word segmenter combines the Porter stemming algorithm with dictionary matching to extract the root of a word. Together, the three segmenters handle text containing mixed words accurately and efficiently.

This thesis also studies a distributed index management platform based on Katta that manages large-scale index files and provides a search interface and an interactive page for users. Interface functions were developed from the Katta source code, and a task management mechanism was established. The platform merges indexes on a timer and uses the virtual file system of ZooKeeper to resolve conflicts between index updates and client searches. The Web server is built on Tomcat and interacts with clients through JSP/Servlet technology. The search algorithms were optimized to support advanced search features such as paged queries and query conditions, and clients are given a simple search page. A page caching algorithm was also designed to enhance the user experience. The data source is mounted on the Web server so that clients can open the original file from links in the result list. Together, these measures improve search performance and address the problem of poor user experience.
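
As an illustration of the trie-based segmentation described in the abstract, the following is a minimal sketch of plain forward maximum matching over a trie. It is not the thesis's implementation: it omits the half-minus tail matching variant and the ambiguity handling, and the class names, function names, and toy dictionary are assumptions made only for this example.

```python
class TrieNode:
    """One node of the dictionary trie (hypothetical helper for this sketch)."""

    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if the path from the root spells a dictionary word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def longest_match(self, text, start):
        """Return the length of the longest dictionary word beginning at
        text[start], or 0 if no dictionary word begins there."""
        node = self.root
        best = 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                best = i - start + 1
        return best


def forward_max_match(text, trie):
    """Segment text left to right, always taking the longest dictionary word."""
    tokens = []
    pos = 0
    while pos < len(text):
        length = trie.longest_match(text, pos)
        if length == 0:
            length = 1  # unknown character: emit it as a single-character token
        tokens.append(text[pos:pos + length])
        pos += length
    return tokens


# Toy dictionary, assumed purely for illustration.
trie = Trie(["全文", "检索", "全文检索", "系统", "分布式"])
tokens = forward_max_match("分布式全文检索系统", trie)
print(tokens)  # ['分布式', '全文检索', '系统']
```

Walking the trie once per start position finds the longest match without testing every candidate substring against the dictionary, which is presumably one reason a trie suits this kind of segmenter.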
Keywords/Search Tags: full-text retrieval, distributed computing, text extraction, Chinese word segmentation, Katta, page caching