Research On Index Management And File Pretreatment Of Distributed Full-text Retrieval System

Posted on:2016-09-10

Degree:Master

Type:Thesis

Country:China

Candidate:S J Dai

Full Text:PDF

GTID:2298330470957883

Subject:Control Science and Engineering

Abstract/Summary:

With the advent of the information age, data volumes have grown explosively and the share of unstructured information keeps increasing. Because network data is both massive and unstructured, a traditional centralized index struggles to provide efficient and reliable service, so distributed full-text retrieval technology is needed to process vast amounts of unstructured information. The main task of such a system is retrieving text data. Because queries are expressed in natural language, text must first be extracted from the various types of data on the network and segmented into words in order to build a structured index. A distributed index management mechanism is also needed to achieve index load balancing across nodes, data synchronization, and distributed queries, and to ensure information security. This thesis studies three aspects of a distributed full-text retrieval system: text extraction, segmentation of mixed-language words, and distributed index management.

This thesis designs and implements a real-time text extraction system that supports a variety of file formats. The system comprises real-time file monitoring, file type recognition, encoding recognition and conversion, and text extraction. It uses Inotify to monitor the data source, adds each file on which a write operation occurred to a task queue, identifies the file type, and runs the extraction program appropriate to that type. The system can extract text content from Office documents, PDF documents, archives, email files, web pages, and other documents, convert it to a uniform encoding, and save it as plain-text files.

This thesis also designs and implements a tool for segmenting Chinese and English words. It consists of a Chinese word segmenter, an English word segmenter, and a mixed word segmenter, all based on a Trie structure. When processing text, the mixed word segmenter is tried first; if it fails to segment a word, the Chinese or English word segmenter is used, depending on the language of the current character. The Chinese word segmenter uses a "half-minus tail" matching variant of the forward maximum matching algorithm to detect ambiguous phrases, and includes a mechanism for resolving the ambiguity. The English word segmenter combines the Porter stemming algorithm with dictionary matching to extract the root of a word. Together, the three segmenters handle text containing mixed-language words accurately and efficiently.

Finally, this thesis studies a distributed index management platform based on Katta, which manages large-scale index files and provides a search interface and an interactive page for users. We developed interface functions on top of the Katta source code and established a task management mechanism. The platform merges indexes on a timer and uses the virtual file system of ZooKeeper to resolve conflicts between index updates and client searches. We use Tomcat to build the Web server, interact with the client through JSP/Servlet technology, and optimize the search algorithms to support advanced search features such as paged queries and query conditions, and we provide the client with a simple search page. We also designed a page caching algorithm to improve the user experience. The data source is mounted on the Web server so that clients can open the original file from links in the result list. Together these measures improve search performance and address the problem of poor user experience.
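The segmentation strategy described in the abstract can be illustrated with a small sketch. It is not the thesis's implementation: it shows plain forward maximum matching over a dictionary trie (not the "half-minus tail" variant, and without ambiguity handling or Porter stemming), with a simple fallback for characters the dictionary does not cover. The names `Trie`, `longest_match`, and `segment`, and the sample dictionary, are assumptions made for illustration.

```python
class Trie:
    """Minimal dictionary trie; each node maps a character to a child node."""

    END = "$end"  # multi-character key, so it cannot collide with a single character

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self.END] = True  # a dictionary word ends at this node

    def longest_match(self, text, start):
        """Return the end index of the longest dictionary word that begins
        at `start`, or `start` itself if no dictionary word matches."""
        node = self.root
        best = start
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if self.END in node:
                best = i + 1
        return best


def segment(text, trie):
    """Forward maximum matching: at each position take the longest
    dictionary word; otherwise fall back by character class."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        end = trie.longest_match(text, i)
        if end == i:
            if text[i].isascii() and text[i].isalnum():
                # Assumed fallback: a run of ASCII letters/digits is one token.
                end = i
                while end < len(text) and text[end].isascii() and text[end].isalnum():
                    end += 1
            else:
                # Assumed fallback: an unknown character is its own token.
                end = i + 1
        tokens.append(text[i:end])
        i = end
    return tokens


# Hypothetical sample dictionary, for illustration only.
dictionary = Trie(["分布式", "全文", "全文检索", "检索", "系统"])
tokens = segment("分布式全文检索系统 Lucene", dictionary)
```

Here "全文检索" is preferred over the shorter "全文" because the matcher always takes the longest dictionary entry, which is the defining property of forward maximum matching.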

Keywords/Search Tags:

full-text retrieval, distributed computing, text extraction, Chinese word segmentation, Katta, page caching

Related items

1	Research On The Hadoop-based Distributed Full-text Retrieval And Related Technologies
2	Research And Application Of Techniques For Collection And Retrieval On Unstructured Data
3	The Research And Design Of Chinese Full Text Information Retrieval Systems Based On PSO
4	Design And Implementation Of Full-Text Retrieval System Based On Xapian
5	Research On The Distributed Indexing Platform And Information Filter In Distributed Full-text Retrieval System
6	Chinese Full Text Retrieval Based On SQL Server 2000
7	Research And Implement Of Distributed Full-text Retrieval System Based On Golang
8	Design And Improvement Of Website Full-text Retrieval System Based On Lucene
9	The Research And Implementation Of Full-text Retrieval System Based On Lucene
10	Application Study Of Lucene Full-text Retrieval On The Network Education Platform