
Research On The Distributed Indexing Platform And Information Filter In Distributed Full-text Retrieval System

Posted on: 2016-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y P Wang
GTID: 2298330470457780
Subject: Control Science and Engineering
Abstract/Summary:
With the rapid development of the Internet, every area of social life is surrounded by data and information, and people's daily behavior is closely tied to the Internet. People use it to browse news, to communicate with others, and to edit documents, and the data generated by these actions is stored online. Because of the changes brought by the Internet and big data, people acquire a large amount of information every day, and the ways they access it have become more complex and diverse. Distributed computing and full-text retrieval are effective tools for handling large-scale data: distributed computing addresses the storage of large data sets, and full-text retrieval helps people find useful information in large-scale data accurately and quickly.
This thesis studies a distributed full-text retrieval system that can store a large number of files in multiple formats and supports full-text retrieval. The system uses a distributed architecture for file preprocessing, indexing, and storage, and all files are stored in a distributed file system. It consists of the following components: a file preprocessing module, a distributed indexing platform, a distributed file storage system, an index management platform that supports searching, and a web search platform. The file preprocessing module and the distributed indexing platform are responsible for indexing. The index management platform and the web search platform are responsible for index file management and retrieval. The distributed file storage system is responsible for file storage and management.
This thesis studies the distributed indexing platform, which is built on the Hadoop distributed computing library and can build index files for massive collections of text and documents concurrently.
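As a rough illustration of how index construction can be split into independent units of work, the following minimal sketch builds an inverted index in a map/reduce style using plain Python. It does not use the Hadoop or Lucene APIs and is not the thesis's implementation; the function names (`map_phase`, `reduce_phase`, `build_index`), the tokenization rule, and the sample documents are all assumptions made for this example.

```python
from collections import defaultdict
import re


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in a document."""
    # Assumed tokenization: lowercase runs of ASCII letters and digits.
    for term in set(re.findall(r"[a-z0-9]+", text.lower())):
        yield term, doc_id


def reduce_phase(pairs):
    """Group document ids by term into sorted posting lists."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def build_index(docs):
    """Run the map step over every document, then reduce the emitted pairs."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    return reduce_phase(pairs)


# Hypothetical sample documents, keyed by document id.
docs = {1: "Hadoop builds index files", 2: "Lucene index and Hadoop"}
index = build_index(docs)
```

Because each `map_phase` call reads only its own document, the calls can run on separate workers, which is the property a distributed indexing platform relies on to process many documents concurrently.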
This thesis also studies the basic structure and modules of the distributed indexing platform, mainly the data flow, the runtime and speed, the concurrency and sharing mechanisms, and the index storage mechanism. In addition, it studies the information filtering structure in the file preprocessing module, which filters files based on keywords. The information filtering structure comprises a single-pattern matching structure, a multi-pattern matching structure, and an And-Or expression matching structure. The basic algorithms associated with each structure have been improved and have passed performance tests.
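The abstract does not describe the improved matching algorithms, so the following is only a minimal sketch of what keyword filtering with an And-Or expression can look like. It assumes expressions are written as nested tuples such as `("and", "index", ("or", "hadoop", "lucene"))`, and it uses Python's built-in substring search as a stand-in for a dedicated single-pattern or multi-pattern matching algorithm. The names `matches` and `filter_files` and the sample files are hypothetical.

```python
def matches(expr, text):
    """Evaluate an And-Or keyword expression against a piece of text.

    A leaf is a keyword string; an inner node is a tuple whose first
    element is "and" or "or" and whose remaining elements are operands.
    """
    if isinstance(expr, str):
        # Stand-in for a real pattern matching algorithm: case-insensitive
        # substring search.
        return expr.lower() in text.lower()
    op, *operands = expr
    if op == "and":
        return all(matches(e, text) for e in operands)
    if op == "or":
        return any(matches(e, text) for e in operands)
    raise ValueError(f"unknown operator: {op!r}")


def filter_files(files, expr):
    """Return the names of the files whose text satisfies the expression."""
    return [name for name, text in files.items() if matches(expr, text)]


# Hypothetical rule: must mention "index" and either "hadoop" or "lucene".
rule = ("and", "index", ("or", "hadoop", "lucene"))
files = {
    "a.txt": "Hadoop index job",
    "b.txt": "Index notes",
    "c.txt": "Lucene tutorial",
}
kept = filter_files(files, rule)
```

With these sample inputs only `a.txt` satisfies the rule. In a real filter, the leaf test would be replaced by a multi-pattern matcher so that all keywords are located in a single pass over each file.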
Keywords/Search Tags: full-text retrieval, distributed computing, index, Lucene, Hadoop, information filtering, pattern matching algorithm