Research On The Hadoop-based Distributed Full-text Retrieval And Related Technologies

Posted on:2015-02-19

Degree:Master

Type:Thesis

Country:China

Candidate:Y Su

Full Text:PDF

GTID:2268330431450046

Subject:Control theory and control engineering

Abstract/Summary:

With the rapid development of computer technology and the Internet, we have accumulated more and more data, which has also brought the problem of information overload. Another trend is that unstructured documents make up an increasing proportion of all data, yet we lack a simple tool, like SQL for databases, for retrieving unstructured documents. A full-text retrieval system builds an easy-to-query structure on top of unstructured documents so that users can get an ordered list of documents related to their query within an acceptable period of time. Full-text retrieval is an important way to address information overload and therefore has high application value.

The first part of this thesis introduces the Hadoop-based distributed full-text retrieval system. The system includes a data receiver, an indexer and a searcher. The data receiver receives the original documents distributed by the data source, stores them on HDFS (Hadoop Distributed File System), and then submits an indexing job to the Hadoop cluster according to the rule. The indexer creates a distributed index using MapReduce. When it completes an indexing task, the indexer notifies the searcher to manage the index. The searcher is responsible for receiving and processing clients' queries: it searches the index blocks in a distributed manner and combines the results from each index block to return the list of results. Testing shows that the system achieves the full-text retrieval function, but its performance needs improvement.

The other part of this thesis covers the document preprocessing module, which is closely associated with the full-text retrieval system. The document preprocessing module includes file type detection, character encoding detection, text extraction and character encoding conversion. The file type detection module uses the magic number and the filename suffix to identify the type of a file, and represents the file type with the MIME standard.
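As a rough illustration of file type detection by magic number with a filename-suffix fallback, the following Python sketch may help. It is not the thesis's implementation: the signature table is a small hand-picked subset, the function name `detect_mime` is hypothetical, and Python's standard `mimetypes` module stands in for whatever suffix table the thesis uses.

```python
import mimetypes

# Illustrative subset of magic numbers (leading byte signatures) mapped
# to MIME types. A real detector would carry a much larger table.
MAGIC_NUMBERS = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def detect_mime(filename, head):
    """Return a MIME type for a file given its name and leading bytes.

    The magic number is trusted first; the filename suffix is used only
    when no signature matches (plain text, for example, has none).
    """
    for signature, mime in MAGIC_NUMBERS:
        if head.startswith(signature):
            return mime
    by_suffix, _ = mimetypes.guess_type(filename)
    return by_suffix or "application/octet-stream"


if __name__ == "__main__":
    # A PDF that was renamed to .txt is still reported as a PDF.
    print(detect_mime("renamed.txt", b"%PDF-1.4\n"))
    print(detect_mime("notes.txt", b"plain text content"))
```

Checking the magic number before the suffix means a misnamed file is still classified by its content, while the suffix covers formats that have no signature.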
The character encoding detection module uses the encoding scheme and statistical information, such as differences in character frequencies between encoding systems, to identify the character encoding. The text extraction module is plugin-based: we write adaptation functions for the different plugins to provide a unified interface to the upper layer, and the module also uses multiple processes to speed up text extraction. The character encoding conversion module uses libiconv to convert between ANSI and Unicode, and converts between Simplified and Traditional Chinese using Chinese word segmentation and table lookup. The tests show that all modules work properly and meet the performance requirements.
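The encoding-scheme half of character encoding detection can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the thesis's module: the function name `detect_encoding` and the candidate list are hypothetical, Python's built-in codecs stand in for libiconv, and the statistical (character frequency) step is omitted entirely.

```python
import codecs

# Byte order marks, checked first. UTF-32 comes before UTF-16 because
# the UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Candidate encodings tried in order (an assumed list for illustration).
CANDIDATES = ["ascii", "utf-8", "gb18030", "big5"]


def detect_encoding(data):
    """Return the first candidate whose encoding scheme accepts the bytes.

    Returns None when no candidate decodes the data without error.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    for name in CANDIDATES:
        try:
            data.decode(name)  # strict decoding validates the byte sequence
        except UnicodeDecodeError:
            continue
        return name
    return None


if __name__ == "__main__":
    raw = "中文".encode("gbk")
    encoding = detect_encoding(raw)
    print(encoding)
    # Conversion to Unicode (UTF-8) once the encoding is known.
    print(raw.decode(encoding).encode("utf-8"))
```

Scheme validation alone cannot separate encodings whose valid byte ranges overlap, for example many Big5 byte sequences also decode under GB18030, so the candidate order decides such cases here. That ambiguity is what statistical information about character frequencies is meant to resolve.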

Keywords/Search Tags:

full-text retrieval, distributed computing, Hadoop, file type detection, character encoding detection and transformation, text extraction

Related items

1	Research On Index Management And File Pretreatment Of Distributed Full-text Retrieval System
2	Research On The Distributed Indexing Platform And Information Filter In Distributed Full-text Retrieval System
3	The Research And Implementation Of Full-text Retrieval System Based On Lucene
4	Full-Text Search Technology Research And Application In "2008 Olympic Games" Multi-Language System
5	Research And Application Of Full Text Retrieval Based On Hadoop
6	Research On File Preprocessing Technology In Full-text Retrieval System
7	The Research And Development Of Distributed Web Text Retrieval System Based On Hadoop
8	Design And Implementation Of Commercial Bank Big Data Retrieval Platform Based On ELK
9	Chinese Full Text Retrieval Based On SQL Server 2000
10	Research On Full-Text Retrieval Technology For The Single Chinese Character