| With the rapid development of computer technology and the Internet, ever more data has accumulated, bringing with it the problem of information overload. At the same time, unstructured documents account for a growing proportion of all data, yet there is no simple tool for retrieving them comparable to what SQL offers for databases. A full-text retrieval system builds an easily queried structure over unstructured documents so that users can obtain an ordered list of documents relevant to their query within an acceptable period of time. Full-text retrieval is therefore an important way to address information overload and has high practical value. The first part of this thesis introduces a Hadoop-based distributed full-text retrieval system. The system consists of a data receiver, an indexer, and a searcher. The data receiver receives the original documents distributed by the data source, stores them on HDFS (Hadoop Distributed File System), and then submits an indexing job to the Hadoop cluster according to the rules. The indexer builds a distributed index using MapReduce; when an indexing task completes, the indexer notifies the searcher to take over management of the index. The searcher receives and processes client queries: it searches the index blocks in a distributed fashion and merges the results from each block into the result list it returns. Testing shows that the system provides full-text retrieval functionality, but its performance needs improvement. The second part of this thesis covers the document preprocessing module, which is closely tied to the full-text retrieval system. The module comprises file type detection, character encoding detection, text extraction, and character encoding conversion. The file type detection module identifies a file's type from its magic number and filename suffix, and represents the type using the MIME standard. 
The character encoding detection module identifies the character encoding using the encoding scheme together with statistical information, such as differences in character frequencies between encoding systems. The text extraction module is plugin-based: adapter functions written for the different plugins provide a unified interface to the upper layer, and multiple processes are used to speed up text extraction. The character encoding conversion module uses libiconv to convert between ANSI and Unicode, and converts between Simplified and Traditional Chinese using Chinese word segmentation and table lookup. Tests show that all modules work correctly and meet the performance requirements. |
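
The file type detection step described above (magic number first, filename suffix as a refinement or fallback, result expressed as a MIME type) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function name `detect_mime`, the signature table, and the suffix tables are hypothetical and cover only a handful of formats chosen for illustration.

```python
import os

# Assumed, illustrative signature table: (leading magic bytes, MIME type).
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"{\\rtf", "application/rtf"),
    (b"PK\x03\x04", "application/zip"),
]

# ZIP-based formats share the ZIP magic number, so the suffix refines them.
ZIP_BASED_TYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument"
             ".wordprocessingml.document",
}

# Fallback for files without a recognizable magic number (e.g. plain text).
SUFFIX_TYPES = {
    ".txt": "text/plain",
    ".html": "text/html",
    ".htm": "text/html",
}


def detect_mime(data: bytes, filename: str = "") -> str:
    """Return a MIME type from the leading bytes, then the filename suffix."""
    suffix = os.path.splitext(filename)[1].lower()
    for magic, mime in MAGIC_SIGNATURES:
        if data.startswith(magic):
            if mime == "application/zip":
                # Container format: let the suffix pick the concrete type.
                return ZIP_BASED_TYPES.get(suffix, mime)
            return mime
    # No known magic number: fall back to the suffix, else a generic type.
    return SUFFIX_TYPES.get(suffix, "application/octet-stream")
```

In this sketch the magic number takes precedence over the suffix, so a PDF file misnamed `report.txt` is still reported as `application/pdf`; the suffix is consulted only to refine a container format or when no signature matches.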