Research And Application Of Efficient Retrieval Technology In Big Data Environment

Posted on: 2018-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: S J Ruan
GTID: 2348330518496937
Subject: Electronics and Communications Engineering
Abstract/Summary:
Information retrieval technology is the key to helping people find what they care about in massive data. If the amount of data before the age of big data was a lake, it is now an ocean, and finding what one wants is like looking for a needle in a haystack. In recent years, research on mass data retrieval has focused on distributed technology and full-text retrieval technology. This thesis focuses on NoSQL technology, distributed memory file systems, and full-text retrieval technology as the key techniques for storage and information retrieval in a big data environment. Several distributed techniques are combined with full-text retrieval technology to form the foundation of an efficient retrieval system for mass data. A Chinese word segmentation algorithm is proposed to segment Chinese text efficiently. Network traffic audit logs, produced when government departments and enterprises monitor Internet traffic, have themselves become mass data. This thesis applies the system to real audit data collected from public service venues and companies, providing massive log storage and retrieval services. The contents of the thesis are as follows:

1. Research and analysis of efficient storage technology for mass data, including NoSQL technology and distributed memory systems. The thesis examines the key technologies, framework, and implementation of Alluxio, described as the world's first distributed memory file system, and combines NoSQL technology, distributed memory technology, and Hadoop into a model capable of efficient storage.

2. Research on the key algorithms of full-text retrieval and proposal of a Chinese word segmentation algorithm. The thesis analyzes the key algorithms of full-text retrieval systems, studies Lucene and Elasticsearch, and compares existing Chinese segmentation algorithms from various aspects. It proposes a Chinese word segmentation algorithm that segments words using a dictionary built on a double-array Trie, with rule-based ambiguity elimination and N-Gram-based identification of out-of-vocabulary words. The performance of this algorithm is compared with that of existing word segmentation algorithms.

3. An efficient storage and retrieval system for big data. The system is designed using multiple distributed techniques and the Chinese word segmentation algorithm. It consists of a storage module based on HBase, Hadoop, and Alluxio; a data retrieval module based on the Chinese word segmentation algorithm, Elasticsearch, and a Redis cluster; and a module that synchronizes data between the two.

4. Application of the efficient storage and retrieval system to massive audit logs. The thesis demonstrates efficient storage and retrieval on real enterprise data and describes the concrete implementation of each module. The experiments show that the system handles the storage and retrieval of massive audit log data in a big data environment well. The system offers high availability, scalability, efficient data input and output, and support for multiple query types.
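
To make the idea of dictionary-based segmentation concrete, here is a minimal Python sketch. It is not the algorithm proposed in the thesis: it assumes a plain dictionary-backed trie rather than a double-array Trie, uses forward maximum matching as the matching strategy, and omits both the rule-based ambiguity elimination and the N-Gram out-of-vocabulary step. The names `TrieNode`, `build_trie`, `longest_match`, and `segment` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List


class TrieNode:
    """One node of a plain (not double-array) trie over characters."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


def build_trie(words: Iterable[str]) -> TrieNode:
    """Insert every dictionary word into a trie and return its root."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def longest_match(root: TrieNode, text: str, start: int) -> int:
    """Return the end index of the longest dictionary word starting at
    `start`, or `start` itself if no dictionary word begins there."""
    node = root
    end = start
    for i in range(start, len(text)):
        node = node.children.get(text[i])
        if node is None:
            break
        if node.is_word:
            end = i + 1
    return end


def segment(root: TrieNode, text: str) -> List[str]:
    """Forward maximum matching: repeatedly take the longest dictionary
    word; characters not covered by the dictionary become single tokens."""
    tokens: List[str] = []
    pos = 0
    while pos < len(text):
        end = longest_match(root, text, pos)
        if end == pos:
            end = pos + 1  # unknown character: emit it on its own
        tokens.append(text[pos:end])
        pos = end
    return tokens
```

With a toy dictionary such as `build_trie(["大数据", "数据", "环境", "检索", "技术"])`, calling `segment` on "大数据环境检索技术" prefers the longer entry "大数据" over "数据". A double-array Trie would store the same transitions in two flat integer arrays, which is why it is commonly chosen for large dictionaries.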
Keywords/Search Tags:full-text retrieval, NoSQL technology, Chinese segmentation algorithm, distributed memory file system