Study On Big Data Full-text Retrieval

Posted on: 2015-08-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y N Shi
GTID: 2298330467474089
Subject: Agricultural mechanization project
Abstract/Summary:
With the rapid development of Internet technology, human society has entered an unprecedented information age: the era of big data. In this era, the amount of data people hold is growing explosively, so storing and analyzing large volumes of data has become critical. Big data is not just growth in data volume; the form of the data is also undergoing fundamental change. According to statistics, over 80% of Internet data is unstructured. Studying how to process large-scale unstructured data is therefore essential to helping people obtain valid information quickly in the era of big data.

Full-text search is a very important research direction in the field of information retrieval. It has an unparalleled advantage in processing unstructured data, and indexing is its core technology. This paper describes two models with different index structures, namely a B+ tree index based on external memory and a linear hash index based on external memory, and compares the performance of the two index models experimentally.

First, this paper introduces the research background and significance of the subject and the state of research at home and abroad on big data and full-text retrieval systems. Building on that domestic and foreign research progress, it sets out the research content, the research objectives, and the key issues to be addressed. It also elaborates the concept of big data, the concept of full-text search, the system's overall architecture design, and the key technologies involved in a full-text retrieval system.

Second, the paper studies the design and implementation of a full-text retrieval system based on two different index structures. The entire retrieval system consists of three modules: the index model construction module, the storage structure design and implementation module, and the retrieval model construction module. The design ideas and implementation details of each module are described in detail.
The whole system covers the source of the document collection, document preprocessing, forward index construction, block-wise inverted index construction, the structural design of the dictionary file and the index file, the buffer management mechanism, the B+ tree-based inverted index implementation, the linear hash-based inverted index implementation, the construction of the system's retrieval model, and so on.

Finally, the two inverted index structures are compared experimentally in terms of time complexity and space complexity. Specifically, the comparative experiments cover retrieval efficiency, index maintenance efficiency, and the disk space occupied by the two index models, and the experimental results are analyzed. The results show that, per million records, queries with the linear hash index are 74.21% faster than with the B+ tree index; insertion with the linear hash index takes 2.44 times as long as with the B+ tree index; deletion with the linear hash index takes 83.52% of the time of the B+ tree index; and the linear hash index file is 109.56% of the size of the B+ tree index file. From these results, the B+ tree index has faster index build and update rates, while the linear hash index has higher disk space utilization and better query performance.
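
To make the linear hash inverted index mentioned above more concrete, the following is a minimal in-memory sketch of linear hashing used as a term-to-postings dictionary. It is not the thesis's implementation: the thesis works in external memory with dictionary files, index files, and buffer management, none of which is modeled here. The class name `LinearHashIndex`, its parameters `initial_buckets` and `max_load`, and the use of Python's built-in `hash` are all assumptions made for illustration.

```python
class LinearHashIndex:
    """Illustrative in-memory linear hash: term -> list of document ids."""

    def __init__(self, initial_buckets=2, max_load=2.0):
        self.n0 = initial_buckets   # bucket count at level 0 (assumed parameter)
        self.level = 0              # number of completed doubling rounds
        self.split = 0              # index of the next bucket to split
        self.max_load = max_load    # average terms per bucket that triggers a split
        self.buckets = [{} for _ in range(initial_buckets)]
        self.size = 0               # number of distinct terms stored

    def _address(self, term):
        h = hash(term)
        n = self.n0 << self.level   # bucket count at the start of this round
        b = h % n
        if b < self.split:          # bucket already split: use the finer hash
            b = h % (2 * n)
        return b

    def add(self, term, doc_id):
        bucket = self.buckets[self._address(term)]
        if term not in bucket:
            bucket[term] = []
            self.size += 1
        bucket[term].append(doc_id)
        # Grow by one bucket at a time instead of rehashing everything.
        if self.size / len(self.buckets) > self.max_load:
            self._split_one()

    def _split_one(self):
        n = self.n0 << self.level
        old = self.buckets[self.split]
        keep, moved = {}, {}
        for term, postings in old.items():
            # Under the finer hash a term either stays or moves to split + n.
            if hash(term) % (2 * n) == self.split:
                keep[term] = postings
            else:
                moved[term] = postings
        self.buckets[self.split] = keep
        self.buckets.append(moved)  # lands at index split + n
        self.split += 1
        if self.split == n:         # every bucket of this round has been split
            self.level += 1
            self.split = 0

    def lookup(self, term):
        return self.buckets[self._address(term)].get(term, [])


# Usage: index a few toy documents term by term.
docs = ["big data retrieval", "linear hash index", "b+ tree index", "big index"]
index = LinearHashIndex()
for doc_id, text in enumerate(docs):
    for term in text.split():
        index.add(term, doc_id)
print(index.lookup("index"))
```

In this sketch a lookup touches exactly one bucket, while an insert may also trigger a bucket split. That is in line with the direction of the reported results (faster queries, slower inserts for the linear hash), although the sketch itself measures nothing.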
Keywords/Search Tags: big data, text retrieval system, B+ tree index, linear hash index