Study On Big Data Full-text Retrieval

Posted on:2015-08-02

Degree:Master

Type:Thesis

Country:China

Candidate:Y N Shi

Full Text:PDF

GTID:2298330467474089

Subject:Agricultural mechanization project

Abstract/Summary:

With the rapid development of Internet technology, human society has entered an unprecedented information age: the era of big data. In this era, the volume of data people hold is growing explosively, so the storage and analysis of large data sets have become critical. Big data is not only a matter of growing volume; the form of the data is also changing fundamentally. According to statistics, over 80% of Internet data is unstructured. Studying how to process large-scale unstructured data is therefore the necessary route to helping people quickly obtain useful information in the era of big data.

Full-text retrieval is a very important research direction in the field of information retrieval. It has an unparalleled advantage in processing unstructured data, and indexing is its core technology. This paper describes two models with different index structures, namely an external-memory B+ tree index and an external-memory linear hash index, and compares the performance of the two index models experimentally.

First, this paper introduces the research background and significance of the subject and the state of research at home and abroad on big data and full-text retrieval systems. In light of that progress, it sets out the research content, the research objectives, and the key issues to be addressed. The concept of big data, the concept of full-text retrieval, the overall architecture of the system, and the key technologies involved in a full-text retrieval system are also explained.

Second, the paper studies the design and implementation of a full-text retrieval system based on the two index structures. The system consists of three modules: the index model construction module, the storage structure design and implementation module, and the retrieval model construction module. The design ideas and implementation details of each module are described in detail. The whole system covers the source of the document collection, document preprocessing, forward index construction, block-wise inverted index construction, the structural design of the dictionary file and the index file, the buffer management mechanism, the B+ tree based inverted index implementation, the linear hash based inverted index implementation, the construction of the system's retrieval model, and so on.

Finally, the two inverted index models are compared experimentally in terms of time complexity and space complexity. Specifically, the comparative experiments cover retrieval efficiency, index maintenance efficiency, and the disk space occupied by the two index models, and the experimental results are analyzed. The results show that, per million records, queries with the linear hash index are 74.21% faster than with the B+ tree index; insertion time with the linear hash index is 2.44 times that of the B+ tree index; deletion time with the linear hash index is 83.52% of that of the B+ tree index; and the linear hash index file is 109.56% of the size of the B+ tree index file. The test results indicate that the B+ tree index builds and updates faster, while the linear hash index offers higher disk space utilization and better query performance.
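
To make the linear hash inverted index mentioned in the abstract more concrete, the following is a minimal in-memory sketch, not the thesis implementation. It assumes whitespace tokenization, keeps every bucket in RAM rather than in external memory, and omits the dictionary file, index file, and buffer management. The names `LinearHashIndex`, `build_index`, `initial_buckets`, and `max_load` are hypothetical and chosen only for illustration.

```python
class LinearHashIndex:
    """In-memory sketch of a linear hash table mapping term -> posting list."""

    def __init__(self, initial_buckets=4, max_load=2.0):
        self.n0 = initial_buckets   # bucket count at level 0 (assumed value)
        self.level = 0              # number of completed doubling rounds
        self.split = 0              # next bucket to be split in this round
        self.max_load = max_load    # average terms per bucket that triggers a split
        self.buckets = [dict() for _ in range(initial_buckets)]
        self.size = 0               # number of distinct terms stored

    def _address(self, term):
        h = hash(term)
        addr = h % (self.n0 << self.level)
        if addr < self.split:
            # This bucket was already split in the current round,
            # so the next-level hash function decides the address.
            addr = h % (self.n0 << (self.level + 1))
        return addr

    def add(self, term, doc_id):
        bucket = self.buckets[self._address(term)]
        if term not in bucket:
            bucket[term] = []
            self.size += 1
        postings = bucket[term]
        # Documents arrive in increasing doc_id order, so checking the
        # last entry is enough to avoid duplicate postings.
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
        if self.size / len(self.buckets) > self.max_load:
            self._split_one()

    def _split_one(self):
        # Split exactly one bucket: the table grows by one bucket at a time.
        old = self.buckets[self.split]
        self.buckets[self.split] = {}
        self.buckets.append({})     # lands at index split + n0 * 2**level
        modulus = self.n0 << (self.level + 1)
        for term, postings in old.items():
            self.buckets[hash(term) % modulus][term] = postings
        self.split += 1
        if self.split == self.n0 << self.level:
            # Every bucket of this round has been split: start the next round.
            self.level += 1
            self.split = 0

    def lookup(self, term):
        return self.buckets[self._address(term)].get(term, [])


def build_index(docs):
    """Build an inverted index from a list of document strings."""
    index = LinearHashIndex()
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index.add(term, doc_id)
    return index


docs = [
    "big data full text retrieval",
    "linear hash index",
    "B+ tree index for text retrieval",
]
index = build_index(docs)
print(index.lookup("index"))      # [1, 2]
print(index.lookup("retrieval"))  # [0, 2]
```

A lookup computes one bucket address and probes that bucket only, which is the usual intuition for why hash-based dictionaries answer point queries quickly, whereas a B+ tree descends from the root through several levels. An insert, by contrast, can trigger a bucket split that rewrites existing entries.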

Keywords/Search Tags:

big data, text retrieval system, B+tree index, linear hash index

Related items

1	Research Of Mixed Index Of Based On Hash、B+、3DR And B*
2	Research Of Topic Detection
3	Research On Secondary Index Technology For XML Data
4	Image Retrieval Method Based On Depth Learning Feature Extraction And Tree-hash Mixed Index
5	Study Of Indexing Techniques For Encrypted Full-Text Retrieval System
6	Research And Implementation Of Distribute Massive Text Data Index And Retrieval System
7	A Research Of Full-Text Retrieval Based On Inverted Index
8	Research Of Index In Chinese Full-text Retrieval System
9	The Approximate Query Research Of Time Series Based On Linear Hash Index
10	Multi-Core Programming For Dynamic Follow-Tree Index Of Full Text Search Research