
Research And Design Of Multi-Source Heterogeneous Data Retrieval System

Posted on: 2021-11-14  Degree: Master  Type: Thesis
Country: China  Candidate: L Q Deng  Full Text: PDF
GTID: 2518306554966069  Subject: Computer technology
Abstract/Summary:
At present, research on massive data focuses mainly on storage, retrieval, mining and analysis, and is usually tied to one specific application scenario and one specific data source. With the rapid development and wide use of the Internet, large volumes of many kinds of data, such as tables, text, audio and video, are generated in a short time, each with its own storage format and characteristics. In practice there is also a need to store and manage this multi-source heterogeneous data in a centralized, unified way. However, the technical schemes and algorithms designed for single-source homogeneous data cannot be applied directly to multi-source heterogeneous data. Studying efficient storage and fast retrieval of multi-source heterogeneous data is therefore of great practical significance. This thesis studies the storage and retrieval of multi-source heterogeneous data and aims to provide a reference scheme for storing it efficiently and retrieving it quickly. The main contents are as follows:

(1) To solve the storage problem, the thesis first classifies the data by its characteristics, for example into table data and text data, then uses programs to convert the different forms of table data into text data and stores the result in an HBase database. To avoid unbalanced data distribution during writes, the HBase table is pre-split when it is created, and the hash of the unified ID generated by the unified data transformation is used as the Row Key, so that the data is spread evenly across the HBase regions.

(2) To solve the retrieval problem, the thesis introduces Elasticsearch, a distributed, multi-user full-text search engine, to make up for HBase's weakness in complex multi-field queries. The basic method is as follows. First, according to how often each data field in HBase is queried, an index is built in Elasticsearch for the six most frequently queried fields. Then, when the index is created, the optimal number of index shards is determined by combining the remaining storage space and the shard size of each node in the Elasticsearch cluster, and the index is created with that number of shards. Finally, the number of index shards is optimized to improve system performance in terms of data distribution, write efficiency and query latency.

The test and verification results show that the secondary-index retrieval scheme designed in this thesis, based on HBase storage plus Elasticsearch, pre-splits the data table and hashes the row keys of written data during storage. Compared with the HBase defaults, across different test data sets and cluster sizes, the data distribution is relatively balanced and write efficiency improves significantly. When an index is created, the number of index shards is determined by optimizing the coefficients related to index sharding. Compared with the system default of 5 shards, the optimized shard count can be adjusted dynamically according to the remaining space of the nodes, which effectively reduces data write time and query latency for different index data volumes and numbers of cluster nodes.
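The row-key scheme from the storage part of the abstract can be illustrated with a small, self-contained sketch. It is not the thesis's implementation: the hash function (MD5), the region count of 8, the two-hex-character split points, and the names `split_points`, `row_key` and `region_for` are all assumptions chosen for illustration, and no HBase client is involved. The sketch only simulates which pre-split region a key would fall into.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

NUM_REGIONS = 8  # assumed region count; the abstract does not state one


def split_points(num_regions):
    """Evenly spaced two-hex-char boundaries over the prefix space 00..ff."""
    step = 256 // num_regions
    return [format(i * step, "02x") for i in range(1, num_regions)]


def row_key(unified_id):
    """Hash the unified ID so that sequential IDs yield scattered row keys.

    MD5 is an assumption; the abstract only says the unified ID is hashed.
    """
    return hashlib.md5(unified_id.encode("utf-8")).hexdigest()


def region_for(key, splits):
    """Index of the pre-split region whose key range contains `key`.

    Region i covers [splits[i-1], splits[i]) in lexicographic order,
    which is how sorted row keys are assigned to regions.
    """
    return bisect_right(splits, key)


def distribution(keys, splits):
    """Count how many keys land in each region."""
    return Counter(region_for(k, splits) for k in keys)


if __name__ == "__main__":
    splits = split_points(NUM_REGIONS)
    ids = [f"record-{i:06d}" for i in range(10_000)]  # hypothetical unified IDs

    raw = distribution(ids, splits)                          # IDs used directly
    hashed = distribution([row_key(i) for i in ids], splits)  # hashed row keys

    print("raw IDs   :", dict(sorted(raw.items())))
    print("hashed IDs:", dict(sorted(hashed.items())))
```

With the hypothetical sequential IDs above, the unhashed keys all sort into a single region, while the hashed keys spread across all eight, which is the imbalance the pre-split plus hashed Row Key combination is meant to avoid. One trade-off of this design is that hashing destroys the natural ordering of IDs, so range scans over the original IDs are no longer possible on the Row Key alone.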
Keywords/Search Tags:HBase storage, Elasticsearch, data search, multi-source heterogeneous