
Research On Distributed Index Construction Method For Data Space

Posted on: 2022-09-06
Degree: Master
Type: Thesis
Country: China
Candidate: P Liu
GTID: 2518306353477334
Subject: Computer Science and Technology
Abstract/Summary:
With the development of science and technology, the amount of data that data management systems must handle keeps growing, and traditional relational databases are increasingly unable to cope with it. Moreover, data rarely comes from a single source; it is distributed across many data sources whose formats and semantic relationships differ. Users must spend considerable time and I/O resources processing this data and cannot quickly obtain valuable information from multi-source heterogeneous data. Data space, a new data management model, can address these difficulties. Taking a personal data space management system as an example, users no longer need to pay attention to the complex and changeable underlying data formats and semantic relationships, and can obtain valuable information from the data directly and efficiently. Inverted indexes are widely used in practical information retrieval systems, and how to use them to quickly obtain valuable data from multi-source heterogeneous data in a data space is the focus of current index architecture research.

This thesis analyzes a variety of index architectures and proposes a distributed index construction method based on query records. The method mines users' historical query records, clusters high-frequency search terms into groups of controllable size and load according to users' query preferences, and dynamically assigns high-frequency terms and cache copies to the processor nodes according to each node's processing capability. After query records have accumulated to a certain extent, the inverted index partitioning strategy is dynamically updated and adjusted. This ensures load balance among the processor nodes of the distributed index system and improves its parallel retrieval capability.

After the inverted index has been partitioned across the processor nodes, the data held by each node continues to grow. Each node accumulates too many terms (token words) and each inverted list becomes too long, which reduces query performance. Building on an analysis of horizontal and vertical partitioning of the traditional inverted index, this thesis proposes a hybrid partition index based on frequent pattern mining. It introduces a new data structure, the dynamic frequent pattern tree, together with the corresponding creation and update adjustment algorithms, to improve the traditional frequent pattern mining algorithm FP-growth. The new structure improves the performance of the structure updates that occur when itemsets switch between frequent and infrequent as data is updated. Suitable terms are mined from the dynamic frequent pattern tree for vertical partitioning, and horizontal partitioning is then performed on top of the vertical partitioning to construct a hybrid-partitioned inverted index and improve index efficiency.
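
The abstract does not give the assignment algorithm itself, so the following is only a minimal sketch of the general idea of assigning high-frequency query terms to nodes in proportion to their processing capability. The function name `assign_hot_terms`, the greedy least-relative-load rule, the node names, and the capacity values are all assumptions made for illustration; they are not taken from the thesis, which also covers clustering, cache copies, and dynamic readjustment that this sketch omits.

```python
from collections import Counter
import heapq


def assign_hot_terms(query_log, node_capacity, top_k=3):
    """Greedily assign the top_k most frequent query terms to nodes.

    Hypothetical illustration: each term goes to the node whose
    relative load (assigned query frequency / capacity) is lowest.
    """
    # Count how often each term appears in the historical query log.
    freq = Counter(term for query in query_log for term in query.split())
    # Most frequent first; ties broken alphabetically for determinism.
    hot_terms = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

    # Min-heap keyed by relative load, then node name.
    heap = [(0.0, name) for name in sorted(node_capacity)]
    heapq.heapify(heap)
    load = {name: 0 for name in node_capacity}
    assignment = {}

    for term, count in hot_terms:
        _, name = heapq.heappop(heap)   # least relatively loaded node
        assignment[term] = name
        load[name] += count
        heapq.heappush(heap, (load[name] / node_capacity[name], name))
    return assignment, load


# Toy query log and capacities (assumed values, for illustration only).
log = [
    "data space index",
    "inverted index",
    "data index",
    "load balance index",
    "data space",
]
capacity = {"n1": 2.0, "n2": 1.0}  # n1 is assumed twice as capable as n2
assignment, load = assign_hot_terms(log, capacity, top_k=3)
print(assignment)
print(load)
```

In this toy run the more capable node ends up with roughly twice the query frequency of the weaker one, which is the kind of capability-weighted balance the abstract describes. A real system would rerun or incrementally adjust the assignment as the query log grows.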
Keywords/Search Tags: Data space, Inverted index, Load balancing, Partition index