Research On Index Models For Big Data Query

Posted on:2017-05-15

Degree:Master

Type:Thesis

Country:China

Candidate:C Y Zhu

Full Text:PDF

GTID:2308330485480014

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of computer and Internet technologies, the volume of data has grown rapidly and the types of data have become extremely rich. Traditional data models and index technologies can no longer satisfy the requirements of big data management. Research on index technology that builds on traditional index concepts while addressing the characteristics and requirements of big data has therefore become an important topic.

One characteristic of big data is Variety: the data in organizations is no longer just traditional structured relational data, but also includes large amounts of unstructured data from web pages, social media, e-mail, and other sources. Because the two types of data are heterogeneous, they are often stored and processed separately. In many application systems, however, there is a large amount of interrelated heterogeneous data. When users search for such data, an index mechanism is urgently needed to provide fast, unified access to both structured and unstructured data. Previous research on index technology has usually targeted a single kind of data, whereas work on heterogeneous data is scarce. As a result, a comprehensive indexing mechanism for querying massive heterogeneous data is still lacking.

In addition to Variety, another obvious characteristic of big data is Volume. To store large volumes of data, many excellent distributed storage and management systems have emerged, such as Google’s distributed file system GFS, Yahoo’s PNUTS, and Hadoop’s HDFS. However, most of them support only simple primary-key-based queries and, lacking an essential index mechanism, cannot efficiently support other kinds of user queries such as range queries and non-primary-key queries. Therefore, to satisfy users’ diverse query demands and improve query processing efficiency, research on index technology for massive data has become an urgent challenge.

To address the two problems and challenges presented above, this thesis makes the following contributions:

(1) It proposes an associated index model to solve the problem of unified query over massive heterogeneous data. The index mechanism establishes relationships between structured and unstructured data through their descriptions of the same entities, and those entities are then used as keywords to create the index. The index structure adopts the RDF metadata form, which is widely used on the web, to describe the correspondence between entities and structured and unstructured resources. To reduce the redundancy of the associated index layer and locate relevant resources quickly, the model also introduces a secondary index layer, which consists of two separate indexes: a B+ tree index for the structured database and an inverted index for unstructured documents. The associated index model offers a good solution to the problem of index separation and presents a unified interface for hybrid queries over heterogeneous data. The experimental results show that the index system not only supports hybrid queries on heterogeneous data effectively but also improves the accuracy of query results.

(2) It proposes a two-level bitmap index model, which applies the concise bitmap index scheme to the big data environment. The index model uses the parallel computing framework MapReduce to build a block-level bitmap index and a record-level bitmap index for data stored in the distributed file system. The block-level bitmap index acts as a global bitmap, indicating the presence of an attribute value in each block, so that irrelevant blocks need not be queried. The record-level bitmap index acts as a local bitmap, indicating the distribution of attribute values within a block, which helps filter out irrelevant records and locate the target tuples quickly. This index scheme avoids reading unnecessary data at both levels, which effectively improves the processing efficiency of massive data. The experimental results show that the index mechanism requires little time and space overhead and clearly outperforms an environment without an index.
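
The two-level scheme in contribution (2) can be illustrated with a minimal in-memory sketch. This is not the thesis's implementation: it builds both bitmaps sequentially in one process rather than with MapReduce, it models blocks as plain Python lists rather than files in a distributed file system, and the function names, record layout, and sample data are all hypothetical. Bitmaps are stored as Python integers, where bit i marks block i (global level) or record offset i (local level).

```python
from collections import defaultdict


def build_two_level_index(blocks, attr):
    """Build block-level and record-level bitmaps for one attribute.

    blocks: list of blocks, each a list of dict records (assumed layout).
    Bitmaps are Python ints; bit i set means block/record i holds the value.
    """
    block_bitmap = defaultdict(int)   # value -> bitmap over block ids
    record_bitmaps = []               # per block: value -> bitmap over offsets
    for block_id, block in enumerate(blocks):
        local = defaultdict(int)
        for offset, record in enumerate(block):
            local[record[attr]] |= 1 << offset
        # A value is present in this block if it has a local bitmap here.
        for value in local:
            block_bitmap[value] |= 1 << block_id
        record_bitmaps.append(dict(local))
    return dict(block_bitmap), record_bitmaps


def set_bits(bitmap):
    """Yield the positions of the set bits in ascending order."""
    position = 0
    while bitmap:
        if bitmap & 1:
            yield position
        bitmap >>= 1
        position += 1


def query(blocks, block_bitmap, record_bitmaps, value):
    """Return (matching records, number of blocks actually read)."""
    results = []
    blocks_read = 0
    # Level 1: visit only blocks whose global bit is set for this value.
    for block_id in set_bits(block_bitmap.get(value, 0)):
        blocks_read += 1
        # Level 2: inside the block, jump straight to matching offsets.
        for offset in set_bits(record_bitmaps[block_id][value]):
            results.append(blocks[block_id][offset])
    return results, blocks_read


# Hypothetical sample data: three blocks of two records each.
blocks = [
    [{"id": 1, "city": "Jinan"}, {"id": 2, "city": "Beijing"}],
    [{"id": 3, "city": "Shanghai"}, {"id": 4, "city": "Beijing"}],
    [{"id": 5, "city": "Jinan"}, {"id": 6, "city": "Jinan"}],
]
block_bitmap, record_bitmaps = build_two_level_index(blocks, "city")
matches, blocks_read = query(blocks, block_bitmap, record_bitmaps, "Jinan")
print([r["id"] for r in matches], blocks_read)  # [1, 5, 6] 2
```

In this sample, the query for "Jinan" skips the middle block entirely because its bit is clear in the block-level bitmap, and within the two remaining blocks only the offsets flagged by the record-level bitmaps are touched.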

Keywords/Search Tags:

big data, index, heterogeneous data, bitmap index

Related items

1	Bitmap Index As Effective Indexing For Low Cardinality Columns In Data Warehouse
2	Research On Bitmap Index Technology And Application For Massive Data
3	Research On Bitmap Index In Data Warehouse
4	Research And Implementation Of Index Selection Strategy In DWMS
5	RDF Data Storage And Management Based On Compressed Bitmap Index
6	Combining Segmentation Graphs And B+ Tree Cloud Data Indexing Mechanism Research
7	Research And Implementation Of The Bitmap Index In Column-Oriented Data Warehouse
8	Research Of Bitmap Index In Data Warehouse
9	Calculating data warehouse aggregates using range-encoded bitmap index
10	Indexing Techniques Based On Olap Data Warehouse Research