Research On Index Models For Big Data Query

Posted on: 2017-05-15
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Zhu
Full Text: PDF
GTID: 2308330485480014
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of computer and Internet technologies, the volume of data has grown rapidly and the types of data have become extremely diverse. Traditional data models and index technologies can no longer satisfy the requirements of big data management. Research on index technology that builds on traditional index concepts while addressing the characteristics and requirements of big data has therefore become an important topic.

One characteristic of big data is Variety: the data in organizations is no longer just traditional structured relational data, but also includes large amounts of unstructured data from web pages, social media, e-mail, and similar sources. Because the two types of data are heterogeneous, they are often stored and processed separately. However, many application systems contain a large amount of interrelated heterogeneous data. When users search for such data, an index mechanism is urgently needed to provide fast, unified access to both structured and unstructured data. Previous research on index technology has usually targeted a single kind of data, whereas work on heterogeneous data is scarce. A comprehensive indexing mechanism for querying massive heterogeneous data is therefore still lacking.

In addition to Variety, another obvious characteristic of big data is Volume. To store large volumes of data, many excellent distributed storage and management systems have emerged, such as Google’s distributed file system GFS, Yahoo’s PNUTS, and Hadoop’s HDFS. However, most of them support only simple primary-key-based queries and, lacking the essential index mechanism, cannot efficiently support other kinds of user queries such as range queries and non-primary-key queries. Therefore, to satisfy users’ diverse query demands and improve query processing efficiency, research on index technology for massive data has become an urgent challenge.

To address these two problems, this thesis makes the following contributions:

(1) It proposes an associated index model to solve the problem of unified querying over massive heterogeneous data. The index mechanism establishes relationships between structured and unstructured data through their descriptions of the same entities, and those entities are then used as keywords to build the index. The index structure adopts the RDF metadata format, which is widely used on the web, to describe the correspondence between entities and structured and unstructured resources. To reduce redundancy in the associated index layer and locate relevant resources quickly, the model also introduces a secondary index layer. The secondary index layer consists of two separate indexes: a B+tree index for the structured database and an inverted index for unstructured documents. The associated index model addresses the problem of index separation and presents a unified interface for hybrid queries over heterogeneous data. The experimental results show that the index system not only supports hybrid queries on heterogeneous data effectively, but also improves the accuracy of query results.

(2) It proposes a two-level bitmap index model, which applies a concise bitmap index scheme to the big data environment. The index model uses the parallel computing framework MapReduce to build a block-level bitmap index and a record-level bitmap index for data stored in the distributed file system. The block-level bitmap index acts as a global bitmap, indicating the presence of an attribute value in each block, so that irrelevant blocks are not queried. The record-level bitmap index acts as a local bitmap, indicating the distribution of attribute values within a block, which helps filter out irrelevant records and locate the target tuples quickly. By avoiding unnecessary reads at both levels, this index scheme effectively improves the processing efficiency of massive data. The experimental results show that the index mechanism requires little time and space overhead and clearly outperforms an environment without an index.
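The associated index model in contribution (1) can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration, not the thesis implementation: the sample data, the predicate names `ex:hasRecord` and `ex:mentionedAs`, the helper names, and the use of a sorted key list with binary search as a stand-in for a real B+tree. RDF-style triples link each entity to a structured row and to a search term, and the two secondary indexes resolve those links.

```python
import bisect
from collections import defaultdict


class SortedKeyIndex:
    """Stand-in for a B+tree: sorted primary keys searched with bisect."""

    def __init__(self, rows):
        self._rows = rows
        self._keys = sorted(rows)

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._rows[key]
        return None


def build_inverted_index(docs):
    """Map each lower-cased term to the set of documents containing it."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            inverted[term].add(doc_id)
    return inverted


def unified_query(entity, triples, key_indexes, inverted):
    """Follow the entity's triples into both secondary indexes."""
    rows, doc_ids = [], set()
    for subject, predicate, obj in triples:
        if subject != entity:
            continue
        if predicate == "ex:hasRecord":        # link to a structured row
            table, key = obj
            row = key_indexes[table].get(key)
            if row is not None:
                rows.append(row)
        elif predicate == "ex:mentionedAs":    # link to unstructured documents
            doc_ids |= inverted.get(obj, set())
    return {"rows": rows, "docs": sorted(doc_ids)}


# Hypothetical sample data.
employees = {101: {"name": "Alice", "dept": "R&D"},
             102: {"name": "Bob", "dept": "Sales"}}
docs = {"d1": "Alice presented the index design",
        "d2": "Bob replied to Alice by email",
        "d3": "Quarterly sales report"}

# Associated index layer: RDF-style (subject, predicate, object) triples.
triples = [
    ("entity:alice", "ex:hasRecord", ("employees", 101)),
    ("entity:alice", "ex:mentionedAs", "alice"),
    ("entity:bob", "ex:hasRecord", ("employees", 102)),
    ("entity:bob", "ex:mentionedAs", "bob"),
]

key_indexes = {"employees": SortedKeyIndex(employees)}
inverted = build_inverted_index(docs)
alice = unified_query("entity:alice", triples, key_indexes, inverted)
```

In this sketch a single call returns both the matching database row and the documents that mention the entity, which is the kind of unified interface the model aims for. The associated layer stores only a key or a term per entity, and the secondary indexes do the actual lookup.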
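The two-level bitmap index in contribution (2) can be sketched in a few lines. This is a minimal single-process illustration under stated assumptions: plain Python loops stand in for the MapReduce jobs, Python integers serve as bitmaps, and the function names, the `city` attribute, and the sample blocks are all hypothetical rather than taken from the thesis.

```python
from collections import defaultdict


def build_two_level_index(blocks, attr):
    """Build block-level and record-level bitmaps for one attribute.

    blocks is a list of blocks; each block is a list of dict records.
    """
    block_bitmap = defaultdict(int)   # value -> bitmap over blocks (global)
    record_bitmaps = []               # per block: value -> bitmap over records
    for b, block in enumerate(blocks):
        local = defaultdict(int)
        for r, record in enumerate(block):
            value = record[attr]
            local[value] |= 1 << r          # record r in this block holds value
            block_bitmap[value] |= 1 << b   # block b contains value at least once
        record_bitmaps.append(dict(local))
    return dict(block_bitmap), record_bitmaps


def query_equal(blocks, index, value):
    """Return matching records and the number of blocks actually opened."""
    block_bitmap, record_bitmaps = index
    global_bits = block_bitmap.get(value, 0)
    hits, blocks_read = [], 0
    for b, block in enumerate(blocks):
        if not (global_bits >> b) & 1:
            continue                        # block-level filter: skip the block
        blocks_read += 1
        local_bits = record_bitmaps[b][value]
        for r, record in enumerate(block):
            if (local_bits >> r) & 1:       # record-level filter
                hits.append(record)
    return hits, blocks_read


# Hypothetical data: three blocks of two records each.
blocks = [
    [{"id": 1, "city": "Beijing"}, {"id": 2, "city": "Jinan"}],
    [{"id": 3, "city": "Shanghai"}, {"id": 4, "city": "Shanghai"}],
    [{"id": 5, "city": "Jinan"}, {"id": 6, "city": "Beijing"}],
]
index = build_two_level_index(blocks, "city")
hits, blocks_read = query_equal(blocks, index, "Jinan")
```

For the value "Jinan" the block-level bitmap marks only the first and third blocks, so the middle block is never opened, and the record-level bitmaps then pick out the matching record inside each remaining block.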
Keywords/Search Tags: big data, index, heterogeneous data, bitmap index