
Research On Key Technologies Of Integration And Query In Dataspace

Posted on: 2017-12-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G W Zhu
GTID: 1318330518972883
Subject: Computer application technology
Abstract/Summary:
Over the past ten years, technologies such as the Internet, cloud computing and big data have flourished, giving current data new characteristics such as large volume, variety, dynamic evolution and loose interrelation. Traditional database management technology cannot manage such data, so a new data management technology that can harness it is needed. Dataspace technology was proposed for this purpose and has received wide attention in both the database community and industry. However, many problems in dataspace integration and dataspace query remain open or only partially solved: the lack of a data model that represents heterogeneous data and complex semantic relationships; the lack of an entity partitioning technique for dynamically evolving environments; the lack of a multi-dimensional indexing technique that supports highly skewed, large-scale and heterogeneous data; and the lack of an approximate query technique that can seamlessly search heterogeneous data with strong expressive power.

This dissertation focuses on data integration and data search in dataspace, aiming to manage structured, semi-structured and unstructured data uniformly and to retrieve these heterogeneous data seamlessly. It further aims to provide basic support for pay-as-you-go data integration and a best-effort dataspace search service. To address the problems above, the dissertation carries out the following research.

First, for the context-dependent heterogeneous data and the complex semantic relationships found in dataspace, the dissertation studies a dataspace representation model. After analysing, through case studies, the drawbacks of the conventional dataspace model (i.e., the interpreted object model), we propose a context-aware and complex semantic association network model, namely the COSAN model. In particular, (1) On the basis of the traditional interpreted object model, a representation method for context-aware heterogeneous data is formally defined. This method encapsulates structured, semi-structured and unstructured information as context-aware interpreted objects. (2) To overcome the weakness that traditional data models express only simple binary semantic associations, the binary associations are extended with a set of constraint components, such as context constraints, order constraints and aggregation constraints, so that complex semantic relationships can be expressed. (3) Experimental results on a public dataset (DBLP) demonstrate the feasibility and effectiveness of the proposed data model.

Second, because dataspace entities are rich in information, lag in their categories and evolve over time, the dissertation studies entity partitioning in dataspace and presents a dataspace entity partitioning approach based on evolutionary k-means clustering. Specifically, (1) An evolutionary k-means clustering framework is put forward based on the silhouette value and KL-divergence. It considers not only the quality of the current clustering (the snapshot cost) but also temporal smoothness with respect to several classical historical clusters (the history cost). (2) Combining rich intra-entity information with inter-entity historical occurrence patterns, we devise a similarity measure for dataspace entities in order to calculate the similarity between entities more accurately. (3) Following a heuristic rule, a similarity density-based evolutionary k-means clustering algorithm is proposed to address both initial center point selection and dataspace entity partitioning. (4) The framework is further extended to handle the case where the number of clusters changes over time and the case where new snapshot entities are inserted and old ones are removed over time. (5) Extensive experimental results on a real dataset (DBLP) show that our method is superior to the state-of-the-art approaches: it captures a high-quality current clustering while also robustly reflecting historical cluster memberships.

Third, because existing dataspace indexing approaches are not suited to highly skewed and large-scale data, the dissertation studies multi-dimensional dataspace indexing from the viewpoint of load balancing and partitioning. A multi-dimensional dataspace indexing method based on load balancing and query logs is proposed, aiming to keep the load balanced across all indexing sites, reduce the communication cost among them, and thereby improve dataspace query performance. In particular, (1) When the index is partitioned vertically, the token terms that appear frequently in the query logs and entities are aggregated in order to decrease the aggregation (or join) cost of the inverted lists involved in user queries. On this basis, combining hypergraph theory with the access pattern information between user queries and inverted lists, we reduce the vertical partitioning problem to a hypergraph partitioning problem so as to keep the vertically partitioned index load-balanced. (2) When the index is partitioned horizontally, combining hypergraph theory with the access pattern information between user queries and entities, we reduce the horizontal partitioning problem to a hypergraph partitioning problem so as to keep the horizontally partitioned index load-balanced. (3) Based on the vertical and horizontal partitioning strategies, a two-dimensional hybrid index is built. It is then extended to a three-dimensional index by adding an index replication strategy, motivated by query throughput and fault tolerance. (4) Extensive experiments on a real dataset (DBLP) show that our approach is better than the state-of-the-art approaches in terms of throughput, query response time and scalability.

Finally, because the query semantics or query structures of traditional dataspace query methods are relatively simple, the dissertation studies top-k approximate subgraph query in dataspace and proposes a neighborhood-based top-k approximate subgraph query approach. Specifically, (1) The problem of top-k approximate subgraph query in dataspace is formally defined, and a novel dataspace query language, denoted GQL, is put forward based on graph management theory. (2) Using the neighborhood information of vertex distances and the distribution information of edge labels, we present a neighborhood-based graph similarity function. (3) Based on the dataspace index proposed above and the neighborhoods of vertices, a neighborhood-based vertex matching algorithm is proposed to prune unpromising candidate matching vertices. (4) Taking into account the pruning strategy and the matching order of the vertices, we put forward a top-k approximate subgraph search algorithm. (5) Extensive experimental results on a real dataset (DBLP) show that our approach outperforms the state-of-the-art approaches in terms of effectiveness, efficiency and scalability.
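
As a rough illustration of the trade-off between snapshot cost and history cost named in the abstract's second contribution, the following minimal Python sketch combines the two into a single score. It is not the dissertation's formulation: the weighting parameter `alpha`, the averaging over historical distributions, the function names, and the idea of passing the snapshot cost in as a precomputed number (for example, one minus a mean silhouette value) are all assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    eps is a small smoothing constant (an assumption) that avoids log(0).
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def evolutionary_cost(snapshot_cost, current_dist, history_dists, alpha=0.7):
    """Hypothetical combined cost for one candidate clustering.

    snapshot_cost : cost of the current clustering alone (lower is better).
    current_dist  : cluster-size distribution of the current clustering.
    history_dists : cluster-size distributions of earlier clusterings.
    alpha         : assumed weight on the snapshot cost, in [0, 1].
    """
    if not history_dists:
        # With no history there is nothing to smooth against.
        return snapshot_cost
    # History cost: mean KL divergence from each historical distribution.
    history_cost = sum(
        kl_divergence(current_dist, h) for h in history_dists
    ) / len(history_dists)
    return alpha * snapshot_cost + (1 - alpha) * history_cost


if __name__ == "__main__":
    # A clustering that drifts from history is penalised relative to a stable one.
    stable = evolutionary_cost(0.4, [0.5, 0.5], [[0.5, 0.5]])
    drifted = evolutionary_cost(0.4, [0.9, 0.1], [[0.5, 0.5]])
    print(stable < drifted)
```

With equal snapshot costs, the candidate whose cluster distribution stays closer to the historical ones receives the lower combined score, which is the temporal-smoothness behaviour the abstract describes.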
Keywords/Search Tags:Dataspace, Data model, Entity partitioning, Multi-dimensional indexing, Top-k approximate subgraph query