Font Size: a A A

Research On Large-scale Graph Data Similarity Query And Classification Techniques

Posted on:2017-09-07Degree:DoctorType:Dissertation
Country:ChinaCandidate:J PanFull Text:PDF
GTID:1310330542477142Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Graph is a representation model describing objects and the relationships among objects.All produced structured and unstructured relation data can be directly or indirectly described as graph models in academia and industry.The target of graph similarity query is to find all similar graph pairs from two different graph datasets,or to find all graphs in a given graph dataset,that are similar with the query graph.The former is called as the graph similarity join query,and the latter is known as the graph similarity search query.The target of graph classification is to learn classification models from training graph datasets which are used to predict the class label of an unlabeled graph.Both graph similarity query and classification techniques have important practical applications in many fields,such as chemistry,bioscience and social networks.Although graph similarity query and analysis techniques are very import in practical applications,the corresponding research works face two severe challenges.The first challenge is caused by the multi-graph complexity.Multi-graph is a representation model describing combinational relationships between objects and their components.A multi-graph is a set consisting of many graphs.Although multi-graph similarity query and analysis techniques are very import,it is not trivial to solve the multi-graph similarity query and classification problems because of the multi-graph structural complexity.The second challenge is caused by the large-scale property of graph datasets.Because existing processing techniques for large-scale graph data focus on processing a single big graph,these techniques are less help to be directly utilized for processing a large amount of graphs or multi-graphs.In addition,conducting graph or multi-graph similarity queries and classification over massive graphs desires distributed computation solutions,which faces many technical challenges.In this dissertation,we aim to solve the actual problems of graph similarity query and classification from academia and industry.To be specific,we analyze the challenges faced by graph and multi-graph similarity queries and analysis,study the similarity query and classification problems for a large number of graphs or multi-graphs,and propose some efficient solutions which meet the actual demands of academia and industry.Our main contributions are summarized as follows.(1)We propose efficient and effective solutions to solve the problem of large-scale graph similarity join query.We design a scalable prefix filtering method,which fits big q-gram alphabet.We propose a parallel algorithm,i.e.MR-GSimJoin,based on a scalable prefix filtering technique and the MapReduce framework to solve the large-scale graph similarity join query problem.Meanwhile,multiple techniques are adopted in our solution framework to optimize the MR-GSimJoin algorithm,including the compressing technique,the two rounds of data access technique and the composite key technique.At last,the effectiveness and efficiency are verified by a series of experiments.(2)We propose corresponding solutions to solve the large-scale multi-graph similarity search query problem.We propose a multi-graph distance metric and optimize the algorithm of calculating the distance.In addition,we design an incremental multi-layer inverted index and multiple low bound pruning strategies.Based on the multi-layer inverted index and low bound pruning strategies,we propose an efficient algorithm,called MGSS,to solve multi-graph similarity search query problem.Meanwhile,based on MGSS and the MapReduce framework,we derive a parallel algorithm,named MR-MGSS,to solve large-scale multi-graph similarity search query problem.The localization strategy is used to optimize the MR-MGSS algorithm.Our optimized strategy not only reduces the communication cost,but also solves the load imbalance problem of map task to a certain extent.Finally,a serial of experiments are conducted to evaluate the effectiveness and efficiency of the proposed MGSS and MR-MGSS.(3)For the supervised large-scale multi-graph classification problem,we design a couple of effective solutions.We propose an algorithm based on the MapReduce framework,called ME-MGC,to solve the supervised large-scale multi-graph classification problem.Both inverted index technique and reuse technique are applied to improve the efficiency of ME-MGC.Extreme learning machine,i.e.ELM,is adopted to improve the classification performance.Meanwhile,the classification performance with different hidden node numbers of ELM is studied.Finally,we evaluate the effectiveness and efficiency of ME-MGC through a series of experiments.(4)We propose some solutions to process the semi-supervised large-scale multi-graph classification problem.We design a score function to evaluate feature subgraph values by considering labeled multi-graphs,unlabeled multi-graphs and two layers label constraints properties.Based on this score function,MGSSL algorithm is proposed to solve the semi-supervised small-scale multi-graph classification problem.An upper bound pruning strategy is derived to improve the efficiency of MGSSL.Then,we propose an algorithm based on MGSSL and the MapReduce framework,named MR-MGSSL,to solve the semi-supervised large-scale multi-graph classification problem.Both the reuse and the localization strategies are utilized to optimize MR-MGSSL.Finally,a series of experiments are conducted to verify the effectiveness and efficiency of MGSSL and MR-MGSSL.To sum up,in this dissertation,we study the large-scale graph similarity join query problem,large-scale multi-graph similarity search query problem,supervised large-scale multi-graph classification problem and semi-supervised large-scale multi-graph classification problem.Some efficient solutions are proposed.Experimental results show that our proposals have better query processing performances and higher classification accuracy than the state-of-the-art solutions.
Keywords/Search Tags:graph, multi-graph, large-scale, similarity queries, classification, MapReduce
PDF Full Text Request
Related items