Font Size: a A A

Research Of Efficient Frequent Subgraph Mining On Big Graph

Posted on:2016-05-07Degree:MasterType:Thesis
Country:ChinaCandidate:G FanFull Text:PDF
GTID:2308330464956803Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
In modern life, graph is used to describe the normal things more and more frequently, such as the composition of molecule structure, protein networks, social networks, semantic networks and the description of the biological information network. This all need graph, the graph can show the things more vividly and people can understand it better. Along with the development of the time, the development of the computer industry is faster, and the data scale is more and more big, the scale of the graph data is the same at the same time, maybe bigger. In the environment of the large scale of graph data nowadays, the way to find the information you need efficiently has great commercial value and scientific value, and it is always a contribution to the computer industry.Data mining has been a popular research topics in the world, especially when the coming of the era of big data, related technologies and algorithms of data mining is to get the attention and the research of many scholars, frequent subgraph mining is a basic knowledge in data mining, and it is mostly studied.Data statistics and analysis, information retrieval, machine learning and pattern recognition are the main Implementations. The data mining methods based on Apriori algorithm and Fp- growth are the main research direction and they are also classical. This essay is based on g Span and it is an improvement of g Span, g Span algorithm is applying on Hadoop, so it can accomplish graph mining distributedly.Firstly, in this paper, k-medodis algorithm is improved and it is called k-medodis++ algorithm. k- medodis++ algorithm selects initial center, improve the stability of the k-medodis++ algorithm and then we get a k cluster.Secondly,implementing the distributed storage of each cluster and setting copies for every data block. Through computing the balanced value, we judge whether a copy can be placed in one node, avoid the case that one’s balanced value is bigger than the given threshold.Thirdly, establishing a four layer index structure named MADI to replace the adjacency list. The subgraph extension and support calculation is based on the retrieval of the MADI index structure, so the complexity of algorithm and the number of disk IO is reduced.Finally, we can make sure some candidate subgraphs are min- DFS-code through extending subgraph orderly, so the steps of subgraph isomorphism and the complexity of g Span++ algorithm are reduced.
Keywords/Search Tags:Data Mining, Frequent subgraph, k-medodis clustering, Hadoop, MapReduce, gSpan
PDF Full Text Request
Related items