Font Size: a A A

Research And Application Of Graph Mining Algorithms

Posted on:2017-10-14Degree:MasterType:Thesis
Country:ChinaCandidate:K Y GuiFull Text:PDF
GTID:2348330518995544Subject:Information and Communication Engineering
Abstract/Summary:PDF Full Text Request
In recent years,with the rise of social networks,graph mining is receiving more and more attention.Researches about graph mining are increasing in SigKDD,ICDM,SiamDM and other academic conferences year by year.In graph mining,maximal clique enumerating(MCE)is a fundamental problem which has both research and practical value.MCE can be applied in many fields,such as social networks analysis,social hierarchy estimation through email networks,study of structures of behavioral and cognitive networks,statistical analysis of financial networks,clustering of dynamic networks,detection of terrorist incidents and emergencies,etc.This thesis regards MCE as the access point to the study of graph mining,conducts research on MCE and applies it to a practical problem.The work of this thesis includes three parts:1.Studies and implements Base BK algorithm,Improved BK series algorithms and Kose series algorithms,does experiments on these algorithms and discusses the advantages and disadvantages of them.First,research on Base BK algorithm is conducted,which is one of the earliest algorithms to enumerate all maximal cliques recursively.Because the recursive space tree of Base BK algorithm is too large,Improved BK series algorithms are proposed which prune the recursive space tree by selecting a flag node via various heuristic methods.Two famous algorithms of the series are Improved BK algorithm with least/minimum counter and random CANDIDATES&NOT Improved BK algorithm,which are both studied and implemented in this thesis.Besides,Kose series algorithms also enumerate all maximal cliques recursively.They avoid the generation of unnecessary sub-groups,but increase the consumption of memory.As for memory consumption,Kose RAM algorithm and Kose Disk algorithm store data in memory and on disk respectively.Last,seven classical MCE algorithms are all implemented in single-process mode and over 1000 experiments are conducted to make it clear how to choose MCE algorithms under different circumstances.2.Points out that the algorithm proposed by the paper Fast Algorithms for Maximal Clique Enumeration with Limited Memory(FAMCELM)will fail in a specified case.Then proposes a solution with limited memory,which is proved to be feasible theoretically and experimentally.In addition,a study on the SeqMCE algorithm proposed by FAMCELM is done.The algorithm can enumerate all maximal cliques in large-scale graphs and has three advantages:it reduces the consumption of memory,it reduces the cost of CPU computation and it reduces the overall running time by parallelizing MCE based on a partitioning strategy.However,the SeqMCE algorithm may not run well when there are several nodes with extremely large degrees in the graph.The thesis analyzes the causes of the problem and offers a solution to it which is theoretically proved.With several optimizations,the Improved SeqMCE algorithm can run well in any case,while it may enumerate several maximal cliques more than once.3.In order to get the number of distinct phone numbers that have sent spam messages during a period,a hash algorithm based on prior probability(GHash)is proposed which relies on Improved BK algorithm and Improved SeqMCE algorithm.As the preprocessing of this algorithm,the statistical regularity of strings with the same length(i.e.phone numbers)must be found.The thesis regards each position of a string as a a node of a graph,then transfers this problem to finding the longest chain in the graph,and enumerates all maximal cliques in its closure.Since MCE is an NP problem which takes too long,the thesis tries to get a subset by calculating the entropy of all strings,which performs well when the strings are phone numbers.When counting the number of distinct strings with the same length,GHash uses much less memory than Bitmap Algorithm does,takes much less time to search an object than Red-Black Tree,AVL Tree and Trie do,and gets more precise result comparing to Bloom Filter.Experiments on a data set with 30 million phone numbers show that GHash is superior to traditional hash algorithms in both search speed and hash collisions.
Keywords/Search Tags:graph mining, maximal clique, BK algorithm, Kose algorithm, SeqMCE algorithm, Hash algorithms
PDF Full Text Request
Related items