Font Size: a A A

Similarity Search On Heterogeneous Information Networks

Posted on:2019-03-17Degree:MasterType:Thesis
Country:ChinaCandidate:S P MaFull Text:PDF
GTID:2428330596460874Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
In this big data age,extensive requirements emerge in data management and analysis fields.Heterogeneous information networks(HIN)are widely used as data models due to its rich semantics in expressing rich data correlations.It is now widely used in many fields such as bibliographic database,social network,chemical and biological information system,etc.The data similarities other than the exact matches are required in many data mining,data analysis and machine learning algorithms.Graph edit distance(GED)is one of the feasible methods on HINs similarity measuring.In this thesis,we firstly extend the concept of GED in homogenous graphs to the heterogeneous information networks by introducing newly defined edit operations.Then we define the HIN edit distance.As exact edit distance computation is NP-Hard in general,two mapping distances as approximation methods of GED are proposed.The star-based approximation is a revised version of the graph edit distance approximation algorithm used in homogeneous graphs,based on which the lower bound and upper bound criteria are deduced accordingly;while the metapath-based approximation is an innovative strategy designed specifically for HINs,in which the metapaths are utilized as basic semantic components.Additionally,the metapath-based upper bound and lower bound,both of polynomial time complexity,are introduced to improve the performance of full database similarity search.Since the GED calculation focuses on the structural information of HINs and ignores the semantic information,in some cases,the HIN edit distance does not accurately measure the similarities between two heterogeneous information networks.This thesis proposes a similarity measure method for HINs based on feature structure,and defines three basic feature structures for expressing complex semantic information in heterogeneous information networks.This thesis defines the domain-structure based on the importance of feature structure,and proposes two feature structure sequence extraction methods.Then,a HIN can be convert into a feature structure sequence.Finally,this thesis also introduces the weight function,and proposes a weighted domain-structure sequence similarity algorithm,which converts the similarity calculation of two HINs into the similarity calculation of two feature structure sequences.Finally,comprehensive experimental studies are conducted on both real and synthetic datasets.The results show our metapath based approach outperform the existed star-structure based method extended to HINs in terms of computational efficiency,bound tightness,space occupation and similarity query filtering.The feature structure based similarity search on heterogeneous information networks also shows its better precision than HIN-ED.The feature structure sequence extraction algorithm based on the residual network propagation degree outperforms the feature structure sequence extraction algorithm based on structural connectivity.The weighted domainstructure sequence similarity algorithm also shows excellent accuracy,recall and expansion in HIN similarity search applications.
Keywords/Search Tags:Heterogeneous Information Networks, Graph Edit Distance, Metapath, Mapping Distance, Feature Structure Sequence
PDF Full Text Request
Related items