Font Size: a A A

Research On Similarity Search In Information Network

Posted on:2014-06-17Degree:DoctorType:Dissertation
Country:ChinaCandidate:M X ZhangFull Text:PDF
GTID:1108330434971260Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
There are kinds of entities in our real life, which consist of the massive, connected and complex networks, such networks are called information networks. The links among entities in information networks contain rich semantic information, which can be exploited for knowledge discovery. With information networks becoming massive and complex, there is a need for designing algorithmic tools and developing applications to exploit the underlying structure in the data.Many works on information networks have been devoted, such as clustering, com-munity mining, outlier detection, similarity search. Similarity search, which focuses on finding the top-k most similar entities for a given query entity, is a special important issue and draws extensive interests from various research fields, such as recommendation system, link prediction, proximity query processing. Many studies on similarity search over information networks have been devoted in recent years. Most existing methods usually require high space and time cost, such as SimRank, PSimRank, P-Rank. The storage and efficiency problems would be run into with information networks becoming massive.Information networks with x-star network schema are the information networks of important type in our real life, also called x-star networks, which consist of centers with connections among themselves and different type attributes linking to these centers. X-star networks are becoming ubiquitous. In this paper, we study similarity search in X-Star networks, which focuses on finding the top-k most similar centers for a given query center. We do a lot of work for resolving the problem of efficiency and storage of similarity computation, efficiency of on-line query processing, and the precision of the returned result. The major contributions of this paper are summarized as below.1. For resolving the efficiency and storage problem of the existing similarity computa-tion methods, we propose a new structural-based method, called NetSim, which can be used for computing center similarity in X-Star networks. In existing methods, matrices for storing similarities among different type entities need to be maintained while computing similarities, which usually require high space and time cost. Net-Sim measures the similarity between center entities based on the intuition that similar centers are usually linked with similar attributes. The attribute similarity is computed using the expected meeting probability over attribute network that is extracted from the whole structure of X-Star network. NetSim requires less space and time cost than existing methods. Extensive experiments on DBLP and Amazon datasets have been done, which demonstrate that NetSim requires less space and time cost than existing methods, and the effectiveness is acceptable.2. For supporting fast online query processing, we study the problem of top-k similar-ity search in X-Star networks. We firstly propose a naive online query processing method, called NetSim-baseline, and analyze the factors that affect the execution efficiency. Then, we develop an efficient pruning method, called NetSim-pruning, based on a proposed pruning-index. NetSim-pruning decreases the time cost of on-line similarity computation, and prunes candidate centers that are not promising, which makes the online query processing more efficient. Analysis on the mathemat-ical properties of NetSim-pruning is given, from which we can find that NetSim-pruning can decrease the online query processing time cost with little effectiveness loss. Extensive experiments on DBLP and Amazon datasets have been done, which demonstrate the effectiveness, efficiency and storage of NetSim-pruning through comparing with the state-of-the-art measures.3. For improving the effectiveness of similarity computation, we propose a structural-based similarity measure, called E-Rank, which measures the similarity between entities by using the expected meeting probability. Comparing with the many existing methods which consider only the meetings between two entities that walk along equal length paths, E-Rank addresses the meetings between two nodes walk along any possible length paths. Besides, E-Rank makes use of relationship strength to enhance effectiveness, which is not addressed in many existing methods. We have done extensive experiments on Enron e-mail network and high-energy physics theory citation network, which demonstrate the effectiveness of E-Rank by comparing with the state-of-the-art measures. By applying E-Rank to NetSim, we propose ENetSim for computing similarity between centers. Experiments on Amazon dataset show that ENetSim is more effective than NetSim.
Keywords/Search Tags:Similarity Search, Information Network, X-Star Schema, NetSim, E-Rank
PDF Full Text Request
Related items