Font Size: a A A

Research On Key Technologies Of Mining Communities In Web Information Network

Posted on:2012-10-24Degree:DoctorType:Dissertation
Country:ChinaCandidate:F L HuangFull Text:PDF
GTID:1488303356993039Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Billions of web users in the world have been becoming the unchallengeable main players in the free, open and sharing Web 2.0 cyberspace. Web information networks represented by blog network, email network, wiki network and etc have spread to practically all corners of human lives. Such miscellaneous web information networks reflect the complicated production and life structure of human being. It is an enormously challenging undertaking to discover the potentially valuable and hidden communities from the sophisticated relational structures, which eventually would contribute to enhancement of information service, public security and the analysis of complex networks. The result of such undertaking would have general values both in theory and practice.This dissertation focuses on mining communities in web information network, and further studies key technique models and algorithms in mining web information network communities in detail. It proceeds along such research line starting from content characteristics of web information carrier, then covering structure characteristics, and ending with content combined with structure characteristics. The primary contributions of the dissertation are as follow:1. Based on granular theory and article structure theory, a multi-granular hierarchical representation model (MHRM) integrating document representation and clustering is proposed in order to avoid inferior web document communities that may result from web documents clustering algorithms based on traditional vector space model. In MHRM, we have introduced paragraph level granular knowledge to bridge the gap between document level granular knowledge and term level granular knowledge in web document representation, and designed two optional strategies to deal with zero similarities among clustering objects in paragraph clustering, i.e., tolerance-rough-set based strategy and ontology based strategy, and presented a measure to evaluate the contribution of the paragraph level granular knowledge to topic of the document level granular knowledge in document clusteirng. It turned out that MHRM can effectively discover the true communities hidden in web documents.2. A genetic algorithm, EBSGA, which is founded on eugenics and a particle swarm algorithm known as MLCPSO, which is based on democratic leadership, are designed to improve search ability of natural computing algorithms from the viewpoint of population diversity. Our simulation results have shown that both algorithms possess powerful optimizing ability. In order to discover communities hidden in newsgroup network, we have introduced SVD technique to construct latent semantic subspace of newsgroup, and then presented a hybrid optimization algorithm which consists of EBSGA and MLCPSO. This hybrid algorithm is proven to be able to leverage powerful global search ability of PSO and local search ability of GA, and can effectively detect the communities hidden in newsgroups.And a hybrid optimization algorithm, EBSGA/MLCPSO, which consists of EBSGA and MLCPSO is presented, which can leverage powerful global search ability of PSO and local search ability of GA, in order to discover communities hidden in newsgroup network, we have introduced SVD technique to construct latent semantic subspace of newsgroup. Experimental results from real datasets indicate that above three algorithms can effectively detect the communities hidden in newsgroups, but EBSGA/MLCPSO performs best.3. A discrete PSO algorithm to mine non-overlapping communities, CDPSO, is devised. In this algorithm we have introduced community modularity as fitness of particle, devised a neighbor-list-based particle encoding scheme and an improved discrete particle position update strategy, and analyzed the convergence of the strategy. CDPSO can quickly and effectively discover the intrinsic community structure in networks without any domain information. Then we introduced the concept of line graph, and further proved the property that nodes partition of line graph is equivalent to nodes cover of the corresponding origin graph. Based on CDPSO and the property, an overlapping communities mining algorithm, LGPSO, is proposed which can effectively discover overlapping communities.4. After the basic idea of typical spectral clustering algorithms is analyzed, advantages and disadvantages of each algorithm are pointed out, and performance of the typical algorithms is compared in discovering web communities. And more, we have integrated spectral graph theory and rough set theory, and proposed the overlapping communities mining algorithm RSC, which can describe communities membership of network nodes with lower and upper approximation, describe the network nodes shared by different communities with boundary, and so achieve effective discovery of overlapping communities by optimizing modularity of overlapping communities.5. Heterogeneity and massiveness of online social networks(OLNs)are analyzed, and OLNs and its mining task is formally defined. In consideration of variety of community definitions, and the fact that a community structure conforming to some given definition is applicable to different places and, further, based on the analysis of current heuristic mining approaches, we have presented a general heuristic framework for communities discovery, which has good openness and feasibility.6. An effective algorithm to discover communities hidden in online chatting rooms based on content and thread structure of chatting text stream is proposed. The algorithm can effectively discover hidden relations of the involved chatters with the following two techniques: one is to semantically compute chatting text content similarity, and another is to analyze thread structure association of dialogs in text stream.
Keywords/Search Tags:web information network, communities mining, clustering, natural computation, spectral graph theory
PDF Full Text Request
Related items