Clustering Based Net User Interest Mining

Posted on: 2013-02-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Ma
GTID: 1268330431459978
Subject: Circuits and Systems
Abstract/Summary:
With the development of network applications, the service model is shifting from an integrated, uniform model to a contribution-based, personalized one. A precondition for this shift is an in-depth understanding of the rules governing net users' information demands, so that these rules can guide the organization and adjustment of the information resources of information service systems and bring user requirements and system supply as close to consistency as possible. As one form of such a demand rule, net user interest is the foundation for building a new generation of information service systems whose resource organization is self-configuring.

This paper studies user interest mining algorithms based on text clustering, along with related questions, in an attempt to reduce the computational complexity of clustering algorithms, realize soft clustering, and explore new methods for solving clustering problems. The goal is to extract user interest. The object of study is the Chinese text of the web pages that users access, and the theories and techniques used include complex network theory, graph theory, stochastic process theory, artificial immune network theory, and Chinese semantic computation. The main content consists of the following four parts:

(1) The interest mining model. The net user interest model describes the behavioral regularities of individual users and user groups on the network, and the mining model is a set of standardized processes for obtaining the user interest model. Following the process by which Web users access Web sites, this paper proposes a conceptual model of net user interest mining, based on the information process model of full information theory.
The core of this model is a processing procedure that describes and mines the user's interest pattern from the angle of information cognition, expressed through grammatical and semantic cognition. Its important feature is that it unifies the multi-level, multi-perspective user interest process in a single framework. To guide the mining task in detail, this paper gives mining models for the user's interest pattern and migration pattern based on clustering analysis. Finally, experiments demonstrate the rationality of the models.

(2) Dimension reduction algorithm in text clustering. To address the typical problem of high dimensionality, we use the small-world network model, which can describe the dynamic properties and structural factors of both natural and artificial systems. A K-nearest coupling algorithm is used to construct a word network graph for the text, in which nodes stand for the words in the text and edges stand for neighbor relationships by distance. We also show the importance of changes in the clustering coefficient and the shortest path length for measuring words. By calculating this variation for each word, we can verify whether the graph has small-world features and select the feature words. Experimental results show that the method is reasonable and offers a new way to extract text features.

(3) Research on clustering algorithms. Although there are many good algorithms for text clustering analysis, it is hard to guarantee consistency between the resulting text categories and the requirements, because of word ambiguity and the sparsity of text features. It is therefore necessary to study clustering algorithms from other technical directions. Based on a simple description of the basic principles of biological immunity and the clonal process, a polyclonal clustering algorithm with self-adaptive features is put forward.
The main idea of the algorithm is to introduce various operators of the artificial immune system into the clustering process and to adjust the number of clusters automatically through an affinity function. A recombination operator is introduced to increase the diversity of the antibody population, which broadens the search for the global optimum and avoids premature convergence of the population. A non-uniform mutation operator is introduced to enhance adaptability and improve local search; at the same time, it speeds up the convergence of the algorithm. Experimental results show that the proposed algorithm achieves reasonable clustering.

This paper also introduces complex network theory into text clustering analysis and proposes a new clustering method based on an algorithm for detecting community structure in complex networks. In this method, text similarity measures are defined through the HowNet similarity calculation formula, an association graph is constructed according to the similarity between text documents, and the Newman algorithm is used to cluster and analyze the texts. The method is suitable for large-scale problems.

To address the hard classification and high computational complexity of conventional algorithms, we take advantage of the suffix tree's ability to express relations between different words, its short traversal time, and its incremental updating, and bring the suffix tree model into text clustering. We present a suffix tree clustering algorithm based on semantic computation and a clustering algorithm based on the suffix net. Experimental results show that both algorithms realize soft clustering, have low time complexity, and produce highly readable cluster labels.

(4) Finding the interest model and drift pattern of net users.
The user interest model takes the form of a group of feature words with significant category characteristics. The interest of a net user is generated by calculating the frequency of identical or similar words across most of the texts. The interest drift pattern of a net user is a dynamic expression of interest over time. To address the problem of texts with multiple themes, this paper proposes a method for obtaining the interest sequence based on the Hidden Markov Model. In this method, a Hidden Markov Model of net user interest is built from the access sequence and the interests, and an algorithm for the decoding problem is used to obtain the best interest sequence. A sequential pattern mining algorithm then yields the frequent sequence patterns, which are the interest drift patterns. In essence, such a pattern is a kind of interest association rule with sequential features. To improve mining efficiency, a mining algorithm based on the Frequent Link-Access Tree (FLaAT) is used to mine the frequent patterns. Its advantages include fast processing and incremental mining through refreshing the structure sequence of the FLaAT. Experiments show that the proposed method is viable and that the mined interest patterns not only show changes in interest but also reflect the relationships and the rules of change between interests.
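
The decoding step in part (4), which recovers the best interest sequence from an access sequence, can be illustrated with a minimal sketch. The abstract does not name the decoding algorithm or give any model parameters, so the use of Viterbi decoding here is an assumption, and the interest categories, page types, and probabilities are hypothetical toy values rather than the dissertation's data.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # Log-probabilities avoid numeric underflow on long access sequences.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path score into s.
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: hidden interests and observed page types.
states = ["sports", "finance"]
start_p = {"sports": 0.6, "finance": 0.4}
trans_p = {
    "sports": {"sports": 0.7, "finance": 0.3},
    "finance": {"sports": 0.4, "finance": 0.6},
}
emit_p = {
    "sports": {"match_report": 0.6, "stock_news": 0.1, "portal": 0.3},
    "finance": {"match_report": 0.1, "stock_news": 0.6, "portal": 0.3},
}
accesses = ["match_report", "portal", "stock_news", "stock_news"]

decoded = viterbi(accesses, states, start_p, trans_p, emit_p)
print(decoded)
```

A decoded sequence of this kind is the sort of input a sequential pattern miner would scan for frequent interest transitions. The sketch assumes every probability is nonzero, since it takes logarithms directly.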
Keywords/Search Tags:net user interest, interest mining model, feature dimension reduction, text clustering, semantic similarity calculation