Researches On Large-Scale Mining Algorithms Of User Behaviors In Internet

Posted on: 2012-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhou
Full Text: PDF
GTID: 2178330338492139
Subject: Signal and Information Processing
Abstract/Summary:
With the rapid development of computer technology and the Internet, more and more user-centered applications have appeared on the Web. These applications have collected massive amounts of user behavior data over several years, and the data is growing exponentially. This data contains a large amount of information about users. If we can extract useful knowledge from it and uncover the behavior patterns behind it, Internet applications can provide a better user experience and companies can improve their market competitiveness. In this thesis, we use data mining methods to analyze user behavior on the Internet and to find its hidden regularities and models. Our research covers two aspects: user tagging behavior in Web 2.0 social tagging systems, and user querying behavior in Internet search engines.

(1) In social tagging systems, users can freely mark resources with different tags, and these tags are then used to organize, classify and retrieve information resources. However, free tagging has many problems, such as imprecise descriptions of information, disorganized tags and ambiguous tag semantics. Existing research often uses data mining techniques such as clustering to remedy tag redundancy and ambiguity. Current tag clustering algorithms are mainly based on the co-occurrence of tags on different items, but their clustering precision and recall are relatively low, and they can only calculate the similarity between two tags. We propose a new tag clustering algorithm in this thesis, which introduces an object-based feature vector to characterize a single tag. This feature vector represents a tag precisely and yields a more accurate similarity between two tags through the cosine similarity formula. The K-Means algorithm is then used to cluster the users' tags.
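The tag representation described above can be illustrated with a minimal sketch. It assumes that the "object-based feature vector" of a tag is a sparse vector counting how often the tag was applied to each object (resource); the abstract does not spell out the exact definition, so this reading, the function names `build_tag_vectors` and `cosine`, and the toy annotations are all hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def build_tag_vectors(annotations):
    """Map each tag to a sparse vector {object_id: count}.

    `annotations` is an iterable of (user, tag, object) triples.
    Assumption: a tag is characterized by the objects it was applied to.
    """
    vectors = defaultdict(Counter)
    for _user, tag, obj in annotations:
        vectors[tag][obj] += 1
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical toy data: (user, tag, object)
annotations = [
    ("u1", "python", "book1"),
    ("u2", "python", "book2"),
    ("u1", "programming", "book1"),
    ("u3", "programming", "book2"),
    ("u2", "cooking", "book3"),
]

vectors = build_tag_vectors(annotations)
# Tags applied to the same objects come out as highly similar ...
sim_related = cosine(vectors["python"], vectors["programming"])
# ... while tags with no object in common have similarity zero.
sim_unrelated = cosine(vectors["python"], vectors["cooking"])
```

Under this assumption, the pairwise similarities (or the vectors themselves) could then be passed to a K-Means implementation to group redundant or ambiguous tags.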
The experiments show that the algorithm proposed in this thesis produces a more accurate clustering result. Finally, we apply the algorithm to the "Library Interactive System for Education and Research" at our university to verify its practicability.

(2) In search engines, the backend logs record users' input queries and clicked URLs as the interaction information between users and the search engine. By mining user behavior in the search logs, we can find regular patterns, collect statistics, and then use them to improve the ranking of search results. However, because search engine log data is massive, traditional clustering methods cannot handle the analysis of user behavior in search engines. To address this problem, we model user behavior in the search engine with a tripartite graph, characterize user input queries with feature vectors, and propose a distributed K-Means clustering algorithm based on inverted-table lookups and MapReduce. The experiments show that this algorithm can cluster massive numbers of user queries and performs effectively on large-scale data sets. Finally, based on the clustering result, we analyze the characteristics of user behavior in current search engines.
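One way to picture a K-Means step that combines an inverted table with MapReduce is sketched below, run in a single process. It is not the thesis's algorithm: it assumes each query is a unit-length sparse vector over its clicked URLs, that the inverted table maps each URL to the centroids with non-zero weight on it, and that queries sharing no URL with any centroid are simply left unassigned. All function names and the `.example` URLs are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def normalize(vec):
    """Scale a sparse vector to unit length so dot product equals cosine."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {k: w / norm for k, w in vec.items()} if norm else {}


def build_inverted_index(centroids):
    """Inverted table: URL -> ids of centroids with non-zero weight on it."""
    index = defaultdict(set)
    for cid, centroid in enumerate(centroids):
        for url in centroid:
            index[url].add(cid)
    return index


def map_assign(query_vec, centroids, index):
    """Map step: emit (centroid_id, query_vec) for the most similar centroid.

    Only centroids that share at least one URL with the query are scored;
    with non-negative weights every other centroid has similarity zero.
    """
    candidates = set()
    for url in query_vec:
        candidates |= index.get(url, set())
    if not candidates:
        return None  # left unassigned in this sketch

    def similarity(cid):
        return sum(w * centroids[cid].get(u, 0.0) for u, w in query_vec.items())

    return max(sorted(candidates), key=similarity), query_vec


def reduce_update(assigned_vecs):
    """Reduce step: average one cluster's vectors and re-normalize."""
    total = defaultdict(float)
    for vec in assigned_vecs:
        for url, w in vec.items():
            total[url] += w
    return normalize({u: w / len(assigned_vecs) for u, w in total.items()})


def kmeans_iteration(queries, centroids):
    """One K-Means iteration expressed as map, shuffle (group by key), reduce."""
    index = build_inverted_index(centroids)
    groups = defaultdict(list)
    for vec in queries.values():
        emitted = map_assign(vec, centroids, index)
        if emitted is not None:
            groups[emitted[0]].append(emitted[1])
    # A centroid that received no queries is kept unchanged.
    return [
        reduce_update(groups[cid]) if cid in groups else centroids[cid]
        for cid in range(len(centroids))
    ]


# Hypothetical toy click data: query -> {clicked URL: click count}
queries = {
    "python tutorial": normalize({"py.example": 3, "docs.example": 1}),
    "learn python": normalize({"py.example": 2}),
    "pasta recipe": normalize({"recipes.example": 4}),
    "easy pasta": normalize({"recipes.example": 1, "food.example": 1}),
}
centroids = [queries["learn python"], queries["pasta recipe"]]
new_centroids = kmeans_iteration(queries, centroids)
```

In a real MapReduce job the map and reduce functions would run on separate workers and the current centroids (with their inverted table) would be distributed to every mapper; the loop here only imitates that data flow.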
Keywords/Search Tags: feature vector, data mining, user behavior analysis, K-Means, distributed system, MapReduce