Research Of Document Clustering For User Interest

Posted on: 2009-01-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Wang
GTID: 2178360248950004
Subject: Computer application technology
Abstract/Summary:
With the growth of document resources and web pages on the Internet, it has become difficult for people to find the information they need on the web. How to organize this large volume of documents effectively and help users reach the information they actually need is therefore a problem that the field of information retrieval urgently needs to solve.

Document clustering is an important technology in text mining. It has been widely used in information management, search engines, recommendation systems and other fields. The k-means algorithm is the most commonly used method in document clustering; it is simple and converges quickly. This thesis focuses on the study and improvement of the k-means algorithm.

First, to address the drawback that the k-means algorithm requires the final number of clusters to be specified in advance and selects its initial centers at random, a new initialization method based on a reference region is presented. In effect, the improved algorithm combines the k-means algorithm with density-based clustering. Experiments show that the improved algorithm produces better results than the traditional k-means algorithm while retaining the efficiency of the density-based algorithm.

Second, to address the drawback that the k-means algorithm tends to get stuck at a local maximum far from the optimal solution, an optimization based on local search is used to improve the algorithm. In keeping with the characteristics of text data, the clusters are repartitioned by moving a large portion of the data between them. This procedure performs a suitable number of iterations to enlarge the search space. Theoretical analysis and experimental results show that the optimization improves the traditional k-means algorithm effectively, and that its computation remains linear in the size of the document collection.

Finally, the technologies of document clustering and user interest modeling are studied in detail and integrated. A clustering system for user interest modeling, called CSUI (Clustering System of Users' Interest), is built. The system uses the improved k-means algorithm to cluster the web pages that users have viewed, and it outputs the users' interests in the form of a corresponding model.
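The abstract does not spell out how the reference region is constructed, so the following is only a minimal sketch of the general idea of seeding k-means from dense areas of the data instead of at random. It assumes documents are already represented as term-weight vectors and compared by cosine similarity. The names `density_seeds`, `radius_sim` and `kmeans`, the similarity threshold, and the toy vectors are all hypothetical and are not taken from the thesis.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def density_seeds(docs, k, radius_sim=0.8):
    """Hypothetical density-based seeding (an assumption, not the thesis's method).

    A document's density is the number of documents whose similarity to it is
    at least radius_sim. Seeds are taken in order of decreasing density,
    skipping any candidate that is too similar to an already chosen seed.
    Fewer than k seeds may be returned if the data has fewer dense areas.
    """
    density = [sum(1 for other in docs if cosine(d, other) >= radius_sim)
               for d in docs]
    order = sorted(range(len(docs)), key=lambda i: -density[i])
    chosen = []
    for i in order:
        if all(cosine(docs[i], docs[s]) < radius_sim for s in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return [list(docs[i]) for i in chosen]

def kmeans(docs, seeds, max_iter=100):
    """k-means on unit vectors: assign by highest cosine, then re-average."""
    centroids = [list(c) for c in seeds]
    labels = [None] * len(docs)
    for _ in range(max_iter):
        new_labels = [max(range(len(centroids)),
                          key=lambda c: cosine(d, centroids[c]))
                      for d in docs]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids

# Toy term-weight vectors: two obvious topics (illustrative data only).
docs = [normalize(v) for v in
        [[5, 1, 0], [4, 1, 0], [5, 0, 1], [0, 1, 5], [1, 0, 4], [0, 1, 4]]]
seeds = density_seeds(docs, k=2)
labels, centroids = kmeans(docs, seeds)
print(labels)
```

Because the seeds are chosen deterministically from dense areas, repeated runs give the same partition, which is one motivation for replacing random initialization. The local search refinement described in the abstract is not covered by this sketch.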
Keywords/Search Tags: Document Clustering, k-means, Reference Region, Local Search, User Interest Modeling