Study And Implementation Of Automatic Parameter Setting For Document Clustering

Posted on: 2006-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: M Zhang
GTID: 2168360155971717
Subject: Computer software and theory
Abstract/Summary:
As document resources in multimedia databases and on the Web keep growing, processing documents by hand can no longer keep pace with that growth or meet people's requirements. Documents need to be organized in an effective form, both to support information retrieval, pattern discovery and recommendation, and to prepare for categorizing newly arriving documents. Document clustering techniques address this need. Document clustering separates a document set into groups such that the documents in each group share the same or related topics. Its purpose is to generate clusters whose documents are as topic-related as possible within a cluster and as topic-unrelated as possible across clusters.

Existing work transforms unstructured documents into structured data objects and then applies classical clustering algorithms to them. The main algorithms used for document clustering are partitioning methods such as K-Means and K-Medoids, hierarchical methods such as Hierarchical Agglomerative Clustering (HAC), neural-network-based methods such as the Self-Organizing Map (SOM), and model-based clustering algorithms. All of these algorithms have disadvantages: some require input parameters that are difficult for users to set, and some have low time efficiency.

This thesis first proposes the concept of the "Maximal Sequential Frequent Phrase (MSFP)". Unlike the TFIDF method, which neglects the relationships between terms, MSFP takes those relationships into account and preserves the order of terms. This gives better quality in term selection, in preparation for the subsequent clustering step.

The thesis then studies methods for setting the input parameters of document clustering algorithms automatically. For the K-Means algorithm, it proposes a method that determines the input parameter K by multi-sampling.
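The abstract does not say how each sample is scored or how the per-sample estimates are combined, so what follows is only a minimal sketch of the multi-sampling idea under stated assumptions. Each random sample is clustered with plain K-Means for increasing K, the smallest K whose within-cluster squared error drops below a fixed fraction of the K = 1 error is taken as that sample's estimate, and the final K is a majority vote over the samples. The function names, the `ratio` threshold, the farthest-first seeding and the voting rule are all hypothetical choices made for illustration, not the thesis's method.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sse(points, k, iters=20):
    """Run plain K-Means and return the final sum of squared errors.

    Seeding is farthest-first (an assumption of this sketch) so that the
    result is deterministic for a given list of points.
    """
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def estimate_k(sample, k_max=6, ratio=0.1):
    """Smallest K whose SSE is at most `ratio` times the SSE for K = 1."""
    base = kmeans_sse(sample, 1)
    for k in range(1, k_max + 1):
        if kmeans_sse(sample, k) <= ratio * base:
            return k
    return k_max


def choose_k(points, n_samples=9, sample_size=30, seed=0):
    """Estimate K on several random samples and return the majority vote."""
    rng = random.Random(seed)
    votes = Counter(
        estimate_k(rng.sample(points, min(sample_size, len(points))))
        for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]
```

For document data, `points` would be the term-weight vectors of the documents. Working on small samples keeps each trial cheap, and voting across samples reduces the influence of any single unrepresentative draw.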
In addition, a scalar factor taken from SOM is introduced into the clustering process; it alters the mean value of each cluster during the partitioning step of K-Means. With these two improvements, the K-Means algorithm requires no input parameter and still produces good results.

To eliminate the sensitivity of K-Means to outliers and to further improve clustering efficiency and performance, the thesis also applies a density-based clustering algorithm to document clustering. For this purpose, it proposes a novel method that determines the parameters by polynomial fitting. With the help of this automatic parameter-setting method, clusters are contracted step by step until fine clusters are obtained.

Experiments show that automatic parameter setting in document clustering produces more satisfactory clustering results and improves clustering efficiency.
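The abstract gives no formal definition of MSFP, so the following is a minimal sketch of one plausible reading rather than the thesis's algorithm. It assumes a phrase is a contiguous run of words, that "frequent" means occurring in at least `min_support` documents, and that "maximal" means not contained in any longer frequent phrase. The function names and the `min_support` and `max_len` parameters are hypothetical.

```python
from collections import defaultdict


def frequent_phrases(docs, min_support=2, max_len=5):
    """Contiguous word sequences that occur in at least min_support documents."""
    support = defaultdict(set)  # phrase -> ids of documents containing it
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                support[tuple(words[i:i + n])].add(doc_id)
    return {p for p, ids in support.items() if len(ids) >= min_support}


def contains(longer, shorter):
    """True if `shorter` appears as a contiguous, ordered run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def maximal_phrases(docs, min_support=2, max_len=5):
    """Frequent phrases that are not part of any longer frequent phrase."""
    frequent = frequent_phrases(docs, min_support, max_len)
    return sorted(
        p for p in frequent
        if not any(len(q) > len(p) and contains(q, p) for q in frequent)
    )
```

On the three toy documents "document clustering groups similar text", "we study document clustering methods" and "clustering document sets is hard", `maximal_phrases` returns `[("document", "clustering")]`. The single words "document" and "clustering" are frequent but absorbed by the longer phrase, and the reversed order "clustering document" appears in only one document, which illustrates how term order is preserved where a bag-of-words TFIDF representation would discard it.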
Keywords/Search Tags: Data Mining, Document Clustering, Term Selection, Automatic Parameter Setting