Study And Implementation Of Automatic Parameter Setting For Document Clustering

Posted on:2006-04-11

Degree:Master

Type:Thesis

Country:China

Candidate:M Zhang

GTID:2168360155971717

Subject:Computer software and theory

Abstract/Summary:

As document resources in multimedia databases and on the Web keep growing, processing documents by hand can no longer keep pace with that growth or meet users' needs. What is needed is a way to organize documents effectively, both to support information retrieval, pattern discovery, and recommendation, and to prepare for categorizing newly arriving documents. Document clustering techniques address this need. Document clustering partitions a document set into groups such that the documents in each group share the same or related topics. Its goal is to produce clusters whose documents are as topically related as possible within a cluster and as topically unrelated as possible across clusters.

Existing work transforms unstructured documents into structured data objects and then applies classical clustering algorithms to them. The main algorithms used for document clustering are partitioning methods such as K-Means and K-Medoids, hierarchical methods such as Hierarchical Agglomerative Clustering (HAC), neural-network-based methods such as the Self-Organizing Map (SOM), and model-based clustering algorithms. All of these have drawbacks: some require input parameters that are difficult for users to set, and others are too slow.

This thesis first proposes the concept of the "Maximal Sequential Frequent Phrase" (MSFP). Whereas the TFIDF method ignores the relationships between terms, MSFP takes them into account and preserves the order of terms. This yields better term selection in preparation for the subsequent clustering step.

The thesis then studies methods for setting the input parameters of document clustering algorithms automatically. For the K-Means algorithm, a method is proposed that determines the input parameter K by multi-sampling. In addition, a scalar factor borrowed from SOM is introduced into the clustering process to adjust the mean of each cluster during the separation step of K-Means. With these two improvements, K-Means requires no input parameters and produces good results.

To eliminate the sensitivity of K-Means to outliers and to further improve clustering efficiency and quality, the thesis applies a density-based clustering algorithm to document clustering. For this purpose, a novel method is proposed that determines the parameters by multinomial fit. With the help of this automatic parameter setting method, the clusters are contracted step by step until fine clusters are obtained.

Experiments show that automatic parameter setting in document clustering produces more satisfactory clustering results and improves clustering efficiency.
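
The abstract names the MSFP concept but does not give its formal definition. As a rough illustration only, the sketch below assumes that a phrase is a contiguous, ordered run of terms, that "frequent" means it occurs in at least `min_support` documents, and that "maximal" means it is not contained in any longer frequent phrase. The function names, the `min_support` threshold, and the `max_len` cap are hypothetical choices for this example, not taken from the thesis.

```python
from collections import defaultdict


def frequent_phrases(docs, min_support, max_len=5):
    """Map each contiguous phrase (tuple of terms) to its document frequency,
    keeping only phrases found in at least min_support documents.
    Assumption: support is counted per document, not per occurrence."""
    support = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                support[tuple(tokens[i:i + n])].add(doc_id)
    return {p: len(ids) for p, ids in support.items() if len(ids) >= min_support}


def is_subphrase(short, long):
    """True if `short` appears as a contiguous, order-preserving run in `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def maximal_phrases(freq):
    """Keep only frequent phrases not contained in a longer frequent phrase."""
    phrases = list(freq)
    return {
        p: freq[p]
        for p in phrases
        if not any(len(q) > len(p) and is_subphrase(p, q) for q in phrases)
    }


# Toy corpus, already tokenized (hypothetical data for illustration).
docs = [
    "document clustering groups related documents".split(),
    "document clustering needs good term selection".split(),
    "term selection improves document clustering".split(),
]
msfp = maximal_phrases(frequent_phrases(docs, min_support=2))
print(msfp)
```

On this toy corpus the single terms "document", "clustering", "term", and "selection" are all frequent, but each is absorbed by a longer frequent phrase, so only the ordered two-term phrases survive as features. This is the sense in which such phrases keep term order, which a bag-of-words TFIDF representation discards.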

Keywords/Search Tags:

Data Mining, Document Clustering, Term Selection, Automatic Parameter Setting

Related items

1	Design And Implement Of Web Document Clustering System
2	Research And Application Of Feature Selection Based On Term Frequency Reordering Of Document Level
3	Research Of Clustering Analysis And Its Application In Document Mining
4	Research On Document Clustering Technology Based On Latent Semantic Indexing
5	Search term selection and document clustering for query suggestion
6	Multi-Document Automatic Summarization Based On The Term-Sentences—Document Tri-layer Graph Model
7	Study On Clustering For XML Document Collection
8	Research On Mining Accompanying Behavior Pattern Methods Based On Spatial-temporal Trajectory Data
9	Effective use of term relationships in Web content mining
10	Research On The Application Of Data Mining Technology In Rental Data