Research On Semi-Supervised Hierarchical Co-Clustering For Document

Posted on:2013-08-16

Degree:Master

Type:Thesis

Country:China

Candidate:F F Huang

GTID:2248330371995437

Subject:Computer application technology

Abstract/Summary:

Document clustering has become a fundamental and effective tool for the efficient organization, summarization, navigation and retrieval of massive document collections. In general, the problem of document clustering is described as follows: given a set of documents, group them into clusters such that documents in the same cluster are similar to each other and documents in different clusters are dissimilar. When no prior knowledge is available, document clustering is an unsupervised learning process.

Co-clustering is a clustering method that simultaneously or alternately clusters the rows and columns of an input data matrix. For document clustering under the popular vector space model, documents are represented by a matrix whose rows represent the objects (document vectors) and whose columns represent feature terms (word vectors). Hierarchical co-clustering is used to cluster documents and feature terms simultaneously. Simple hierarchical co-clustering has two limitations: it must process a large number of feature terms and documents, and it ignores the semantic relations between feature terms. As a result, time complexity increases and accuracy may decrease. Semi-supervised clustering uses a small amount of prior knowledge to guide the clustering process, and under this guidance the clustering results improve.

In this thesis, the traditional weight model is first analyzed and evaluated while the documents are collected and preprocessed. An improved weight model is then derived from the traditional one by curve fitting. The proposed model overcomes the shortcomings of the traditional model, and experimental results show that it improves efficiency.

This thesis also proposes a semi-supervised clustering algorithm based on pairwise constraints for feature-term co-clustering. The feature terms are then merged into new feature attributes.
This merging combines similar feature terms, reduces the dimensionality of the vector space model, and reduces errors caused by synonyms. First, pairwise constraint sets are identified in the feature term set. The constraint sets are then expanded using the K-nearest-neighbor set method, and the feature terms are clustered according to the partition of the constraint sets. Finally, the feature terms in the same cluster are combined into a single feature attribute.

Simple hierarchical co-clustering for document clustering ignores the semantic relations between feature terms and the semantic relations between documents. Treating documents and feature terms as independent objects is therefore inadequate. In this thesis, semantic information is used to measure the semantic similarity between documents and the semantic similarity between feature terms. A cooperative matrix is then constructed for hierarchical co-clustering. Experimental results demonstrate that the proposed algorithm improves both efficiency and accuracy.
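
The feature-merging step mentioned in the abstract (feature terms in the same cluster are combined into a single feature attribute) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the function name `merge_feature_terms`, the toy matrix, the must-link pair, and the choice to merge columns by summing their weights are all assumptions made for illustration, and the K-nearest-neighbor expansion of the constraint sets is omitted.

```python
from collections import defaultdict


def merge_feature_terms(matrix, terms, must_link):
    """Merge document-term matrix columns whose terms are must-linked.

    Hypothetical helper: terms connected by must-link constraints are
    treated as one cluster (via union-find), and their columns are summed.
    Summation is an assumed merge rule, not taken from the thesis.
    """
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent pointers to the cluster root, halving paths as we go.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in must_link:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Group column indices by cluster root, in first-seen column order.
    groups = defaultdict(list)
    for j, t in enumerate(terms):
        groups[find(t)].append(j)

    merged_terms = ["+".join(terms[j] for j in cols) for cols in groups.values()]
    merged_matrix = [
        [sum(row[j] for j in cols) for cols in groups.values()]
        for row in matrix
    ]
    return merged_matrix, merged_terms


# Toy data (assumed): 3 documents, 4 feature terms, raw term counts.
terms = ["car", "automobile", "fruit", "apple"]
matrix = [
    [2, 1, 0, 0],
    [0, 3, 0, 1],
    [0, 0, 4, 2],
]
must_link = [("car", "automobile")]  # assumed synonym constraint

merged_matrix, merged_terms = merge_feature_terms(matrix, terms, must_link)
print(merged_terms)
print(merged_matrix)
```

In this toy case the two synonymous columns collapse into one, so the matrix shrinks from four feature columns to three, which is the kind of dimensionality reduction the abstract describes.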

Keywords/Search Tags:

Document Clustering, Hierarchical Clustering, Co-clustering, Semi-supervised Clustering, Semantic Information

Related items

1	Semantic Hierarchical Clustering Based Multi-document Summarization Research
2	Research On Deep Text Clustering Method Based On Semantic Information Enhancemen
3	A Document Clustering Method Based On Affinity Propagation And Agglomerative Hierarchical Clustering
4	Research On Efficient Document Clustering Using Improvised Sub-Document Based Framework
5	Distributed Clustering And Evolutionary Clustering Algorithm Based On Semi-supervised Learning
6	New Non-hierarchical Clustering Objetives And The Algorithms To Optimal Clustering
7	Research And Implementation Of Web Document Clustering Algorithm Based On Semantic Gravitation And Density Distribution
8	Incorporating semantic and syntactic information into document representation for document clustering
9	Research Of The XML Document Clustering Using GA
10	Research On Hybrid Algorithm Based On Subtractive Clustering