Research On The Statistical Properties Of Chinese Language Networks And A Semi-supervised Document Clustering Algorithm

Posted on:2009-05-14

Degree:Master

Type:Thesis

Country:China

Candidate:G B Hu

GTID:2208360272959839

Subject:Computer software and theory

Abstract/Summary:

With the exponential growth of electronic documents on the web, there is an urgent need to process these documents automatically by computer, including automatic classification, clustering and summarization. In this thesis, we focus on document clustering. Document clustering consists of three parts: document representation, the clustering algorithm and performance evaluation, of which document representation and the clustering algorithm are the most important.

The Vector Space Model (VSM) dominates document representation. Because documents are semi-structured, the drawbacks of VSM are easy to see. Recently, complex networks, as a tool for studying complex systems, have attracted a lot of attention from researchers. Documents, as the written language of human beings, can also be regarded as complex systems. We study the statistical properties of the Chinese language network within the framework of complex networks, in order to shed new light on document representation. Based on one of the largest Chinese corpora, the People's Daily Corpus, we construct two networks (CLN1 and CLN2) from two different perspectives, with Chinese words as nodes. In CLN1, a link between two nodes exists if the two words appear next to each other in a sentence; in CLN2, a link exists if the two words appear together in the same sentence. We show that both networks exhibit the small-world effect, scale-free structure, hierarchical organization and disassortative mixing. We hope these results can provide new clues for document representation.

As for the clustering method, we focus on semi-supervised document clustering. In real applications, limited knowledge about the cluster membership of a small number of documents is often available, such as pairs of documents known to belong to the same cluster. This kind of prior knowledge can serve as constraints on the clustering process. We integrate the constraints into the trace formulation of the sum-of-squared-Euclidean-distances criterion function of K-means. The combined criterion function is then transformed into a trace maximization problem, which is optimized by eigen-decomposition. Our experimental evaluation shows that the proposed semi-supervised clustering method achieves better performance than several competitive existing methods.
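
As a rough illustration of the two network constructions described in the abstract, the sketch below builds toy versions of CLN1 and CLN2 in Python. It assumes the sentences are already word-segmented. The function names (`build_cln1`, `build_cln2`, `mean_degree`) and the two sample sentences are hypothetical; they are not taken from the thesis or from the People's Daily Corpus.

```python
from collections import defaultdict
from itertools import combinations


def build_cln1(sentences):
    """Adjacency network: link two words if they appear next to each other."""
    graph = defaultdict(set)
    for words in sentences:
        for left, right in zip(words, words[1:]):
            if left != right:  # skip self-loops from repeated words
                graph[left].add(right)
                graph[right].add(left)
    return graph


def build_cln2(sentences):
    """Co-occurrence network: link two words if they share a sentence."""
    graph = defaultdict(set)
    for words in sentences:
        # set() removes repeated words, so no self-loops are created
        for a, b in combinations(set(words), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def mean_degree(graph):
    """Average number of neighbours per node (nodes without links are not stored)."""
    return sum(len(neighbours) for neighbours in graph.values()) / len(graph)


# Hypothetical, pre-segmented toy sentences (not from the actual corpus).
sentences = [
    ["我", "喜欢", "读", "书"],
    ["我", "读", "报纸"],
]

cln1 = build_cln1(sentences)
cln2 = build_cln2(sentences)

print(sorted(cln1["我"]))
print(sorted(cln2["我"]))
print(mean_degree(cln1), mean_degree(cln2))
```

On this toy input, 我 and 书 are linked in CLN2 but not in CLN1, because they share a sentence without being adjacent. Every CLN1 link is also a CLN2 link, so CLN2 is at least as dense as CLN1 when both are built from the same sentences.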
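
The trace formulation mentioned in the abstract can be sketched as follows. This is one common way to write such a criterion, given here only as an illustration under assumed notation; the abstract does not state the exact form used in the thesis, and the symbols X, H, W and λ are all assumptions.

```latex
% Assumed notation (not taken from the thesis):
%   X \in R^{d \times n} : document vectors x_1, ..., x_n as columns
%   C_c, n_c, m_c        : cluster c, its size, and its centroid
%   H \in R^{n \times k} : normalized cluster indicator,
%                          H_{ic} = 1/\sqrt{n_c} if x_i \in C_c, else 0
%   W \in R^{n \times n} : symmetric constraint matrix,
%                          W_{ij} > 0 if documents i and j must share a cluster
%   \lambda \ge 0        : weight of the constraint term
\begin{align}
J_{\text{K-means}}
  &= \sum_{c=1}^{k} \sum_{x_i \in C_c} \lVert x_i - m_c \rVert^2
   = \operatorname{tr}(X^\top X) - \operatorname{tr}(H^\top X^\top X H), \\
J_{\text{combined}}
  &= \operatorname{tr}(X^\top X) - \operatorname{tr}(H^\top X^\top X H)
     - \lambda \operatorname{tr}(H^\top W H), \\
\min_{H} J_{\text{combined}}
  &\iff \max_{H^\top H = I_k}
     \operatorname{tr}\!\left( H^\top \left( X^\top X + \lambda W \right) H \right).
\end{align}
```

Since tr(X^T X) does not depend on H, minimizing the combined criterion is equivalent to the trace maximization in the last line. If H is relaxed to any matrix with orthonormal columns, the maximum is attained by the k eigenvectors of X^T X + λW with the largest eigenvalues, which is where eigen-decomposition enters. A discrete cluster assignment is then typically recovered from the rows of H, for example by running K-means on them.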

Keywords/Search Tags:

complex network, language network, text mining, clustering, semi-supervised learning, algorithm

Related items

1	Research On Text Clustering Based On Semi-supervised Learning
2	Research On Network Uncivilized Text Classification Methods Based On Semi-supervised Learning Models
3	Study Of Density-based Semi-supervised Clustering Algorithm On Complex Network
4	Semi-supervised Learning On Text Data
5	Research On Semi-supervised Clustering And Classification Algorithm
6	A Novel Labels And Similarity Reconstruction Based On K-means Algorithm Application On Text Clustering
7	Research On Clustering Algorithm For Complex Network
8	Research On Unbalanced Text Classification Based On Text Augmentation And Semi-Supervised Learning
9	Research On Identification Method Of Uncivilized Weibo Post Based On Semi-Supervised Learning Model
10	The Research Of Semi-supervised Learning Based On Boosting