Research on the Statistical Properties of the Chinese Language Network and a Semi-Supervised Document Clustering Algorithm

Posted on: 2009-05-14  Degree: Master  Type: Thesis
Country: China  Candidate: G B Hu  Full Text: PDF
GTID: 2208360272959839  Subject: Computer software and theory
Abstract/Summary:
With the exponential growth of electronic documents on the web, there is an urgent need to process these documents automatically by computer, including automatic classification, clustering and summarization. In this paper, we focus on document clustering. Document clustering consists of three parts: document representation, the clustering algorithm and performance evaluation, of which document representation and the clustering algorithm are the most important.

The Vector Space Model (VSM) dominates document representation. Because documents are semi-structured, the drawbacks of VSM are easy to see. Recently, complex networks, as a tool for studying complex systems, have attracted a lot of attention from researchers. Documents, as the written language of human beings, can also be regarded as complex systems. We study the statistical properties of the Chinese language network within the framework of complex networks in order to shed new light on document representation. Based on one of the largest Chinese corpora, the People's Daily Corpus, we construct two networks (CLN1 and CLN2) from two different perspectives, with Chinese words as nodes. In CLN1, a link between two nodes exists if they appear next to each other in a sentence; in CLN2, a link indicates that two nodes appear together in the same sentence. We show that both networks exhibit the small-world effect, scale-free structure, hierarchical organization and disassortative mixing. We hope these results can provide new clues for document representation.

As for the clustering method, we focus on semi-supervised document clustering. In real applications, limited knowledge about the cluster membership of a small number of documents is often available, such as pairs of documents known to belong to the same cluster. This kind of prior knowledge can serve as constraints on the clustering process.
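The two network constructions described above can be illustrated with a minimal sketch. It assumes the corpus has already been segmented into sentences, each given as a list of words; the function name `build_language_networks`, the adjacency-set representation and the placeholder tokens are illustrative assumptions, not taken from the thesis.

```python
from itertools import combinations


def build_language_networks(sentences):
    """Return (cln1, cln2) as undirected graphs stored as dicts of neighbor sets.

    CLN1 links words that appear next to each other in a sentence.
    CLN2 links words that appear together anywhere in the same sentence.
    """
    cln1, cln2 = {}, {}

    def add_edge(graph, u, v):
        if u == v:
            return  # assumption: self-loops are ignored
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)

    for words in sentences:
        # CLN1: consecutive word pairs
        for u, v in zip(words, words[1:]):
            add_edge(cln1, u, v)
        # CLN2: every pair of distinct words in the sentence
        for u, v in combinations(set(words), 2):
            add_edge(cln2, u, v)
    return cln1, cln2


# Placeholder tokens standing in for segmented Chinese words.
sentences = [["w1", "w2", "w3"], ["w3", "w4", "w5"]]
cln1, cln2 = build_language_networks(sentences)
degree_cln1 = {word: len(neighbors) for word, neighbors in cln1.items()}
```

Node degrees such as `degree_cln1` are the starting point for the statistics mentioned above (for example, the degree distribution used to check for scale-free structure); CLN2 is always at least as dense as CLN1 because adjacent words also share a sentence.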
We integrate the constraints into the trace formulation of the K-means criterion function, the sum of squared Euclidean distances. The combined criterion function is then transformed into a trace maximization problem, which is solved by eigen-decomposition. Our experimental evaluation shows that the proposed semi-supervised clustering method achieves better performance than several competitive existing methods.
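As a rough illustration of this idea, the sketch below uses the standard identity that minimizing the K-means sum of squared distances is equivalent to maximizing trace(H^T K H), where K is the Gram matrix of the documents and H is a normalized cluster indicator matrix with orthonormal columns. The abstract does not say how the constraints enter the criterion, so the additive constraint matrix, the `weight` parameter and the function names here are assumptions made for illustration, not the thesis's actual formulation.

```python
import numpy as np


def build_constraint_matrix(n, must_link=(), cannot_link=(), weight=1.0):
    """Symmetric matrix rewarding must-link pairs and penalizing cannot-link pairs."""
    W = np.zeros((n, n))
    for i, j in must_link:
        W[i, j] = W[j, i] = weight
    for i, j in cannot_link:
        W[i, j] = W[j, i] = -weight
    return W


def constrained_spectral_embedding(X, k, must_link=(), cannot_link=(), weight=1.0):
    """Relaxed maximizer of trace(H^T (K + W) H) subject to H^T H = I.

    X holds one document vector per row. Returns the n-by-k matrix H of the
    top-k eigenvectors and the value of the relaxed objective.
    """
    X = np.asarray(X, dtype=float)
    K = X @ X.T  # Gram matrix of the documents
    W = build_constraint_matrix(len(X), must_link, cannot_link, weight)
    eigvals, eigvecs = np.linalg.eigh(K + W)  # eigenvalues in ascending order
    H = eigvecs[:, -k:]  # eigenvectors of the k largest eigenvalues
    return H, float(eigvals[-k:].sum())


# Toy document vectors (illustrative values only).
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
H, relaxed_value = constrained_spectral_embedding(
    X, k=2, must_link=[(0, 1)], cannot_link=[(1, 2)]
)
```

The continuous matrix `H` is only a relaxation: a final step, such as running K-means on its rows, would be needed to recover discrete cluster labels, and that step is likewise an assumption rather than something stated in the abstract.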
Keywords/Search Tags: complex network, language network, text mining, clustering, semi-supervised learning, algorithm