A Chinese Text Clustering Without Dictionary Based On The Improved Fuzzy C-Means Algorithm

Posted on: 2008-11-12
Degree: Master
Type: Thesis
Country: China
Candidate: X L Zheng
Full Text: PDF
GTID: 2178360212984980
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of the network, the volume of available information has grown to the point where it is hard to estimate. This information takes many forms, the most common of which is text. A popular way to process text is to first cluster or classify it by computer and then work with the grouped results.

This thesis focuses on Chinese text clustering: how to obtain a good clustering of a text set through word extraction, feature selection, and fuzzy clustering.

The most common approach treats each text as a vector in a high-dimensional space, where each coordinate is the frequency of a word in the text. Unlike text in many other languages, Chinese text needs an additional step to segment it into words. After studying several statistics-based word segmentation algorithms, this thesis proposes a new dictionary-free word segmentation method that is fast and accurate. Wrongly segmented words are eliminated according to the confidence of each word, which yields a feature vector for the text.

After studying several clustering algorithms, we focus on the fuzzy c-means (FCM) algorithm. By examining the mathematical theory and procedure of FCM, we identify its weaknesses. To overcome them, we add new elements to FCM, including a cluster validity function and a partially supervised clustering algorithm, and propose an improved FCM algorithm. Using mathematical tools, we derive a new membership function and, based on it, a new clustering algorithm named PSFCM. Its main advantages are that it adapts automatically to the number of clusters and is strongly robust. We validate PSFCM on the Iris data set and confirm its behavior.

Finally, we cluster the texts with PSFCM and analyze the results, which show that it is a good algorithm for text clustering.
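For readers unfamiliar with the baseline that the thesis improves on, the following is a minimal sketch of the standard FCM algorithm only. It is not the PSFCM algorithm, whose membership function and validity function are not given in this abstract. The function name `fuzzy_c_means`, the parameter defaults, and the toy frequency vectors are assumptions made for illustration.

```python
import random


def fuzzy_c_means(points, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Standard FCM (not the thesis's PSFCM). Returns (centers, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(max_iter):
        # Center update: mean of the points weighted by u_ik ** m.
        centers = []
        for k in range(c):
            weights = [u[i][k] ** m for i in range(n)]
            total = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ])

        # Membership update from squared Euclidean distances.
        new_u = []
        for p in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(p, ctr)) for ctr in centers]
            if min(d2) == 0.0:
                # Point coincides with a center: give it full membership there.
                hit = d2.index(0.0)
                row = [1.0 if k == hit else 0.0 for k in range(c)]
            else:
                inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
                total = sum(inv)
                row = [v / total for v in inv]
            new_u.append(row)

        # Stop when no membership value moved by more than tol.
        change = max(abs(a - b) for ra, rb in zip(u, new_u) for a, b in zip(ra, rb))
        u = new_u
        if change < tol:
            break
    return centers, u


# Hypothetical word-frequency vectors for six short texts (two word features).
docs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 1.0]]
centers, memberships = fuzzy_c_means(docs, c=2)
for doc, row in zip(docs, memberships):
    print(doc, [round(v, 3) for v in row])
```

Note that the number of clusters `c` must be supplied by the caller here. That is the limitation the abstract says PSFCM addresses by adapting to the cluster number.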
Keywords/Search Tags: word segmentation based on statistics, partial supervision fuzzy c-means (PSFCM) algorithm, text clustering