Uyghur Text Clustering System Design And Implementation Based On Python

Posted on: 2013-11-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y S A N W Mi
Full Text: PDF
GTID: 2248330374966859
Subject: Computer technology
Abstract/Summary:
Data traffic is growing with the rapid development of the Internet. Fast and efficient access to, management of, and use of these data have become an important task in data mining research. Text clustering, as an effective tool for managing and organizing text, has received increasing attention. Text clustering techniques can address these problems to a considerable extent, saving time and improving efficiency.

First, this thesis builds a relatively large text database based on the characteristics of Uyghur and uses it to construct a preliminary stop word list. To reduce the dimension of the feature space, it applies stem extraction. Experimental results show that stem extraction reduces the dimension of the original feature space by 23%-25%.

Second, this thesis studies the advantages and disadvantages of the K-means and GAAC (group-average agglomerative clustering) algorithms. It proposes an improved K-means algorithm that addresses both the instability of classic K-means, which stems from its over-reliance on the initial cluster centers, and the shortcomings of GAAC, such as its time complexity. Experimental results show that the improved K-means algorithm presented here is feasible and effective.

Finally, a Uyghur text clustering system was developed in Python by applying these algorithms. The system comprises three main modules: a preprocessing module, a text representation module, and a clustering algorithm module. Comparative experiments conducted with the system verify the accuracy, stability, and low time complexity of the improved K-means algorithm. The clustering results show that the system runs stably.
Keywords/Search Tags: Uyghur text, Clustering, Stem extraction, VSM, Improved K-means algorithm
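
The abstract does not say how the improved K-means algorithm chooses its initial centers, so the following is only a minimal sketch of one well-known way to combine the two algorithms it names (Buckshot-style seeding): run GAAC on a small random sample to obtain k starting centers, then run K-means with cosine similarity over the full collection. It also walks through the three modules described above (preprocessing, VSM text representation, clustering). All function names, the TF-IDF weighting, the sample size, and the toy English tokens are assumptions made for illustration; a real Uyghur stemmer would be passed in through the hypothetical `stem` parameter.

```python
import math
import random
from collections import Counter


def normalise(v):
    """Scale a sparse vector (dict) to unit length; an empty vector stays empty."""
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()} if norm else {}


def add(a, b):
    """Sum of two sparse vectors."""
    out = dict(a)
    for t, w in b.items():
        out[t] = out.get(t, 0.0) + w
    return out


def dot(a, b):
    """Dot product of two sparse vectors (cosine similarity for unit vectors)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def tfidf_vectors(docs, stop_words=frozenset(), stem=lambda t: t):
    """Preprocessing + VSM: drop stop words, stem, build unit-length TF-IDF vectors.

    `stem` is a placeholder (identity by default), not the thesis's stemmer.
    """
    docs = [[stem(t) for t in doc if t not in stop_words] for doc in docs]
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [normalise({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                       for t, c in Counter(doc).items()})
            for doc in docs]


def gaac_centers(vectors, k):
    """Naive group-average agglomerative clustering down to k clusters.

    For unit vectors, the average pairwise similarity inside a merged cluster
    with sum vector s and size n is (s.s - n) / (n * (n - 1)).
    """
    clusters = [(v, 1) for v in vectors]  # (sum vector, cluster size)
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = add(clusters[i][0], clusters[j][0])
                n = clusters[i][1] + clusters[j][1]
                sim = (dot(s, s) - n) / (n * (n - 1))
                if best is None or sim > best[0]:
                    best = (sim, i, j, s, n)
        _, i, j, s, n = best
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((s, n))
    return [normalise(s) for s, _ in clusters]


def kmeans(vectors, centers, max_iter=100):
    """K-means with cosine similarity, starting from the given centers."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(len(centers)), key=lambda c: dot(v, centers[c]))
                      for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(len(centers)):
            total = {}
            for v, label in zip(vectors, labels):
                if label == c:
                    total = add(total, v)
            if total:  # keep the old center if a cluster became empty
                centers[c] = normalise(total)
    return labels


def cluster(docs, k, stop_words=frozenset(), seed=0):
    """Seed K-means with GAAC run on a random sample of about sqrt(k * n) documents."""
    vectors = tfidf_vectors(docs, stop_words)
    n = len(vectors)
    size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    sample = random.Random(seed).sample(vectors, size)
    return kmeans(vectors, gaac_centers(sample, k))


# Toy, already-tokenised documents (English placeholders, not Uyghur data).
docs = [
    ["the", "apple", "banana", "fruit"],
    ["the", "banana", "fruit", "juice"],
    ["the", "apple", "fruit", "juice"],
    ["the", "car", "engine", "wheel"],
    ["the", "engine", "wheel", "road"],
    ["the", "car", "road", "wheel"],
]
labels = cluster(docs, k=2, stop_words={"the"})
print(labels)
```

Running GAAC only on the sample keeps its quadratic pairwise comparisons cheap, while the seeds it produces make the K-means result less dependent on a random choice of starting centers. Whether the thesis uses this exact combination is not stated in the abstract.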