Clustering Algorithms Research For Uyghur Text

Posted on: 2014-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: X M Di
GTID: 2248330398967119
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, people now live in an era of information explosion. Vast amounts of semi-structured and unstructured information are available, and how to mine useful information from it quickly and efficiently is a problem that many scholars are working on. Text document clustering is a method of automatic classification that does not require training. However, because languages differ and feature selection or use is often unreasonable, the performance and accuracy of clustering algorithms are not high. This paper proposes improved methods that address these specific problems of text clustering.

To address the irregularity, repetition, and redundancy of information that arise when selecting phrases, an improved suffix tree clustering (STC) method is proposed. First, a phrase mutual information algorithm is put forward to choose phrases that conform to Uyghur grammar. Second, to reduce repeated phrases, a phrase reduction algorithm based on Uyghur grammar is proposed.

To address the high dimensionality and sparsity of the vector space model, and its loss of the relations between words, this paper proposes using word sets to reduce the dimensionality and increase information density. First, following the rules of Uyghur, word and term relationships are processed with latent semantic analysis (LSA) and word sets are created. Second, texts are represented using the word sets.

Experiments show that the improved suffix tree selects phrases effectively and improves the clustering results, and that TWCS outperforms the other text clustering methods. TWCS achieves an accuracy of 94.29% and a recall of 92.48%, and it also reduces dimensionality and increases information consistency. This shows that the proposed methods effectively improve text clustering.
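
The abstract does not spell out the phrase mutual information algorithm. As a rough illustration of the general idea only, the sketch below scores adjacent word pairs with standard pointwise mutual information (PMI), so that pairs co-occurring more often than chance rank higher as candidate phrases. The function name `bigram_pmi`, the restriction to two-word phrases, and the absence of any Uyghur grammar filter are assumptions made for this example, not details from the thesis.

```python
import math
from collections import Counter


def bigram_pmi(tokenized_docs):
    """Return {(w1, w2): PMI} for adjacent word pairs.

    Hypothetical helper: plain pointwise mutual information,
    PMI(x, y) = log2(p(x, y) / (p(x) * p(y))),
    with probabilities estimated from raw counts. The thesis's own
    algorithm and its grammar rules are not reproduced here.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in tokenized_docs:
        unigrams.update(tokens)
        # Adjacent pairs within one document; pairs never span documents.
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Placeholder tokens stand in for segmented Uyghur words.
docs = [["a", "b", "c"], ["a", "b", "d"]]
ranked = sorted(bigram_pmi(docs).items(), key=lambda kv: kv[1], reverse=True)
```

In a pipeline like the one described, the highest-scoring pairs would be the candidates passed on to grammar-based filtering and phrase reduction; a minimum count threshold is usually needed as well, because PMI overrates rare pairs.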
Keywords/Search Tags: Uyghur, Suffix Tree (ST), Mutual Information (MI)