Clustering Algorithms Research For Uyghur Text

Posted on:2014-01-06

Degree:Master

Type:Thesis

Country:China

Candidate:X M Di

GTID:2248330398967119

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the Internet, people now live in an era of information explosion. Vast amounts of semi-structured and unstructured information are available, and how to mine useful information from it quickly and efficiently is a problem that many scholars are working on. Text document clustering is a method of automatic classification that does not require training. However, because of differences between languages and unreasonable feature selection or use, the performance and accuracy of clustering algorithms are not high. This paper proposes improved methods that address specific problems in text clustering.

To address the irregular, repeated, and redundant information that arises when selecting phrases, an improved suffix tree clustering (STC) method is proposed. First, a phrase mutual information algorithm is put forward to choose phrases that abide by Uyghur grammar. Second, to reduce repeated phrases, a phrase reduction algorithm based on Uyghur grammar is proposed.

To address the high dimensionality and sparsity of the vector space model, and its loss of the relations between words, this paper proposes using word sets to reduce the dimensionality and increase information density. First, following the rules of Uyghur, the relationships between words and terms are processed with LSA to create word sets. Second, texts are represented using these word sets.

Experiments show that the improved suffix tree selects phrases effectively and improves the clustering results, and that TWCS performs better than the other text clustering methods. TWCS achieves an accuracy of 94.29% and a recall of 92.48%, and it also reduces dimensionality and increases information consistency. This shows that the proposed methods effectively improve text clustering.
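
The abstract does not spell out the phrase mutual information algorithm, so the following is only a minimal sketch of the general idea, assuming standard pointwise mutual information (PMI) over adjacent word pairs. The function names `bigram_pmi` and `select_phrases`, the threshold value, and the placeholder tokens are all hypothetical; the Uyghur grammar rules used in the thesis are not modeled here.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair.

    Assumption: plain pointwise mutual information,
    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ),
    estimated from raw counts. This is not the thesis's exact formula.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def select_phrases(tokens, threshold=1.0):
    """Keep word pairs whose PMI meets a (hypothetical) threshold."""
    scores = bigram_pmi(tokens)
    return sorted(pair for pair, score in scores.items() if score >= threshold)


if __name__ == "__main__":
    # Placeholder tokens, not real Uyghur data.
    tokens = "a b a b c d".split()
    for pair in select_phrases(tokens):
        print(pair)
```

Pairs that co-occur more often than their individual frequencies would predict score high and are kept as candidate phrases. A grammar-based filter, such as the one the abstract describes, would then be applied on top of this purely statistical score.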

Keywords/Search Tags:

Uyghur, Suffix Tree (ST), Mutual Information (MI)

Related items

1	Design And Implementation Of Suffix Tree Based Uyghur Web Page Clustering Algorithm
2	Automatic Extraction Of Uyghur Ontology Concept Classification Relationship Based On Seed Bootstrap
3	Research On Topic Detection Technology Of Uyghur News
4	Finding MUMs With Enhanced Suffix Arrays
5	Research On Construction Of Index Structure For Biological Sequences
6	Research Of A Suffix Tree Based Automatic Wrapper Generation Method
7	Multi-pattern Matching With Wildcards Based On Suffix Tree And Suffix Array
8	Extraction Research Of Uyghur Domain Term
9	An Algorithm Based On Suffix Tree For Identification Of Repeats In DNA Sequence
10	The Application Of Suffix Array In Uyghur, Kazak, Kyrgyz Search Engine