Research On Uyghur Discriminative Keyword Extraction Algorithm And Its Performance Analysis

Posted on: 2014-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: L H M E M M T J Zu
GTID: 2248330398467301
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, the amount of information we have to deal with is large and grows every day. How to discover the information that we are really interested in and need has become an urgent problem. Data mining techniques are becoming prominent due to their success in information retrieval and discovery. Text mining is an important research area of data mining, and keyword extraction is an important task of text mining. The goal of keyword extraction is to discover the most representative words for a cluster of documents by text processing, which is highly important for applications such as natural language processing, document summarization, classification, clustering and information retrieval.

Most current research focuses on extracting representative keywords. In this thesis, we study discriminative keyword extraction, which seeks keywords that are the most discriminative for document classes instead of the most representative of document contents. This research is particularly valuable for sparse document classification.

The study began by preparing 1000 text documents downloaded from the Internet: 500 documents related to sanitation and medical care, and 500 documents related to education, computers, military, real estate, history, geography, and other topics.

The first approach we studied is based on the multiple-document TextRank algorithm. In our experiments, this approach delivered a classification accuracy of 80% with 100 keywords. The second approach we studied is based on discriminative statistics derived from the term frequency/inverse document frequency (TF/IDF). We studied 10 types of TF/IDF statistics: the DF divergence, the absolute DF divergence, the TF divergence, the absolute TF divergence, the TF*DF divergence, the absolute TF*DF divergence, the TF*IDF divergence, the absolute TF*IDF divergence, the TF*DF*IDF divergence and the absolute TF*DF*IDF divergence. The experimental results show that the TF/IDF approach leads to more significant discriminative capability, resulting in a classification accuracy of 98% with 100 keywords.

As for development kits and programming languages, we used the open-source TextRank and LIBSVM software platforms, and implemented the discriminative keyword extraction system in Perl and Python. Finally, we analyzed the system's operation results.
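
The abstract does not define the divergence statistics, so the following Python sketch is only an illustration of one plausible reading of the DF divergence and the absolute DF divergence. It assumes that the divergence of a term is the difference between its relative document frequency in the two classes, and that the documents are already tokenized. The function names (doc_freq, df_divergence, top_keywords) and the toy English documents are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def doc_freq(docs):
    """Return, for each term, the fraction of documents that contain it."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))  # count each term once per document
    n = len(docs)
    return {term: c / n for term, c in counts.items()}


def df_divergence(pos_docs, neg_docs):
    """Assumed definition: DF in the positive class minus DF in the negative class."""
    df_pos = doc_freq(pos_docs)
    df_neg = doc_freq(neg_docs)
    vocab = set(df_pos) | set(df_neg)
    return {t: df_pos.get(t, 0.0) - df_neg.get(t, 0.0) for t in vocab}


def top_keywords(scores, k, absolute=False):
    """Pick the k highest-scoring terms; ties are broken alphabetically.

    With absolute=True, terms typical of either class can be selected
    (the assumed meaning of the "absolute" variants).
    """
    def value(term):
        return abs(scores[term]) if absolute else scores[term]

    return sorted(scores, key=lambda t: (-value(t), t))[:k]


# Toy, pre-tokenized documents (hypothetical data for illustration only).
medical_docs = [["doctor", "hospital", "the"], ["hospital", "nurse", "the"]]
other_docs = [["school", "teacher", "the"], ["computer", "school", "the"]]

scores = df_divergence(medical_docs, other_docs)
signed_top = top_keywords(scores, 2)
absolute_top = top_keywords(scores, 2, absolute=True)
print(signed_top)
print(absolute_top)
```

In this toy example a word such as "the", which appears in every document of both classes, scores 0 and is never selected, while words concentrated in one class rank highest. Under the same assumption, the TF, TF*DF, TF*IDF and TF*DF*IDF variants would replace the per-class document frequency with the corresponding per-class statistic.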
Keywords/Search Tags: Discriminative keywords, Uighur, TextRank, TF/IDF divergence