Research On Uyghur Discriminative Keyword Extraction Algorithm And Its Performance Analysis

Posted on:2014-02-20

Degree:Master

Type:Thesis

Country:China

Candidate:L H M E M M T J Zu

Full Text:PDF

GTID:2248330398467301

Subject:Computer technology

Abstract/Summary:

With the rapid development of the Internet, the amount of information we have to deal with is large and grows every day. How to find the information we are really interested in and need has become an urgent problem. Data mining has become prominent due to its success in information retrieval and discovery. Text mining is an important research area of data mining, and keyword extraction is an important task within text mining. The goal of keyword extraction is to discover, through text processing, the most representative words for a cluster of documents, which is highly important for applications such as natural language processing, document summarization, classification, clustering and information retrieval.

Most current research focuses on extracting representative keywords. In this thesis, we study discriminative keyword extraction, which seeks the keywords that are most discriminative for document classes rather than most representative of document contents. This research is particularly valuable for sparse document classification.

The study began by preparing 1000 text documents downloaded from the Internet: 500 documents related to sanitation and medical care, and 500 documents related to education, computers, military, real estate, history, geography, and other topics.

The first approach we studied is based on the multi-document TextRank algorithm. In our experiments, this approach delivered a classification accuracy of 80% with 100 keywords. The second approach we studied is based on discriminative statistics derived from the term frequency/inverse document frequency (TF/IDF). We studied 10 types of TF/IDF statistics: the DF divergence, the absolute DF divergence, the TF divergence, the absolute TF divergence, the TF*DF divergence, the absolute TF*DF divergence, the TF*IDF divergence, the absolute TF*IDF divergence, the TF*DF*IDF divergence and the absolute TF*DF*IDF divergence. The experimental results show that the TF/IDF approach leads to significantly greater discriminative capability, resulting in a classification accuracy of 98% with 100 keywords.

As for development kits and programming languages, we used the open-source TextRank and LIBSVM software, and implemented the discriminative keyword extraction system in Perl and Python. Finally, we analyzed the results produced by the system.
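The abstract does not give the formulas for the ten divergence statistics, so the following is only a minimal sketch of one possible reading of the DF divergence and the absolute DF divergence: the difference (or absolute difference) between a term's document frequency in the target class and in the other class. The function names, the normalization by class size, and the assumption that documents arrive already tokenized are all illustrative choices, not the thesis's actual implementation.

```python
from collections import Counter


def document_frequency(docs):
    """Fraction of documents in which each term occurs at least once."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))  # count each term once per document
    n = len(docs)
    return {term: c / n for term, c in counts.items()}


def df_divergence(pos_docs, neg_docs, absolute=False):
    """Assumed definition: DF in the target class minus DF in the other class.

    With absolute=True, terms typical of either class score highly.
    """
    df_pos = document_frequency(pos_docs)
    df_neg = document_frequency(neg_docs)
    scores = {}
    for term in set(df_pos) | set(df_neg):
        diff = df_pos.get(term, 0.0) - df_neg.get(term, 0.0)
        scores[term] = abs(diff) if absolute else diff
    return scores


def top_keywords(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Under this reading, the selected keywords would then serve as the feature vocabulary for a classifier such as an SVM. The signed variant favors terms typical of the target class only, while the absolute variant also keeps terms typical of the other class.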

Keywords/Search Tags:

Discriminative keywords, Uighur, TextRank, TF/IDF divergence

Related items

1	Study On Extraction Of Uygur Keywords In Public Opinion Analysis
2	Research On Chinese Text Summarization Method Based On Improved TextRank
3	Keywords Extraction Based On Word2Vec And TextRank
4	The Improvement Of Textrank And Its Application In Full Text Retrieval In Politics And Law Texts
5	Research On American Think Tank Text From The Perspective Of Keywords
6	The Design And Development Of Textrank And Log-Likelihood Based Chrome Chinese Keyword Cloud Extension
7	Research On Content-based Image Retrieval Technology
8	Research On Automatic Annotation For Chinese Text And Its Application
9	Uighur Handwriting Identification Based On Statistical Feature
10	Method Of Topic Sentence Extraction That Combined With LDA And TextRank