
Research On Keyword Extraction Algorithm For Chinese Texts And Cluster Center Point Selection Algorithm In Text Clustering

Posted on: 2017-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
GTID: 2308330509952547
Subject: Software engineering
Abstract/Summary:
Searching for needed text information quickly and accurately is one of the most active research hotspots in text processing. Text clustering can improve the efficiency of information search and is an effective method for text retrieval. Keyword extraction and cluster center point selection are key problems in text clustering.
Common keyword extraction algorithms fall into three categories: those based on semantics, those based on machine learning, and those based on statistical models. Semantics-based algorithms improve accuracy but depend on a background knowledge base and dictionary, so they cannot extract words or phrases that are not included in the knowledge base. Machine-learning-based algorithms also improve accuracy, but collecting training samples and constructing the model take a long time. Statistical-model-based algorithms need no training samples, do not depend on a knowledge base, and rest on a simple principle.
Cluster center points are commonly selected in three ways: randomly, manually, or according to the similarity between points. Randomly selected initial center points may include "isolated points" and lead to locally optimal clustering results. Manually selected center points are subjective and unsuitable for large numbers of texts, because different people understand the same text differently. Center points selected according to the similarity between points are distributed across the classes and lie as close as possible to the center of each class, but computing them takes a long time.
In view of the above problems, the research in this paper is as follows:
(1) This paper presents a novel keyword extraction algorithm for Chinese texts based on the length and frequency of words or phrases. The algorithm first extracts the words or phrases that occur with high frequency in a paragraph, then calculates a weight for each of them from its frequency and length, and finally selects the keywords according to their weights. Compared with existing algorithms, it does not depend on a background knowledge base or dictionary, can extract transliterated words and new Internet words, and does not need statistical parameters obtained by training on samples and constructing a model. Experimental results show that this keyword extraction achieves higher accuracy and that the extracted keywords reflect the theme of the text.
(2) This paper presents a cluster center point selection algorithm based on the similarity between texts. The algorithm first constructs a vector space model from a set of texts and the keyword sequence of each text, then calculates the similarity between each text and every other text, and finally selects cluster center points according to these similarities. Compared with existing algorithms, a cluster center point selected by this algorithm is similar to more texts with larger similarity values, while the similarity between any two cluster center points is lower. Experimental results show that the cluster center points selected by this algorithm are distributed across the classes and lie close to the center of each class.
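As an illustration of the length-and-frequency idea in contribution (1), the following is a minimal Python sketch and not the thesis's actual algorithm. The abstract does not give the weighting formula, so everything specific here is an assumption: the function name `extract_keywords`, the candidate generation by character n-grams (which needs no dictionary), the thresholds `min_freq` and `max_len`, the rule that drops fragments of longer candidates, and the weight defined as frequency times length.

```python
import re
from collections import Counter

# Runs of CJK Unified Ideographs; punctuation and Latin text split the runs.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def extract_keywords(paragraph, min_freq=2, max_len=6, top_k=5):
    """Rank dictionary-free keyword candidates by frequency and length.

    Hypothetical sketch: the weight (frequency * length) and all
    thresholds are assumptions, not values taken from the thesis.
    """
    # 1. Count every character n-gram (2..max_len) inside each CJK run.
    counts = Counter()
    for run in CJK_RUN.findall(paragraph):
        for n in range(2, max_len + 1):
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1

    # 2. Keep only the high-frequency candidates.
    frequent = {s: c for s, c in counts.items() if c >= min_freq}

    # 3. Drop fragments that only ever occur inside a longer candidate
    #    (same count), e.g. a piece of a repeated five-character phrase.
    maximal = {
        s: c
        for s, c in frequent.items()
        if not any(s != t and s in t and c == frequent[t] for t in frequent)
    }

    # 4. Weight by frequency and length, then keep the top-weighted ones.
    weights = {s: c * len(s) for s, c in maximal.items()}
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    sample = "文本聚类是文本检索的有效方法。文本聚类依赖关键词提取,关键词提取不依赖词典。"
    for phrase, weight in extract_keywords(sample, top_k=3):
        print(phrase, weight)
```

Because the candidates come from raw character sequences rather than a segmenter's vocabulary, a sketch of this kind can surface transliterated or newly coined words as long as they repeat, which is the property the abstract emphasizes.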
Keywords/Search Tags: Chinese Text Processing, Text Clustering, Keyword Extraction, Transliterated Words, Internet New Words, Cluster Center Point, Vector Space Model
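The similarity-based cluster center point selection summarized in the abstract can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the thesis's procedure: the names `cosine` and `select_centers`, the use of keyword-count vectors with cosine similarity as the vector space model, the "summed similarity to all other texts" score, the greedy selection order, and the `max_center_sim` threshold are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse keyword-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_centers(keyword_lists, k, max_center_sim=0.3):
    """Greedily pick up to k texts as cluster center points.

    Assumed criteria (not from the thesis): a good center has a large
    summed similarity to the other texts, and any two chosen centers
    have similarity no greater than max_center_sim.
    """
    # Vector space model: one sparse count vector per text.
    vectors = [Counter(kws) for kws in keyword_lists]
    n = len(vectors)

    # Pairwise similarity between every text and every other text.
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

    # Score each text by how similar it is to all the others.
    score = [sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]

    centers = []
    for i in sorted(range(n), key=lambda idx: -score[idx]):
        # Accept a candidate only if it is dissimilar to every chosen center.
        if all(sim[i][c] <= max_center_sim for c in centers):
            centers.append(i)
        if len(centers) == k:
            break
    return centers


if __name__ == "__main__":
    docs = [
        ["text", "clustering", "keyword"],
        ["text", "clustering", "center"],
        ["text", "keyword", "weight"],
        ["football", "match", "goal"],
        ["football", "match"],
        ["football", "goal", "league"],
    ]
    print(select_centers(docs, k=2))
```

Computing the full pairwise similarity matrix costs time quadratic in the number of texts, which is consistent with the abstract's remark that similarity-based center selection takes a long time.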