New Word Recognition And Hot Word Ranking Methods

Posted on: 2014-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: S H Geng
Full Text: PDF
GTID: 2268330392472186
Subject: Computer software and theory
Abstract/Summary:
The development of the Internet has made it much easier for people to exchange information, but the resulting flood of information also poses challenges for natural language processing and for lexicographers. Extracting recent, popular, and useful information from this mass of text is becoming increasingly important, and it inevitably involves extracting new words and hot words. This thesis studies new word recognition and hot word extraction.

There is so far no consensus on the definition of a new word. In this thesis, a new word is defined as a word that does not appear in the dictionary, and new words are divided into three groups: time words and quantifiers, named entities, and ordinary new words. Ordinary new words make up the largest proportion, so they are the main object of study.

We propose a new word recognition method that combines statistics and rules. It evaluates a candidate word in terms of its internal tightness, measured by mutual information, and its freedom of usage, measured by left-right entropy. The method proceeds in steps: first, preprocess the corpus; second, collect statistics on repeated strings using a suffix array; third, compute the mutual information and the left-right entropy of each repeated string and use these values to filter the strings; fourth, determine word boundaries with the proposed Score function; fifth, filter noisy strings using a garbage dictionary. The remaining strings are reported as automatically detected new words. The method greatly improves the recall rate. Its F-value averages 70%, an improvement of 1% to 5%, which indicates that the method is feasible.

The extraction of hot words is also studied. Hot-word ranking in this thesis is based on a user-voting ranking method. We use the Bayesian average and Newton's law of cooling to quantify the heat of words and propose a more reasonable quantization value. We also propose a hot word evaluation standard, and experiments show that this quantitative method is feasible and effective.
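
As an illustration of the two statistics named in the abstract, the following Python sketch computes mutual information and left-right entropy for a candidate string. It is a minimal sketch under stated assumptions, not the thesis's implementation: substrings are counted with a plain dictionary instead of a suffix array, probabilities are approximated as count divided by text length, the function names are hypothetical, and the thesis's Score function and thresholds are not reproduced because the abstract does not specify them.

```python
import math
from collections import Counter


def ngram_counts(text, max_len):
    """Count every substring of length 1..max_len.

    Assumption: a simple stand-in for the suffix-array statistics
    mentioned in the abstract; fine for small illustrative inputs.
    """
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def mutual_information(candidate, counts, total):
    """Minimum pointwise mutual information over all two-way splits.

    A high value suggests the parts of the candidate co-occur far more
    often than chance, i.e. the string is internally "tight".
    Assumption: probabilities are approximated as count / total.
    """
    p_whole = counts[candidate] / total
    best = math.inf
    for k in range(1, len(candidate)):
        p_left = counts[candidate[:k]] / total
        p_right = counts[candidate[k:]] / total
        best = min(best, math.log2(p_whole / (p_left * p_right)))
    return best


def entropy(neighbor_counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in neighbor_counts.values())


def left_right_entropy(candidate, text):
    """Entropy of the characters immediately left and right of the candidate.

    High entropy on both sides suggests the string is used freely in
    many contexts, which is evidence of a word boundary.
    """
    left, right = Counter(), Counter()
    start = text.find(candidate)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(candidate)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(candidate, start + 1)
    return entropy(left), entropy(right)


# Toy input: each letter stands in for one character of a real corpus.
sample_text = "xabyabzabw"
sample_counts = ngram_counts(sample_text, max_len=2)
print(mutual_information("ab", sample_counts, len(sample_text)))
print(left_right_entropy("ab", sample_text))
```

In a filtering step of the kind the abstract describes, a candidate would be kept only if both its mutual information and the smaller of its two entropies exceed some threshold; the threshold values would have to be tuned on real data.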
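
The two ingredients of the hot-word score can be sketched in the same way. The abstract does not say how the thesis combines the Bayesian average with Newton's law of cooling, so the code below only shows each formula in its textbook form. The function names, the parameters `prior_votes`, `prior_mean`, and `cooling_rate`, and the sample numbers are all assumptions made for illustration.

```python
import math


def bayesian_average(item_votes, item_mean, prior_votes, prior_mean):
    """Textbook Bayesian average.

    Pulls the mean of an item with few votes toward a global prior mean;
    prior_votes controls how strongly (assumed tuning parameter).
    """
    return ((prior_votes * prior_mean + item_votes * item_mean)
            / (prior_votes + item_votes))


def cooled_heat(previous_heat, hours_elapsed, cooling_rate):
    """Newton's law of cooling with an ambient temperature of zero.

    Heat decays exponentially over time, so a word that stops being
    mentioned gradually drops out of the ranking.
    """
    return previous_heat * math.exp(-cooling_rate * hours_elapsed)


# Hypothetical example: a word scored 5.0 on average by 10 voters, smoothed
# against a global mean of 3.0 that carries the weight of 10 votes.
smoothed = bayesian_average(item_votes=10, item_mean=5.0,
                            prior_votes=10, prior_mean=3.0)

# Hypothetical example: let that score cool for 24 hours with a
# half-life of 24 hours (cooling_rate = ln 2 / 24).
print(cooled_heat(smoothed, hours_elapsed=24, cooling_rate=math.log(2) / 24))
```

The Bayesian average keeps rarely seen words from ranking high on a handful of votes, while the cooling term favours words that are popular now over words that were popular in the past.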
Keywords/Search Tags: New Word Identification, Mutual Information, Left-Right Entropy, Bayesian Average, Newton’s Law of Cooling