
Research Of Chinese New Word Identification

Posted on: 2010-11-15
Degree: Master
Type: Thesis
Country: China
Candidate: L Xu
Full Text: PDF
GTID: 2178360302960675
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the internet and society, more and more new words have entered daily life. At the same time, the popularity of the internet gives the general public ample opportunity to coin new words. New words enrich everyday expression, but they also complicate Chinese information processing: when new words occur in a corpus, a lexical analyzer segments them into scattered single characters, which degrades the precision of downstream extraction. Research on named entities such as person names, place names, and organization names has already achieved good results, but the identification of common new words still awaits a breakthrough.

This thesis presents a method that combines statistical algorithms with linguistic rules. According to the compositional patterns of different new words, linguistic knowledge is used to classify the new-word identification problem, taking the single-character mode and the suffix-character mode as the research objects.

First, a large-scale news corpus is downloaded from the internet; HTML tags are removed and other preprocessing is applied to obtain plain text. The corpus is then part-of-speech tagged, and repeated strings are collected on the basis of part-of-speech information and a stop-word list, yielding candidate new words.

For single-character-mode candidates, both the internal cohesion and the external linguistic environment of each candidate are examined: on the basis of the inside-word probability, a combination of the average mutual information model and the left-and-right entropy model is used to filter them. For the suffix-character mode, a garbage-tail dictionary is trained on a large corpus to filter the candidates.

The average mutual information model and the left-and-right entropy model, both built on the inside-word probability model, are compared.
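The two statistical filters named above can be illustrated with a minimal sketch. The function and variable names, the shared normalizing count, and the toy frequencies are all hypothetical simplifications for illustration, not the thesis's actual formulas:

```python
import math
from collections import Counter

def average_mutual_information(candidate, unigram_freq, bigram_freq, total):
    # Pointwise mutual information between the two characters of a
    # two-character candidate; higher values suggest the pair co-occurs
    # more often than chance, i.e. forms a cohesive word.
    a, b = candidate[0], candidate[1]
    p_ab = bigram_freq[candidate] / total
    p_a = unigram_freq[a] / total
    p_b = unigram_freq[b] / total
    return math.log2(p_ab / (p_a * p_b))

def branch_entropy(neighbor_counts):
    # Entropy of the character distribution on one side (left or right)
    # of a candidate. High entropy on both sides means the candidate
    # appears in many contexts and so behaves like a free-standing word.
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in neighbor_counts.values())

# Toy counts (hypothetical): the candidate "微博" seen 2 times out of 10.
uni = Counter({"微": 2, "博": 2})
bi = Counter({"微博": 2})
print(average_mutual_information("微博", uni, bi, 10))  # log2(5) ≈ 2.32
print(branch_entropy(Counter({"的": 1, "上": 1})))       # 1.0 bit
```

A candidate would typically be kept only when it scores above tuned thresholds on both measures, which is one way the two models can be combined.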
Experiments show that the former outperforms the latter: the F-measure of the former is 49.81%, versus 46.69% for the latter. Moreover, the combination of the average mutual information and left-and-right entropy models achieves a precision of 70.08% and a recall of up to 77.54%, which shows that the two models are complementary to some extent.
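Assuming the thesis uses the standard balanced F-measure, the quoted precision and recall of the combined model imply an F-measure via the usual harmonic mean:

```python
def f_measure(precision, recall):
    # Balanced F-measure (F1): harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Combined-model figures quoted in the abstract:
print(round(f_measure(0.7008, 0.7754), 4))
```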
Keywords/Search Tags: Natural Language Processing, New Word Identification, Left and Right Entropy, Inside Word Probability, Average Mutual Information