
Research Of Chinese New Word Identification

Posted on: 2010-11-15
Degree: Master
Type: Thesis
Country: China
Candidate: L Xu
Full Text: PDF
GTID: 2178360302960675
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the internet and society, more and more new words have entered daily life. At the same time, the popularity of the internet gives the general public ample opportunity to coin new words. New words enrich everyday expression, but they also complicate Chinese information processing: when new words occur in a corpus, a lexical analyzer segments them into scattered single characters, which degrades the precision of downstream extraction. Research on named entities such as person names, place names, and organization names has already achieved good results, but the identification of common new words still awaits a breakthrough.

This thesis presents a method that combines statistical algorithms with linguistic rules. According to the compositional patterns of different new words, linguistic knowledge is used to classify the new-word identification problem, taking the single-character mode and the suffix-character mode as the research objects.

First, a large-scale news corpus is downloaded from the internet; HTML tags are removed and other preprocessing is applied to obtain plain text. The corpus is then part-of-speech tagged, and repeated strings are collected on the basis of part-of-speech information and a stop-word list, yielding candidate new words.

For single-character-mode candidates, both the internal cohesion and the external linguistic environment of each candidate are examined: on the basis of the inside-word probability, a combination of the average mutual information model and the left-and-right entropy model is used to filter them. For the suffix-character mode, a garbage-tail dictionary is trained on a large corpus to filter the candidates.

The average mutual information model and the left-and-right entropy model, both built on the inside-word probability model, are compared.
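The two statistical filters named above can be illustrated with a minimal sketch. The function and variable names, the shared normalizing count, and the toy frequencies are all hypothetical simplifications for illustration, not the thesis's actual formulas:

```python
import math
from collections import Counter

def average_mutual_information(candidate, unigram_freq, bigram_freq, total):
    # Pointwise mutual information between the two characters of a
    # two-character candidate; higher values suggest the pair co-occurs
    # more often than chance, i.e. forms a cohesive word.
    a, b = candidate[0], candidate[1]
    p_ab = bigram_freq[candidate] / total
    p_a = unigram_freq[a] / total
    p_b = unigram_freq[b] / total
    return math.log2(p_ab / (p_a * p_b))

def branch_entropy(neighbor_counts):
    # Entropy of the character distribution on one side (left or right)
    # of a candidate. High entropy on both sides means the candidate
    # appears in many contexts and so behaves like a free-standing word.
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in neighbor_counts.values())

# Toy counts (hypothetical): the candidate "微博" seen 2 times out of 10.
uni = Counter({"微": 2, "博": 2})
bi = Counter({"微博": 2})
print(average_mutual_information("微博", uni, bi, 10))  # log2(5) ≈ 2.32
print(branch_entropy(Counter({"的": 1, "上": 1})))       # 1.0 bit
```

A candidate would typically be kept only when it scores above tuned thresholds on both measures, which is one way the two models can be combined.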
Experiments show that the former outperforms the latter: the F-measure of the former is 49.81%, versus 46.69% for the latter. Moreover, the combination of the average mutual information and left-and-right entropy models achieves a precision of 70.08% and a recall of up to 77.54%, which shows that the two models are complementary to some extent.
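Assuming the thesis uses the standard balanced F-measure, the quoted precision and recall of the combined model imply an F-measure via the usual harmonic mean:

```python
def f_measure(precision, recall):
    # Balanced F-measure (F1): harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Combined-model figures quoted in the abstract:
print(round(f_measure(0.7008, 0.7754), 4))
```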
Keywords/Search Tags: Natural Language Processing, New Word Identification, Left and Right Entropy, Inside Word Probability, Average Mutual Information