A Study on the Identification of Low-Frequency Two-Character Unknown Words

Posted on:2013-01-16

Degree:Master

Type:Thesis

Country:China

Candidate:S Wang

GTID:2245330395453278

Subject:Linguistics and Applied Linguistics

Abstract/Summary:

Unknown words are the main factor limiting the accuracy of automatic Chinese word segmentation. Low-frequency words are the hardest part of unknown word recognition, and low-frequency two-character words (double words) are an important component of low-frequency unknown words. This thesis therefore focuses on how to identify low-frequency two-character words efficiently. It adopts a method that combines several statistical measures with rules, and achieves some results.
To improve identification efficiency and to keep the experimental results statistically valid, preprocessing is carried out in three steps: first, segment the text and extract word fragments; second, identify named entities, an important component of unknown words; third, identify multi-character unknown words. Low-frequency two-character unknown words are then identified among the remaining fragments using a combination of statistics and rules: mutual information, word-formation probability, non-words, the entropy of adjacent words, and morpheme combination. Although the experimental results are only moderate, the method still has practical value for assisted identification and new word extraction, and can considerably reduce the burden of manual identification. During recognition we found that ambiguity in the definition of a word, together with inconsistent segmentation in the corpus, is an important reason why two-character unknown words are hard to identify correctly. We studied this problem in depth and proposed a reasonable definition of new two-character words. We then annotated a small test corpus according to this definition; with the same identification method, precision and recall improved greatly.
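
Two of the statistical measures named above, mutual information and the entropy of adjacent words, can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's implementation: the abstract gives no formulas, so the pointwise mutual information definition, the left/right entropy definition, the function names `pmi_scores` and `adjacent_entropy`, and the toy input string are all assumptions.

```python
import math
from collections import Counter


def pmi_scores(text):
    """Pointwise mutual information of every adjacent character pair.

    Assumed definition: log2(P(xy) / (P(x) * P(y))), with probabilities
    estimated from raw counts in `text`.
    """
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    scores = {}
    for pair, count in pairs.items():
        p_xy = count / n_pairs
        p_x = chars[pair[0]] / n_chars
        p_y = chars[pair[1]] / n_chars
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores


def _entropy(counter):
    """Shannon entropy (in bits) of a frequency table."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counter.values())


def adjacent_entropy(text, candidate):
    """Entropy of the characters seen just left and just right of `candidate`.

    A high value on both sides suggests the candidate occurs in varied
    contexts, which is commonly read as evidence that it is a free unit.
    """
    left, right = Counter(), Counter()
    start = text.find(candidate)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(candidate)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(candidate, start + 1)
    return _entropy(left), _entropy(right)


if __name__ == "__main__":
    toy = "abcabdabe"  # toy string; each letter stands in for one character
    print(pmi_scores(toy)["ab"])
    print(adjacent_entropy(toy, "ab"))
```

For low-frequency candidates both statistics rest on very few observations, which is consistent with the abstract's point that statistics alone give only moderate results and need to be combined with rules.
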
Finally, we also proposed and implemented a web-based discrimination method that quantifies the property of "close and stable combination". In experiments on judging low-frequency two-character unknown words, this method reached an F value of up to 86%. This suggests that the use of network resources may bring a breakthrough in automatic segmentation, and especially in the automatic identification of unknown words.
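
For readers unfamiliar with the F value reported above, the following sketch shows how it is usually derived from precision and recall. The abstract does not say which F variant was used, so the balanced F1 form (the harmonic mean of precision and recall) is an assumption; the function name `f_value` and the counts in the example are hypothetical and are not taken from the thesis.

```python
def f_value(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from raw identification counts.

    Assumes the balanced F1 measure: 2 * P * R / (P + R).
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    p, r, f = f_value(true_positives=8, false_positives=2, false_negatives=2)
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```
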

Keywords/Search Tags:

low-frequency, double word, unknown words, morpheme, network retrieval

Related items

1	Japanese Students Of Chinese Double Word Sense Of Word Formation Of Compound Words The Development Of Experimental Research
2	The Developmental Study On The Effects Of Word Frequency And Initial And Last Character Frequency For Chinese Two-morpheme Words Recognition
3	The Internal Structure Of The Common Double-tone Compound Words In Modern Chinese
4	Research On The Attributes Of High Frequency Modern Chinese Morpheme Items
5	The Construction Of Chinese Morpheme Words Knowledge Base And Its Application In Understanding Unregistered Words
6	The Lexical Structure And Semantic Structure Analysis Of Forty Thousand Words
7	Morphology in visual word recognition
8	The Grammatical Function Of The Unknown Word Guessing
9	The Influence Of Morpheme Frequency On The Recognition Of Compound Words
10	A Study On The Semantic Word Formation Of Two-Word Words For The Identification And Understanding Of Ordinary Unsigned Signals