
A Study on the Identification of Low-Frequency Two-Character Unknown Words

Posted on: 2013-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: S Wang
Full Text: PDF
GTID: 2245330395453278
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
Unknown words are the main factor limiting the accuracy of automatic Chinese word segmentation. Low-frequency words are the hardest unknown words to recognize, and two-character ("double") words are an important component of low-frequency unknown words. This thesis therefore focuses on how to identify low-frequency two-character unknown words efficiently, using a combination of several statistical measures and rules, and achieves some results.

To improve identification efficiency and to keep the experimental results statistically valid, we preprocess the text in three steps. First, we segment the text and extract the word fragments. Second, we identify named entities, an important component of unknown words. Third, we identify multi-character unknown words. We then identify low-frequency two-character unknown words in the remaining fragments by combining statistical and rule-based approaches: mutual information, word-formation probability, non-words, the entropy of adjacent words, and morpheme combination. Although the experimental results are only moderate, the method still has practical value for assisted identification and new-word extraction, and can considerably reduce the burden of manual recognition.

During recognition we found that ambiguity in the definition of a word, together with inconsistent word segmentation in the corpus, is an important reason why two-character unknown words are difficult to identify correctly. We studied this problem in depth and proposed a new, more reasonable definition of two-character words. We then annotated a small test corpus accordingly; with the same identification method, precision and recall both improved greatly.
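Two of the statistics named above, mutual information and the entropy of adjacent characters, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names, the use of pointwise mutual information over character bigrams, and the toy corpus are all hypothetical, and the abstract does not say how the scores are thresholded or combined with the rules.

```python
import math
from collections import Counter, defaultdict


def bigram_statistics(corpus):
    """Collect character, bigram, and neighbour counts from a list of strings.

    Hypothetical helper: the thesis does not describe its counting procedure.
    """
    char_counts = Counter()
    bigram_counts = Counter()
    left_neighbours = defaultdict(Counter)
    right_neighbours = defaultdict(Counter)
    for text in corpus:
        char_counts.update(text)
        for i in range(len(text) - 1):
            bigram = text[i:i + 2]
            bigram_counts[bigram] += 1
            if i > 0:
                left_neighbours[bigram][text[i - 1]] += 1
            if i + 2 < len(text):
                right_neighbours[bigram][text[i + 2]] += 1
    return char_counts, bigram_counts, left_neighbours, right_neighbours


def pointwise_mutual_information(bigram, char_counts, bigram_counts):
    """log2( P(xy) / (P(x) * P(y)) ); higher means the two characters cohere."""
    if bigram_counts[bigram] == 0:
        return float("-inf")  # unseen bigram: no evidence of cohesion
    total_chars = sum(char_counts.values())
    total_bigrams = sum(bigram_counts.values())
    p_xy = bigram_counts[bigram] / total_bigrams
    p_x = char_counts[bigram[0]] / total_chars
    p_y = char_counts[bigram[1]] / total_chars
    return math.log2(p_xy / (p_x * p_y))


def neighbour_entropy(neighbour_counts):
    """Shannon entropy (bits) of the characters adjacent to a candidate.

    High entropy on both sides suggests the candidate appears in varied
    contexts, which is commonly taken as evidence of a free-standing word.
    """
    total = sum(neighbour_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in neighbour_counts.values())


if __name__ == "__main__":
    # Toy corpus (illustrative only, not data from the thesis).
    chars, bigrams, left, right = bigram_statistics(["xaby", "zabw"])
    print(pointwise_mutual_information("ab", chars, bigrams))
    print(neighbour_entropy(left["ab"]), neighbour_entropy(right["ab"]))
```

For low-frequency candidates these counts are sparse by definition, which is one reason such statistics are usually paired with rules or external evidence rather than used alone.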
Finally, we proposed and implemented a web-based discrimination method that quantifies the property of "close and stable combination". In experiments on judging low-frequency two-character unknown words, the method reached an F value of up to 86%. This suggests that web resources may offer a breakthrough in improving automatic segmentation, and in particular the automatic identification of unknown words.
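For reference, the F value reported above is conventionally derived from precision and recall. The sketch below assumes the balanced F1 variant (the abstract does not say which weighting was used); the function name and the counts in the usage example are hypothetical and are not results from the thesis.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """Return (precision, recall, F_beta) from raw identification counts.

    beta=1.0 gives the balanced F1 score; this choice is an assumption.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    beta_sq = beta * beta
    f_score = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Illustrative counts only: 8 correct candidates, 2 spurious, 2 missed.
    print(f_measure(8, 2, 2))
```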
Keywords/Search Tags: low-frequency, double word, unknown words, morpheme, network retrieval