
A New Method for Automatic Recognition of Unknown Words Based on a Forum Corpus

Posted on: 2011-07-08
Degree: Master
Type: Thesis
Country: China
Candidate: J Dou
GTID: 2208360302997796
Subject: Computer software and theory
Abstract/Summary:
Identifying unknown Chinese words is a bottleneck in Chinese word segmentation. This thesis proposes downloading a sufficient number of web documents from BBS forums with a web spider to build a corpus that is updated periodically. A candidate word list is then generated by extracting words from the corpus with a new statistical function. Finally, the candidate list is compared with the existing lexicon to recognize the unknown words. Experiments showed that the proposed method was more efficient.

Chinese words differ from English words in their characteristics. Because of the way the Chinese language is composed and used, segmenting Chinese text into words is a harder problem than segmenting English.

At present, Chinese word segmentation algorithms fall into three main categories: string-matching-based, understanding-based, and statistics-based. All three have problems with unknown words to varying degrees. String-matching algorithms fundamentally cannot recognize unknown words. Understanding-based algorithms are difficult to implement and have high time and space complexity, so they are not widely used. Statistical algorithms are currently the more feasible and popular approach, although they still make some identification errors. Overall, the statistical approach is the relatively feasible and practical one.

This thesis studies unknown Chinese word identification based on statistical algorithms. First, Chinese word segmentation, and unknown word recognition in particular, is described. Second, the traditional word segmentation algorithms and segmentation systems are analyzed and compared. There are three kinds of traditional Chinese word segmentation algorithms: string-matching-based, understanding-based, and statistics-based.
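To illustrate why string-matching segmentation cannot recognize unknown words, here is a minimal sketch of forward maximum matching, one common string-matching scheme. The thesis abstract does not say which matching variant it analyzed, so the function name `forward_max_match`, the `max_len` parameter, and the toy lexicon are all assumptions made for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation using the longest lexicon entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking until a lexicon entry matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; "微博" is deliberately missing from it.
lexicon = {"我们", "喜欢", "论坛"}
print(forward_max_match("我们喜欢论坛", lexicon))  # every word is in the lexicon
print(forward_max_match("我们喜欢微博", lexicon))  # "微博" falls apart into single characters
```

Because the matcher can only emit lexicon entries or single characters, a word absent from the lexicon is always split, no matter how often it occurs in the text.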
Mechanical (string-matching) algorithms fundamentally cannot extract unknown words. Understanding-based algorithms are complex and difficult to implement, so they have not been widely developed or applied in practice. Statistical algorithms can handle some unknown words and have become more popular, but they still produce misjudgments and leave some cases undetermined.

To address these shortcomings of the traditional approaches, this thesis proposes downloading a sufficient number of web documents from BBS forums with a web spider to build a corpus that is updated periodically, which keeps the corpus timely. A candidate word list is then generated by extracting words from the corpus with a new statistic, MD, which combines the Mutual Information function with the Duplicated Combination Frequency. The candidate list is compared with the existing lexicon to recognize the unknown words. A program was then designed along these lines and a test environment was set up. Two indicators, new-word recall and accuracy, show that the proposed method for automatically recognizing unknown words is feasible.
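The pipeline described above (score candidates from the corpus, compare with the lexicon, then measure recall and accuracy) can be sketched roughly as follows. The abstract does not give the MD formula, so the score used here, pointwise mutual information weighted by the log of the pair frequency, is only a stand-in assumption and not the author's definition. The function names, the thresholds, the restriction to two-character candidates, and the toy posts are likewise hypothetical, and "accuracy" is interpreted as precision.

```python
import math
from collections import Counter


def candidate_words(posts, min_count=2, min_score=1.0):
    """Score adjacent character pairs and keep those above both thresholds."""
    char_counts = Counter()
    pair_counts = Counter()
    for post in posts:
        char_counts.update(post)
        pair_counts.update(post[i:i + 2] for i in range(len(post) - 1))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())

    scored = {}
    for pair, count in pair_counts.items():
        if count < min_count:
            continue
        p_pair = count / total_pairs
        p_first = char_counts[pair[0]] / total_chars
        p_second = char_counts[pair[1]] / total_chars
        pmi = math.log2(p_pair / (p_first * p_second))
        # ASSUMPTION: stand-in for the MD statistic, whose formula the abstract omits.
        score = pmi * math.log2(1 + count)
        if score >= min_score:
            scored[pair] = score
    return scored


def unknown_words(candidates, lexicon):
    """Candidates that the existing lexicon does not contain."""
    return {word for word in candidates if word not in lexicon}


def precision_recall(predicted, gold):
    """Precision and recall of predicted new words against a hand-labeled gold set."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical forum posts, assumed to have been downloaded by the spider already.
posts = ["微博很火", "我刷微博", "微博热搜"]
lexicon = {"热搜"}
found = unknown_words(candidate_words(posts), lexicon)
print(found)
print(precision_recall(found, {"微博"}))
```

Rerunning the same extraction whenever the corpus is refreshed is what would let newly coined forum words surface over time. A real system would also need longer candidates and filtering of punctuation, which this sketch leaves out.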
Keywords/Search Tags: unknown word, Chinese word segmentation, web spider, corpus