Research on Context-Aware Chinese New Word Identification

Posted on: 2013-06-05
Degree: Master
Type: Thesis
Country: China
Candidate: W Tu
Full Text: PDF
GTID: 2248330371476983
Subject: Computer application technology
Abstract/Summary:
Language, as a social barometer, records the history of human civilization vividly and faithfully. Social progress, the rapid development of the Internet, and the emergence of new things lay a solid foundation for the emergence of new words. These new words mostly reflect new things and new phenomena in society, and also reflect how people's ideas change over time. The appearance of new words enriches and develops the vocabulary system, giving it a distinct flavor of the times, and makes communication more convenient and vivid. At the same time, it brings difficulties to a foundation of Chinese information processing technology: automatic word segmentation. How to identify these new words effectively has become a bottleneck of automatic Chinese word segmentation.

In languages such as the Indo-European ones, words are generally separated by natural boundaries, but in Chinese they are not. Each Chinese character has a strong word-formation ability, which means that any sequence of adjacent Chinese characters may form a word. This is the main difficulty in identifying Chinese new words. Based on an analysis of how new words are produced, their characteristics, and their distribution, this thesis puts forward a context-aware algorithm for new word identification.

First, this thesis uses Web spiders to fetch text from the Web as the source of the corpus, to ensure the efficiency of the corpus. According to the structural features of Web pages, the pages fetched by the Web spiders are stored as DOM trees, and the text content is then extracted by tag to build the corpus.
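The tag-based extraction step can be illustrated with a minimal sketch. It is not the thesis's implementation: the class and function names (`Node`, `TreeBuilder`, `html_to_corpus_lines`), the choice of content tags (`p`, `h1` to `h3`), and the skipped tags (`script`, `style`) are all assumptions made for illustration. The sketch builds a simplified DOM-like tree with Python's standard `html.parser` and then walks it to collect text.

```python
from html.parser import HTMLParser

# Assumed tag sets; the thesis does not say which tags it uses.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SKIP_TAGS = {"script", "style"}


class Node:
    """One element of a simplified DOM-like tree."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Node objects or raw text strings


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects from HTML parser events."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never have children
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data)


def node_text(node):
    """Concatenate all text under a node, skipping script/style subtrees."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif child.tag not in SKIP_TAGS:
            parts.append(node_text(child))
    return "".join(parts)


def extract_text(node, wanted=("p", "h1", "h2", "h3")):
    """Walk the tree and return the text of each element whose tag is wanted."""
    lines = []
    for child in node.children:
        if isinstance(child, str) or child.tag in SKIP_TAGS:
            continue
        if child.tag in wanted:
            text = " ".join(node_text(child).split())  # normalize whitespace
            if text:
                lines.append(text)
        else:
            lines.extend(extract_text(child, wanted))
    return lines


def html_to_corpus_lines(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return extract_text(builder.root)
```

Calling `html_to_corpus_lines(page_source)` on a fetched page returns one string per matching element, which could then be appended to the corpus. A real crawler would also need to handle encodings, malformed markup, and site-specific templates, none of which are covered here.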
Second, this thesis analyzes the characteristics of existing new word recognition methods and identifies their advantages and disadvantages. Combining these with the frequency distribution, word length, and other characteristics of new words in the corpus, it improves the N-Gram approach to obtain frequently repeated strings as candidate word strings. Then, a new Prediction by Partial Matching (PPM) algorithm is proposed to identify new words among the candidate strings: the candidates are used to build a context prediction model, and further identification is carried out according to that model. In addition, the thesis analyzes the features of current replacement algorithms and updates the lexicon using the least recently used (LRU) algorithm, so as to enrich and develop the vocabulary while ensuring the efficiency of the lexicon. Finally, based on the idea of the context-aware new word recognition algorithm, the thesis designs the experiment, builds the experimental environment, and measures the algorithm's performance. The experimental results show that the algorithm is effective.
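Two of the steps above, collecting repeated strings as candidates and keeping the lexicon bounded with LRU replacement, can be sketched as follows. This is plain character n-gram counting and a generic LRU cache, not the thesis's improved N-Gram method or its PPM model, whose details the abstract does not give. The function names, the n-gram lengths (2 to 4), the minimum count, and the lexicon capacity are all assumptions.

```python
from collections import Counter, OrderedDict


def repeated_ngrams(sentences, min_n=2, max_n=4, min_count=2):
    """Count character n-grams and keep those seen at least min_count times."""
    counts = Counter()
    for s in sentences:
        for n in range(min_n, max_n + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return {g: c for g, c in counts.items() if c >= min_count}


def drop_subsumed(candidates):
    """Drop a string that only ever occurs inside a longer candidate,
    approximated here as: same count as a longer candidate containing it."""
    kept = {}
    for g, c in candidates.items():
        subsumed = any(
            g != h and g in h and c == ch for h, ch in candidates.items()
        )
        if not subsumed:
            kept[g] = c
    return kept


class LRULexicon:
    """A fixed-capacity lexicon that evicts the least recently used word."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._words = OrderedDict()  # word -> number of recorded uses

    def touch(self, word):
        """Record a use of word, inserting it if new and evicting if full."""
        if word in self._words:
            self._words.move_to_end(word)  # mark as most recently used
            self._words[word] += 1
        else:
            self._words[word] = 1
            if len(self._words) > self.capacity:
                self._words.popitem(last=False)  # evict the oldest entry

    def __contains__(self, word):
        return word in self._words
```

In such a pipeline, the candidates returned by `drop_subsumed(repeated_ngrams(corpus_lines))` would still need a further filtering stage (the role the abstract assigns to the PPM-based context model) before accepted words are recorded with `LRULexicon.touch`.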
Keywords/Search Tags: Chinese word segmentation, new word recognition, Prediction by Partial Matching, N-Gram