Research For Chinese New Word Identification Based On Context-aware

Posted on:2013-06-05

Degree:Master

Type:Thesis

Country:China

Candidate:W Tu

Full Text:PDF

GTID:2248330371476983

Subject:Computer application technology

Abstract/Summary:

Language, as a social barometer, records the history of human civilization vividly and faithfully. Social progress, the rapid development of the Internet, and the emergence of new things lay a solid foundation for the emergence of new words. These new words mostly reflect new things and new phenomena in society, and they also reflect how people's ideas change. The appearance of new words enriches and develops the vocabulary system with a distinct flavor of the times and makes communication more convenient and vivid; at the same time, new words bring difficulties to automatic word segmentation, the foundation of Chinese information processing technology. Identifying these new words effectively has become a bottleneck in Chinese automatic segmentation.
Languages such as the Indo-European ones generally have a natural boundary between words, but Chinese does not. Each Chinese character has strong word-formation ability, which means that any sequence of adjacent Chinese characters may form a word. This is the main difficulty in identifying Chinese new words. Based on an analysis of how new words are produced, their characteristics, and their distribution, this thesis proposes context-aware algorithms for new word identification.
First, the thesis uses Web spiders to fetch Web text as the source of the corpus, to ensure the efficiency of the corpus. Following the structure of Web pages, each page fetched by the spiders is stored as a DOM tree, and the text content is then extracted according to its tags to build the corpus.
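The tag-based text extraction step can be illustrated with a minimal sketch. It is not the thesis's implementation: it uses the event-based parser in Python's standard library rather than an explicit DOM tree, and the class name `TextExtractor`, the function `extract_text`, and the choice of `script` and `style` as tags to skip are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside script/style tags."""

    SKIP_TAGS = {"script", "style"}  # assumed set of non-content tags

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self.chunks = []       # extracted text fragments, in page order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text fragments of an HTML page as a list of strings."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


page = ("<html><head><style>p{}</style></head><body>"
        "<p>新词识别</p><script>var a=1;</script><div>中文分词</div>"
        "</body></html>")
corpus_fragments = extract_text(page)
```

A real crawler would also need to handle character encodings, navigation menus, and advertisements, which this sketch leaves out.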
Second, the thesis analyzes the characteristics of existing new word recognition methods and identifies their advantages and disadvantages. Combining these findings with the frequency distribution, word length, and other characteristics of new words in the corpus, it improves the N-Gram approach to extract frequently repeated strings as candidate word strings. A new PPM (Prediction by Partial Matching) algorithm is then proposed to identify new words among the candidates: the candidate strings are used to build a context prediction model, and further identification is carried out according to that model. In addition, the thesis analyzes the features of current replacement algorithms and updates the lexicon using the LRU (least recently used) algorithm, which enriches the vocabulary system and keeps the lexicon efficient. Finally, following the idea of the context-aware new word recognition algorithm, the thesis designs experiments, builds the experimental environment, and measures performance. The experimental results show that the algorithm is effective.
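Two of the steps above, collecting repeated strings as candidates and maintaining the lexicon with LRU replacement, can be sketched as follows. This is an illustration under stated assumptions, not the thesis's method: it uses plain n-gram counting rather than the improved N-Gram approach, it omits the PPM prediction model entirely, and the names `repeated_strings` and `LRULexicon` as well as the length and frequency thresholds are hypothetical.

```python
import re
from collections import Counter, OrderedDict


def repeated_strings(text, min_len=2, max_len=4, min_count=2):
    """Count character n-grams inside runs of Chinese characters and
    keep those that occur at least min_count times."""
    counts = Counter()
    # Punctuation and other non-Chinese characters act as boundaries.
    for segment in re.findall(r"[\u4e00-\u9fff]+", text):
        for n in range(min_len, max_len + 1):
            for i in range(len(segment) - n + 1):
                counts[segment[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


class LRULexicon:
    """Fixed-capacity lexicon that evicts the least recently used word."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._words = OrderedDict()  # oldest entry first

    def add(self, word):
        self._words[word] = True
        self._words.move_to_end(word)          # mark as most recent
        if len(self._words) > self.capacity:
            self._words.popitem(last=False)    # drop least recent

    def __contains__(self, word):
        if word in self._words:
            self._words.move_to_end(word)      # a lookup counts as use
            return True
        return False


sample = "给力的表现很给力，大家都说给力"
candidates = repeated_strings(sample, min_len=2, max_len=2)

lexicon = LRULexicon(capacity=2)
for word in candidates:
    lexicon.add(word)
```

In a full pipeline the candidates would be filtered by a context model before entering the lexicon; here they are added directly only to show how the two pieces connect.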

Keywords/Search Tags:

Chinese word segmentation, new word recognition, Prediction by Partial Matching, N-Gram

Related items

1	Research On Chinese Word Segmentation Algorithm Based On News Text
2	The Research On Chinese Word Segmentation System Based On SVM
3	Research And Implementation Of Chinese Word Segmentation Algorithm
4	Research On Algorithm For Network New Word Recognition
5	Statistical Learning In Chinese Word Segmentation And Application-specific Segmentation
6	Comparative Research On Open-Source Chinese Word Segmentation Machines
7	Improvement Of Chinese N-gram Segmentation Model
8	Research And Application On Chinese Automatic Word Segmentation In Full Text Retrieval
9	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
10	Maximum Matching Chinese Word Segmentation Technology Based On Word Classification And Sorting