
N-gram Technology Application Study In Computer Processing Of Chinese Language

Posted on: 2010-06-07  Degree: Master  Type: Thesis
Country: China  Candidate: J Qin  Full Text: PDF
GTID: 2178360275985964  Subject: Software engineering
Abstract/Summary:
With the rapid development of computer technology and the Internet, online documents have become one of the major modern information media and an indispensable information source in people's lives. As the Web evolves, people increasingly take the initiative in acquiring, publishing, sharing, and disseminating information, and Web-oriented information processing has become a research hot spot in recent years. Lexical analysis is a foundational task of natural language processing, and it strongly influences syntactic analysis and the applications built on top of it. In this dissertation, lexical analysis includes Chinese word segmentation. Because lexical analysis is a prerequisite step, an early error in it cascades through the processing chain and degrades the final performance of downstream systems such as information retrieval, question answering, and machine translation.

The main difficulties in improving lexical analysis are ambiguity, data sparseness, and the independent-and-identically-distributed assumption. This dissertation focuses on the lexical analysis task and studies it with statistical approaches. On the modeling side, within the supervised learning setting, we explore the N-gram model and the application of N-gram technology to the computer processing of Chinese language, and we carry out in-depth research on N-gram models for Chinese processing based on the above theories and approaches. The dissertation covers the following aspects:

(1) First, the dissertation surveys the current state of the computer processing of Chinese language and of N-gram technology. Approaches to Chinese processing fall into those based on Chinese-language understanding and those based on statistics; the dissertation focuses on the statistical methods.

(2) Second, since lexical analysis is a basic technology in the computer processing of Chinese language, the dissertation reviews its current state, points out its main difficulties, and surveys several statistical models. It analyses the N-gram model and its mathematical formulation, and then introduces smoothing algorithms.

(3) Finally, it designs a multi-model hybrid Chinese word segmentation system for Web text. The dissertation reviews the current state of Chinese word segmentation, identifies the main difficulties, and relates Chinese word segmentation to lexical analysis and the N-gram model. In view of the characteristics of Web text, it focuses on the out-of-vocabulary (OOV) word identification problem in Chinese. An experiment based on the N-gram model is presented, and the results indicate that the N-gram approach is feasible for new-word (neologism) identification. On the side of language features, it investigates several N-gram templates for representing text features.
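For concreteness, the toy sketch below (not taken from the dissertation) illustrates the N-gram idea behind points (2) and (3): an N-gram model approximates P(w_1 ... w_n) by a product of conditional probabilities P(w_i | w_{i-N+1} ... w_{i-1}), smoothing assigns non-zero probability to unseen N-grams, and candidate segmentations of a Chinese string can be ranked by their model score. The tiny corpus, the candidate splits, and the choice of add-one smoothing are all illustrative assumptions, not the system described in the thesis.

# A minimal, hypothetical sketch: a word-bigram model with add-one (Laplace)
# smoothing used to rank candidate segmentations of one Chinese string.
# The corpus and candidate splits are invented for illustration only.
import math
from collections import Counter

# Hypothetical pre-segmented corpus: each sentence is a list of words.
corpus = [
    ["我", "喜欢", "自然", "语言", "处理"],
    ["自然", "语言", "处理", "很", "有趣"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(
    (a, b) for sent in corpus for a, b in zip(["<s>"] + sent, sent + ["</s>"])
)
context_counts = Counter()
for (a, _b), c in bigrams.items():
    context_counts[a] += c
vocab_size = len(unigrams) + 2  # plus the <s> and </s> markers

def bigram_logprob(prev, word):
    # P(word | prev) with add-one smoothing, in log space.
    return math.log((bigrams[(prev, word)] + 1) /
                    (context_counts[prev] + vocab_size))

def score_segmentation(words):
    # Log-probability of one candidate word sequence under the bigram model.
    padded = ["<s>"] + list(words) + ["</s>"]
    return sum(bigram_logprob(a, b) for a, b in zip(padded, padded[1:]))

# Two candidate segmentations of the same string "自然语言处理很有趣".
cand_a = ["自然", "语言", "处理", "很", "有趣"]
cand_b = ["自然语", "言处理", "很", "有趣"]  # contains unseen, OOV-like strings
print(score_segmentation(cand_a) > score_segmentation(cand_b))  # True on this toy data

A real system would train on a large corpus, use stronger smoothing such as Good-Turing or Katz back-off, and search over all candidate segmentations rather than comparing two by hand, but the ranking principle is the same.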
Keywords/Search Tags: N-gram Model, Computer Processing of Chinese Language, Lexical Analysis, Chinese Word Segmentation, Neologism Identification