
The Research And Applications Of Chinese Lexical Analysis

Posted on: 2011-07-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Sun
GTID: 1118360305955731
Subject: Computer application technology
Abstract/Summary:
Words are the smallest meaningful units that can be used independently, and lexical analysis is the basic step for syntactic tagging, semantic tagging, and other deeper corpus processing. Most natural language processing systems, such as machine translation, speech synthesis, information extraction, and document retrieval, treat the word as the basic processing unit, so correct lexical analysis is of great significance. In machine translation and other natural language processing tasks, the identification of words has been, and still is, problematic in Chinese and other Asian languages such as Japanese. Since written Chinese does not use blank spaces to indicate word boundaries, segmenting Chinese text (Chinese word segmentation) is an essential task for Chinese language processing. Besides word segmentation, Chinese lexical analysis must also assign part-of-speech (POS) tags to words and detect unknown words.

First, we propose a pragmatic Chinese lexical analyzer that integrates word-level and character-level information based on the conditional random field (CRF) model. A word lattice, which represents all candidate outputs, is built using the system lexicon. A linear-chain CRF then selects the final token sequence from the word lattice using rich and flexible predefined features. This pragmatic method based on hybrid CRF models offers a solution to the long-standing problems of corpus-based versus statistical, and word-based versus character-based, Chinese lexical analysis.

For comparison, we also extend the character-based Chinese lexical analyzer: several extended dictionaries are added to the system, and corresponding features are imported for Chinese lexical analysis. We used this model to participate in the SIGHAN-6 bakeoff and obtained satisfying results.
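To make the word-lattice idea concrete, the following is a minimal sketch (not the dissertation's implementation) of how all candidate words starting at each character position can be collected from a lexicon; the toy lexicon and sentence are illustrative assumptions.

```python
def build_word_lattice(sentence, lexicon, max_word_len=4):
    """For each start position, list every lexicon word beginning there,
    plus the single character as a fallback edge."""
    lattice = {}
    for start in range(len(sentence)):
        edges = [sentence[start]]  # single-character fallback
        for end in range(start + 2, min(start + max_word_len, len(sentence)) + 1):
            candidate = sentence[start:end]
            if candidate in lexicon:
                edges.append(candidate)
        lattice[start] = edges
    return lattice

# Toy example: both "北京"/"大学" and the longer "北京大学" appear as lattice edges.
lexicon = {"北京", "大学", "北京大学"}
print(build_word_lattice("北京大学", lexicon))
```

A decoder such as a linear-chain CRF (or, more simply, Viterbi over edge scores) would then pick one path through these edges as the final segmentation.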
To meet efficiency demands, we build an integrative Chinese lexical analyzer based on the maximum matching and second-maximum matching algorithms, with encoding and decoding performed by an HMM model. The integrative model therefore achieves higher training and testing speed.

Second, for unknown words in real-world text, we propose a hidden semi-CRF model, which combines the strengths of the latent-dynamic CRF (LDCRF) and the semi-CRF. The proposed hidden semi-CRF incorporates both character-level and word-level features; it is invoked when no matching word can be found in the lexicon, and it can detect unknown words and their corresponding POS tags simultaneously.

Third, based on the results of the pragmatic Chinese lexical analyzer, we build an extended Super Function-based Chinese-Japanese machine translator. We extend the original Super Function in three ways: first, the Super Function is divided into Super Functions for sentences and Super Functions for phrases; second, the scope of the variables is extended; and third, a matching algorithm for Super Functions is proposed. With the extended Super Function, fewer Super Functions need to be stored in the database, while the precision of Chinese-Japanese machine translation is still guaranteed.
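The forward maximum matching algorithm mentioned above is a classic greedy baseline: at each position, take the longest lexicon word that matches. A minimal sketch follows, with a toy lexicon chosen to show the well-known failure case that motivates second-maximum matching and statistical re-scoring (the names and data are illustrative, not from the dissertation).

```python
def forward_max_match(sentence, lexicon, max_word_len=4):
    """Greedily take the longest lexicon word at each position;
    fall back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(sentence):
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + length]
            if length == 1 or word in lexicon:
                tokens.append(word)
                i += length
                break
    return tokens

lexicon = {"研究", "生命", "研究生", "起源"}
print(forward_max_match("研究生命起源", lexicon))
# → ['研究生', '命', '起源'], although 研究/生命/起源 is the intended reading
```

Greedy matching commits to "研究生" ("graduate student") and strands "命", which is exactly the kind of ambiguity that second-maximum matching and HMM decoding are meant to resolve.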
Keywords/Search Tags: Chinese Information Processing, Chinese Lexical Analysis, Conditional Random Fields, Super Function, Machine Translation