
Research on Chinese Word Segmentation Based on Phonetic Annotation

Posted on: 2011-11-13
Degree: Master
Type: Thesis
Country: China
Candidate: L Peng
Full Text: PDF
GTID: 2178330338986026
Subject: Software engineering
Abstract/Summary:
The development of computer science, particularly Internet technology, and the accompanying changes in people's daily lives have driven an explosion of information. As a result, Information Retrieval (IR) has become an important part of everyday life. Chinese Word Segmentation (CWS) is a crucial foundation of IR and has a great impact on the accuracy and effectiveness of retrieval: a good Chinese information retrieval system must be built on a good CWS system.

This thesis first reviews the current state of Chinese word segmentation and then focuses on statistics-based segmentation algorithms and the statistical models behind them, namely the conditional maximum entropy model and the Hidden Markov Model (HMM). The work draws heavily on ICTCLAS, the Chinese word segmentation system developed independently by the Institute of Computing Technology, Chinese Academy of Sciences, which is based on a hierarchical hidden Markov model. The system consists of five HMM layers: atom segmentation, N-shortest-path rough segmentation, unknown word recognition, role-based tagging for named entity recognition, and class-based hidden Markov tagging. Among them, the second layer uses the N-shortest-path algorithm to find the N best segmentation results, and the role-based named entity recognition layer uses the Viterbi algorithm to tag the globally optimal sequence. Comparison tests show that every level of the hierarchical HMM plays a positive role in Chinese lexical analysis.

Building on ICTCLAS, this thesis proposes a phonetic-based word tagging algorithm. It annotates the original library using a phonetic dictionary and calculates the position of each word with the Six-Word-Counterpoint Method. Segmentation proceeds as follows. First, each sentence is annotated with pinyin, and a dictionary-based matching algorithm computes up to N candidate segmentation results. Then the position of each pinyin word is marked. Finally, for each candidate, the tagging probability of the segmentation is calculated, and the candidate with the maximum probability is selected as the best result. The algorithm takes effect in the ICTCLAS named entity recognition system. This work lays a foundation for further research and exploration in Chinese word segmentation.
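The candidate-then-select step described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the lexicon `LEXICON`, its probabilities, the length limit `MAX_WORD_LEN`, and the function names are all hypothetical, and a simple unigram product stands in for the tagging probability, whose exact form the abstract does not specify.

```python
from math import log

# Hypothetical toy lexicon: pinyin-syllable tuples mapped to made-up
# unigram probabilities. A real system would load a phonetic dictionary.
LEXICON = {
    ("bei", "jing", "da", "xue"): 0.02,
    ("bei", "jing"): 0.04,
    ("da", "xue"): 0.03,
    ("bei",): 0.005,
    ("jing",): 0.005,
    ("da",): 0.01,
    ("xue",): 0.01,
}
MAX_WORD_LEN = 4  # longest lexicon entry, in syllables (assumption)


def candidates(syllables):
    """Enumerate every segmentation whose words all appear in LEXICON."""
    if not syllables:
        return [[]]
    results = []
    for end in range(1, min(MAX_WORD_LEN, len(syllables)) + 1):
        word = tuple(syllables[:end])
        if word in LEXICON:
            for rest in candidates(syllables[end:]):
                results.append([word] + rest)
    return results


def log_prob(segmentation):
    """Score a segmentation as the sum of log unigram probabilities."""
    return sum(log(LEXICON[word]) for word in segmentation)


def best_segmentations(syllables, n=3):
    """Return up to n candidates, highest probability first."""
    ranked = sorted(candidates(syllables), key=log_prob, reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    for seg in best_segmentations(["bei", "jing", "da", "xue"], n=3):
        print([" ".join(word) for word in seg], round(log_prob(seg), 3))
```

Exhaustive enumeration is used here only for clarity; it grows exponentially with sentence length, which is why systems such as ICTCLAS rely on N-shortest-path search instead.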
Keywords/Search Tags: Chinese word, Conditional maximum entropy model, Hidden Markov model, Phonetic annotation of Chinese word