A Research On Chinese Word Segmentation Based On Phonetic Annotation

Posted on:2011-11-13

Degree:Master

Type:Thesis

Country:China

Candidate:L Peng

Full Text:PDF

GTID:2178330338986026

Subject:Software engineering

Abstract/Summary:

Along with the development of computer science, particularly Internet technology, and the changes in people's daily lives, the volume of information has grown explosively. Information retrieval (IR) has therefore become an important part of everyday life. As a crucial foundation of IR, Chinese word segmentation (CWS) has a great impact on the accuracy and effectiveness of retrieval: a good Chinese information retrieval system must be built on a good CWS system.

This thesis first introduces the current state of Chinese word segmentation, and then focuses on statistics-based segmentation algorithms and the statistical models behind them, namely the conditional maximum entropy model and the Hidden Markov Model (HMM). ICTCLAS, the Chinese word segmentation system independently developed by the Institute of Computing Technology of the Chinese Academy of Sciences (CAS) on the basis of a hierarchical Hidden Markov Model, was of great help to this work. The system consists of five HMM layers: atomic segmentation, N-shortest-path rough segmentation, unknown word recognition, role-based tagging for named entity recognition, and class-based Hidden Markov tagging. Among them, the second layer uses the N-shortest-path algorithm to find the N best segmentation results, and role-based named entity recognition uses the Viterbi algorithm to tag the globally optimal sequence. Comparative tests show that every level of the Hidden Markov Model plays a positive role in Chinese lexical analysis.

Building on ICTCLAS, this thesis proposes a phonetic-based word tagging algorithm. It annotates the original corpus using a phonetic dictionary and calculates the position of each word with the Six-Word-Counterpoint Method. Segmentation proceeds as follows. First, the sentence is annotated with pinyin, and a dictionary-based matching algorithm produces the top N candidate segmentations. Next, the position of each pinyin word is tagged. Then, for each candidate, the tagging probability of the segmentation is calculated, and the candidate with the highest probability is selected as the best result. The algorithm is applied in the ICTCLAS named entity recognition system. This work lays a foundation for further research on Chinese word segmentation.
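
The abstract names the Viterbi algorithm as the way the globally optimal tag sequence is found. The following is a minimal sketch of Viterbi decoding for a generic HMM, not the ICTCLAS implementation or the thesis's own code. The three position tags (B = word begin, E = word end, S = single-character word), the toy probabilities, and the function names `viterbi` and `tags_to_words` are all assumptions made for illustration; they are not the roles or the Six-Word-Counterpoint tags used in the thesis.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_tag_path, best_log_prob) for an observation sequence."""

    def log(p):
        # Zero probabilities become -inf so impossible paths are never chosen.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path ending in state s at step t.
    best = [{s: log(start_p.get(s, 0.0)) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p].get(s, 0.0))) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Trace back from the best final state to recover the whole path.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


def tags_to_words(chars, tags):
    """Group characters into words: a word closes on tag E or S."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Toy model (hypothetical numbers, chosen only to make the example work).
STATES = ["B", "E", "S"]
START_P = {"B": 0.5, "E": 0.0, "S": 0.5}
TRANS_P = {
    "B": {"E": 1.0},
    "E": {"B": 0.5, "S": 0.5},
    "S": {"B": 0.5, "S": 0.5},
}
EMIT_P = {
    "B": {"北": 0.8, "我": 0.1, "爱": 0.1},
    "E": {"京": 0.8, "爱": 0.2},
    "S": {"我": 0.6, "爱": 0.4},
}

sentence = list("我爱北京")
tags, log_prob = viterbi(sentence, STATES, START_P, TRANS_P, EMIT_P)
words = tags_to_words(sentence, tags)
print(tags, words)
```

With these toy numbers the decoded tags are S, S, B, E, which groups the sentence into 我 / 爱 / 北京. Working in log space avoids numeric underflow on longer sentences; a real system would estimate the start, transition, and emission probabilities from an annotated corpus rather than hard-coding them.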

Keywords/Search Tags:

Chinese word, Conditional maximum entropy model, Hidden Markov model, Phonetic annotation of Chinese word

Related items

1	Research And Application Of Chinese Word Segmentation Based On Conditional Random Fields
2	Research And Implementation Of Chinese Word Segmentation Algorithm
3	Chinese Word Segmentation Based On Maximum Entropy Method Of Effective Substrings
4	The Effect Of Part Of Speech On Chinese Word Segmentation
5	Research And Implementation Of Chinese Lexical Analysis Technology
6	Study The Application And Research Of Hidden Markov Model In Chinese Geo-Entity
7	Chinese Word Segmentation System Design And Implementation
8	The Research On Chinese Text Classification
9	Chinese Word Sense Disambiguation Based On Hidden Markov Model
10	Research Of Named Entity Recognition Based On Conditional Random Fields