Research On Chinese Shallow Parsing Based On Statistical Language Model

Posted on:2008-01-04

Degree:Doctor

Type:Dissertation

Country:China

Candidate:H Gao

Full Text:PDF

GTID:1118360218455512

Subject:Computer application technology

Abstract/Summary:

Natural language parsing is an important and difficult task in natural language processing (NLP). To cope with the difficulty of parsing large-scale real texts, many researchers have tried to divide the full parsing problem into several subproblems, so that the difficulty of full parsing is reduced step by step and parsing efficiency is improved. Shallow parsing was proposed for this purpose: it simplifies sentence structure by dividing text into syntactically related, non-overlapping groups that are structurally simple but significant. Shallow parsing, a new technique in NLP, is of great benefit to full parsing. It is also very useful for machine translation and for other NLP tasks that do not require a complete syntactic analysis, such as dictionary compilation, information retrieval, text categorization, summary generation, and question answering.

With the wide application of the empiricist approach in NLP, statistical language models have become the main technique in all kinds of NLP tasks. This thesis studies Chinese shallow parsing, including new word recognition, named entity recognition, and text chunking, based on statistical methods.

For new word recognition, a method combining mutual information and string frequency is presented to recognize new words other than named entities. Single characters, single-character words, and adjacent multi-character words are the possible components of new words. When the mutual information between two adjacent components is computed, the confidence and the length of each component are taken into account, and string frequency is added to the mutual information. The method achieves good results for new word recognition.

Named entities are an important kind of unknown word. Unknown words cause errors in word segmentation, and those segmentation errors in turn make the recognition of unknown words more difficult. To solve this problem, we present a method that performs named entity recognition synchronously with Chinese word segmentation, based on a digraph model. Lexical word candidates and named entity candidates are the vertices of the digraph, and an edge indicates that its two endpoints are adjacent words. Edge weights are computed with an N-gram model so that the optimal segmentation of the sentence corresponds as closely as possible to the shortest path of the digraph. This method improves the accuracy of named entity recognition.

A double-rule AdaBoost (DR-AdaBoost) algorithm is presented and successfully applied to Chinese text chunking. At each round, DR-AdaBoost takes a linear combination of two rules (the optimal rule and the second-optimal rule) as the resulting hypothesis. Experimental results on UCI and CoNLL shared data sets show that DR-AdaBoost converges faster and achieves higher accuracy than AdaBoost. DR-AdaBoost also performs better than AdaBoost on the Chinese text chunking task, and it can be used in other NLP tasks and other classification problems.
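
The digraph formulation described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: it assumes a bigram model (the abstract says only "N-gram"), a fixed probability floor for unseen word pairs, and made-up toy probabilities. The names `segment`, `toy_bigram`, and `floor` are hypothetical. Vertices are candidate words, an edge joins two candidates that are adjacent in the sentence, and each edge weight is a negative log probability, so the most probable segmentation is the shortest path.

```python
import math


def segment(sentence, candidates, bigram_prob, floor=1e-6):
    """Return the words on the shortest path through the candidate digraph.

    candidates:  (start, end) spans of lexical-word and named-entity candidates
    bigram_prob: dict mapping (previous_word, word) -> probability (assumed model)
    floor:       probability used for unseen pairs (an assumption of this sketch)
    """
    n = len(sentence)
    # Vertices are candidate words; "<s>" and "</s>" act as source and sink.
    # Sorting by start offset gives a topological order of the digraph.
    vertices = [(0, 0, "<s>")]
    vertices += sorted((s, e, sentence[s:e]) for s, e in set(candidates))
    vertices.append((n, n, "</s>"))

    best = [math.inf] * len(vertices)  # cheapest known cost to reach each vertex
    back = [None] * len(vertices)      # predecessor on that cheapest path
    best[0] = 0.0
    for j in range(1, len(vertices)):
        start_j, _, word_j = vertices[j]
        for i in range(j):
            _, end_i, word_i = vertices[i]
            # An edge exists only when candidate i ends where candidate j starts.
            if end_i != start_j or best[i] == math.inf:
                continue
            weight = -math.log(bigram_prob.get((word_i, word_j), floor))
            if best[i] + weight < best[j]:
                best[j], back[j] = best[i] + weight, i

    if best[-1] == math.inf:
        raise ValueError("no sequence of candidates covers the sentence")

    # Walk back from the sink to the source, collecting the words in between.
    path = []
    k = back[-1]
    while k != 0:
        path.append(vertices[k][2])
        k = back[k]
    return path[::-1]


# Toy example: all probabilities below are invented for illustration.
sentence = "南京市长江大桥"
candidates = [(0, 2), (0, 3), (2, 4), (3, 5), (3, 7), (4, 7), (5, 7)]
toy_bigram = {
    ("<s>", "南京"): 0.5, ("<s>", "南京市"): 0.4,
    ("南京", "市长"): 0.1, ("市长", "江大桥"): 0.05, ("江大桥", "</s>"): 0.9,
    ("南京市", "长江"): 0.3, ("长江", "大桥"): 0.4, ("大桥", "</s>"): 0.9,
    ("南京市", "长江大桥"): 0.5, ("长江大桥", "</s>"): 0.9,
}
print(segment(sentence, candidates, toy_bigram))
```

Because every edge points from an earlier candidate to a later one, the graph is acyclic and a single pass in sorted order finds the shortest path; a general shortest-path algorithm such as Dijkstra's would give the same result here.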

Keywords/Search Tags:

Statistical Language Model, Chinese Shallow Parsing, New Word Recognition, Named Entity Recognition, Text Chunking

Related items

1	Chinese Named Entity Recognition And Shallow Parsing
2	Study On Chinese Named Entity Recognition
3	Research Of Chinese Named Entity Recognition Based On Recurrent Neural Networks
4	Chinese Named Entity Recognition Based On Neural Network And Language Model
5	Research On Chinese Named Entity Recognition Based On Deep Learning
6	Research On Chinese And English Text Entity Recognition Technology Based On Pre Training Language Model
7	Research On Chinese Named Entity Recognition Based On Deep Learning
8	Research On Chinese Named Entity Recognition And New Word Detection
9	Research On Named Entity Recognition For Science And Technology Terms Based On Dependent Entity Word Vector
10	Chinese Named Entity Recognition Based On Neural Network