Research On Chinese Shallow Parsing Based On Statistical Language Model

Posted on: 2008-01-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Gao
Full Text: PDF
GTID: 1118360218455512
Subject: Computer application technology
Abstract/Summary:
Natural language parsing is an important and difficult task in natural language processing (NLP). To cope with the difficulty of parsing large-scale real texts, many researchers have tried to divide the full parsing problem into several subproblems, so that the difficulty of full parsing is reduced step by step and parsing efficiency improves. Shallow parsing was proposed for this purpose: it simplifies sentence structure by dividing text into syntactically related, non-overlapping groups that are simple in structure and significant in meaning. Shallow parsing, a new technique in NLP, is of great benefit to full parsing. It is also very useful for machine translation and for other NLP tasks that do not require a complete syntactic analysis, such as dictionary compilation, information retrieval, text categorization, summary generation, and question answering.

With the wide application of empiricist approaches in NLP, statistical language models have become the main technique in all kinds of NLP tasks. In this thesis, Chinese shallow parsing is studied on the basis of statistical methods, covering new word recognition, named entity recognition, and text chunking.

In new word recognition, a method combining mutual information and string frequency is presented to recognize new words other than named entities. Single characters, single-character words, and adjacent multi-character words are treated as possible components of new words. When computing the mutual information between two adjacent components, the confidence of each component and its length are taken into account, and string frequency is incorporated into the mutual information. The method achieves good results for new word recognition.

Named entities are an important kind of unknown word. Unknown words cause errors in word segmentation, and those segmentation errors in turn make the recognition of unknown words more difficult. To solve this problem, we present a method that performs named entity recognition simultaneously with Chinese word segmentation, based on a digraph model. Lexical word candidates and named entity candidates are the vertices of the digraph, and an edge indicates that its two endpoints are adjacent words. The edge weights are computed with an N-gram model so that the optimal segmentation of the sentence corresponds as closely as possible to the shortest path in the digraph. This method improves the accuracy of named entity recognition.

The Double-rule AdaBoost (DR-AdaBoost) algorithm is presented and successfully applied to Chinese text chunking. At each round, DR-AdaBoost takes a linear combination of two rules (the optimal rule and the second-optimal rule) as the resulting hypothesis. Experimental results on UCI and CoNLL shared data sets show that DR-AdaBoost converges faster and achieves higher accuracy than AdaBoost. DR-AdaBoost also outperforms AdaBoost on the Chinese text chunking task, and it can be used in other NLP tasks and other classification problems.
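
The digraph model described in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's actual weighting formula or candidate generation, so everything concrete here is an assumption made for illustration: a bigram model stands in for the N-gram model, edge weights are negative log probabilities, unseen bigrams receive a fixed floor probability, and the function names (`candidate_spans`, `edge_weight`, `segment`), the lexicon, the entity candidate, and all probabilities are hypothetical toy values.

```python
import math
from collections import defaultdict

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (assumed names)


def candidate_spans(sentence, lexicon, entity_candidates, max_len=4):
    """List (start, end, word) vertices: lexicon words, entity candidates, single characters."""
    spans = []
    for i in range(len(sentence)):
        for j in range(i + 1, min(len(sentence), i + max_len) + 1):
            word = sentence[i:j]
            # Single characters are always kept so that some path always exists.
            if j == i + 1 or word in lexicon or word in entity_candidates:
                spans.append((i, j, word))
    return spans


def edge_weight(prev_word, word, bigram_prob, floor=1e-6):
    """Negative log bigram probability; unseen pairs get a small floor (assumed smoothing)."""
    return -math.log(bigram_prob.get((prev_word, word), floor))


def segment(sentence, lexicon, entity_candidates, bigram_prob):
    """Return the word sequence on the shortest path through the candidate digraph."""
    if not sentence:
        return []
    spans = candidate_spans(sentence, lexicon, entity_candidates)
    by_start = defaultdict(list)
    for span in spans:
        by_start[span[0]].append(span)

    source = (0, 0, BOS)
    best = {source: (0.0, None)}  # vertex -> (cost so far, predecessor)
    # Sorting by end position is a topological order: every edge goes from a
    # span ending at position p to a span that starts at p and ends later.
    for vertex in [source] + sorted(spans, key=lambda s: (s[1], s[0])):
        if vertex not in best:
            continue
        cost = best[vertex][0]
        for nxt in by_start[vertex[1]]:
            new_cost = cost + edge_weight(vertex[2], nxt[2], bigram_prob)
            if nxt not in best or new_cost < best[nxt][0]:
                best[nxt] = (new_cost, vertex)

    # Close each complete path with the end-of-sentence edge and keep the cheapest.
    finals = [
        (best[v][0] + edge_weight(v[2], EOS, bigram_prob), v)
        for v in spans
        if v[1] == len(sentence) and v in best
    ]
    _, vertex = min(finals, key=lambda t: t[0])
    words = []
    while vertex != source:
        words.append(vertex[2])
        vertex = best[vertex][1]
    return words[::-1]


# Toy data (hypothetical): a small lexicon, one person-name candidate, and
# made-up bigram probabilities for two competing readings of the sentence.
lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
entity_candidates = {"江大桥"}
bigram_prob = {
    (BOS, "南京市"): 0.3, ("南京市", "长江大桥"): 0.4, ("长江大桥", EOS): 0.5,
    (BOS, "南京"): 0.3, ("南京", "市长"): 0.2, ("市长", "江大桥"): 0.01,
    ("江大桥", EOS): 0.5,
}
print(segment("南京市长江大桥", lexicon, entity_candidates, bigram_prob))
```

With these toy values the cheapest path is 南京市 / 长江大桥. Raising the probabilities along the 南京 / 市长 / 江大桥 path makes the named entity reading win instead, which is the sense in which segmentation and entity recognition are decided together on one graph.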
Keywords/Search Tags: Statistical Language Model, Chinese Shallow Parsing, New Word Recognition, Named Entity Recognition, Text Chunking