| Research on natural language processing for Kazakh has gone through several stages, from part-of-speech tagging and named entity recognition to chunking. We began working on Kazakh parsing several years ago and continue to do so, with good results so far. This paper studies phrase structure parsing for Kazakh, and the author designs and implements a Kazakh parsing system. The system is developed in Microsoft Visual Studio 2015, with C# as the programming language. The language model mainly used for parsing in this paper is the conditional random field (CRF) model, a statistical discriminative model. The CRF model, which is used to label and segment sequential data, was proposed on the basis of the maximum entropy model, and it effectively addresses the label bias problem. The author uses CRF++, a toolkit written and released by Taku Kudo, in version 0.58 running on Windows. In designing and implementing the system, the author completed four main tasks: 1. Corpus processing. The CRF++ toolkit requires training and test data with one word per line, but the existing corpus displays each sentence as a multi-line hierarchy annotated with phrase and part-of-speech tags. The existing corpus therefore has to be converted, which takes four steps: format conversion, phrase segmentation, addition of BIO tags, and division into training and test corpora. 2. CRF usage. The Windows version of the CRF++ toolkit is distributed as executable (.exe) programs, so the system has to call the toolkit through an interface. This stage involves several subtasks: designing and optimizing the feature templates, designing the interface for calling the CRF toolkit, CRF training on the training data
using different feature templates to generate different language models, and CRF testing on the test data with those language models. 3. Model evaluation. Before the language models can be evaluated, the corpus must first be reassembled from the CRF test file. After that, the key evaluation metrics can be calculated. The evaluation standard used in this paper is PARSEVAL, whose core metrics are precision, recall, and F-score. In designing the output Windows Form, the author refers to the output of EVALB, a toolkit developed by Satoshi Sekine of New York University and Michael John Collins of the University of Pennsylvania. The EVALB toolkit is written in Python and runs only on Linux. 4. Parsing demo. First, one or more unprocessed sentences are entered. The software designed by the author then carries out several steps, such as part-of-speech tagging, parsing, and sentence synthesis. Finally, it outputs a hierarchical phrase structure parse with part-of-speech and phrase tags. |
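
The corpus-processing and evaluation steps described above can be illustrated with a minimal Python sketch. It is not the author's C# implementation. The function names `to_bio_rows` and `parseval`, the placeholder tokens (`w1`, `w2`, `w3`), the tag labels, and the three-column layout (word, part-of-speech, BIO tag) are all assumptions made for illustration. CRF++ itself expects one token per line, whitespace-separated columns with the label in the last column, and a blank line between sentences. The metrics follow the usual labeled-bracket definitions: precision is the share of predicted brackets that match the gold standard, recall is the share of gold brackets that were predicted, and F-score is their harmonic mean.

```python
from collections import Counter


def to_bio_rows(chunks):
    """Flatten one sentence into CRF++-style rows (one token per line).

    `chunks` is a list of (phrase_label, [(word, pos), ...]) pairs;
    a phrase_label of None marks tokens outside any phrase (tag "O").
    """
    rows = []
    for label, tokens in chunks:
        for i, (word, pos) in enumerate(tokens):
            if label is None:
                tag = "O"
            else:
                # First token of a phrase gets B-, the rest get I-.
                tag = ("B-" if i == 0 else "I-") + label
            rows.append(f"{word}\t{pos}\t{tag}")
    return rows


def parseval(gold, predicted):
    """Labeled-bracket precision, recall and F-score.

    Each bracket is a (label, start, end) tuple over token positions.
    """
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    # Multiset intersection counts brackets present in both parses.
    matched = sum((gold_counts & pred_counts).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical sentence with placeholder tokens and tags.
sentence = [("NP", [("w1", "ADJ"), ("w2", "N")]), ("VP", [("w3", "V")])]
print("\n".join(to_bio_rows(sentence)))

# Hypothetical gold and predicted brackets for the same sentence.
gold = [("NP", 0, 2), ("VP", 2, 3), ("S", 0, 3)]
predicted = [("NP", 0, 2), ("VP", 1, 3), ("S", 0, 3)]
print(parseval(gold, predicted))
```

In this example two of the three predicted brackets match the gold brackets, so precision, recall, and F-score are all 2/3. Using a multiset intersection keeps a duplicated bracket from being counted as matched more often than it appears in the gold parse.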