Research On Methods For Kazakh Lexical Analyzing And Phrase Parsing Based On Rules And Statistics

Posted on:2018-01-10

Degree:Doctor

Type:Dissertation

Country:China

Candidate:L L A D B K G u l i l a Gu

Full Text:PDF

GTID:1318330536481345

Subject:Computer application technology

Abstract/Summary:

Natural language processing has become a significant research topic in the development of information technology across countries and nationalities. With the arrival of the big data era, basic natural language processing technology has shown its advantages and has become a core technology for processing and applying the language and text information of every national language. Lexical Analysis (LA) and Phrase Recognition (PR) are major components of natural language processing, and the quality of their results directly influences the performance of subsequent syntactic analysis, semantic understanding, and the application systems built on them. Because Kazakh is a Low-Resource Language (LRL) and its morphological analysis is a challenging task, the problems in Kazakh lexical analysis and phrase recognition are still not well solved, and many serious challenges remain in these research areas. How to carry out lexical analysis and phrase recognition of Kazakh effectively has become one of the core problems that urgently need to be addressed in Kazakh language information processing.

This thesis focuses on the basic steps of Kazakh lexical analysis and phrase analysis. By analyzing Kazakh morphology and phrase structure, the study established language rules for Kazakh computational linguistics. It built a Kazakh corpus and used rule-based and statistical methods to study word frequency statistics, morphological analysis, stemming, Part-of-Speech (POS) tagging, and phrase recognition. In this way the study carried out "quantitative" research on Kazakh linguistics in place of the traditional "qualitative" research. The results not only provide new technology for further Kazakh information processing but also provide corpus data for Kazakh linguistics research. They can be used in machine translation, speech recognition, information retrieval, and many other applications of Kazakh language processing. Because Kazakh is a cross-border language, the work is also of significance and practical value for the current "One Belt and One Road" project.

Kazakh belongs to the Kipchak group of the Turkic languages in the Altaic language family. It is an agglutinative language in which words are formed by adding derivational or inflectional affixes to root words. This thesis focuses on Kazakh written in the Arabic-based alphabet used in P.R. China. Drawing on the unique characteristics of Kazakh and on rule-based and statistical techniques, the thesis seeks strategies for solving problems in Kazakh lexical and phrase analysis. The main research work covers the following four aspects.

Firstly, to address the scarcity of Kazakh language resources, the thesis standardized the processing, encoding, and storage scheme of the Kazakh corpus. To address word frequency in Kazakh, it proposed a corpus-based method for analyzing and counting word information, which revealed linguistic rules and phenomena in Kazakh word information.

Secondly, for the morphological analysis part of Kazakh lexical analysis, the research investigated in depth morphological analysis, affix segmentation and restoration, and the ambiguity analysis of morphological forms and affixes. First, for the morphological structure of words, a Kazakh analysis technique was proposed that is based on a morphology model built from morphological characteristics. Then, for stemming, a rule-based word segmentation algorithm was investigated. A combined method, "all segmentation + Kazakh rules + morphology model + maximum matching algorithm", was explored and used to perform the morphological analysis of Kazakh.

Thirdly, to solve POS tagging in Kazakh lexical analysis, the study proposed a Kazakh word tagging standard that covers POS tagging, stem tagging, affix tagging, and multi-class word POS tagging as attributes. It proposed a statistical POS tagging method that uses the word, its POS, and its affixes as features. Combining these methods, the Maximum Entropy Model and the Conditional Random Fields Model were used for automatic POS tagging of Kazakh. This led to a Kazakh word POS tagging method based on the Maximum Entropy Model and a multi-class word POS tagging method based on the Conditional Random Fields Model. Exploiting the way Kazakh adds derivational or inflectional affixes to root words, the work also established POS tagging based on morphological analysis. Combining the statistical models with Kazakh language rules produced good POS tagging results.

Finally, for phrase recognition within Kazakh shallow syntactic parsing, the thesis investigated the basic structural system of Kazakh basic phrases and the ambiguity analysis of phrase structure. The study determined the structure of noun, verb, and adjective phrases and their basic rules of composition. It proposed an automatic identification method for basic phrases based on Kazakh linguistic rules, explored statistical phrase identification based on the Maximum Entropy Model and the Conditional Random Fields Model, and established a phrase library.

All in all, this thesis established a Kazakh natural language information processing platform using statistical language models and probabilistic graphical models, in accordance with Kazakh language rules and statistical information processing methods. By combining language rules with statistical modeling techniques, it targeted the lexical analysis and phrase parsing problems of Kazakh. The study systematically examined corpus building, morphological analysis, word frequency statistics, part-of-speech tagging, and basic phrase recognition for Kazakh. It provided solutions suited to Kazakh language information processing and constructed a Kazakh corpus, which lays the foundation for further research on the syntactic and semantic analysis of Kazakh.
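
The abstract names a combined "all segmentation + maximum matching" approach to stemming but gives no details. The sketch below is only an illustration of that general idea, not the thesis's algorithm: it enumerates every way to split a word into a known stem followed by known suffixes, then prefers the candidate with the longest stem. The stem and suffix lists are tiny hypothetical examples in Latin transliteration (the thesis itself works with the Arabic-based script), and the rule and morphology-model stages are omitted.

```python
# Toy lexicon and suffix inventory (hypothetical, Latin transliteration).
STEMS = {"kitap", "bala", "mektep"}
SUFFIXES = {"tar", "ter", "lar", "ler", "dar", "der", "da", "de", "ta", "te"}


def split_suffixes(tail):
    """Return every way to split `tail` into a sequence of known suffixes."""
    if not tail:
        return [[]]
    results = []
    for i in range(1, len(tail) + 1):
        head = tail[:i]
        if head in SUFFIXES:
            for rest in split_suffixes(tail[i:]):
                results.append([head] + rest)
    return results


def all_segmentations(word):
    """'All segmentation' step: list every (stem, suffixes) candidate."""
    candidates = []
    for i in range(1, len(word) + 1):
        stem = word[:i]
        if stem in STEMS:
            for suffixes in split_suffixes(word[i:]):
                candidates.append((stem, suffixes))
    return candidates


def segment(word):
    """Maximum matching step: prefer the longest stem, then fewest suffixes.

    Words with no valid analysis are returned unsegmented.
    """
    candidates = all_segmentations(word)
    if not candidates:
        return (word, [])
    return max(candidates, key=lambda c: (len(c[0]), -len(c[1])))


if __name__ == "__main__":
    for w in ["kitaptarda", "balalar", "mektepte"]:
        print(w, segment(w))
```

In a fuller system, the candidate list from `all_segmentations` is where language rules (for example, vowel harmony checks) and a morphology model could filter or rerank analyses before the final choice is made.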

Keywords/Search Tags:

Kazakh language, corpus, lexical parser, morphology analysis, Part-of-Speech tagging, basic phrase recognition

Related items

1	An Analysis Of Kazak's Lexical Method Based On Web Corpus
2	The Development Of Part-of-speech Tagging Software For Kazakh Language
3	Study Of Kazak Part-of-Speech Tagging Based Upon HMM
4	Research On The Construction Method Of Burmese Part-of-speech Tagging Corpus
5	Research On Kirghiz Basic Part-of-Speech Tagging Based On HMM
6	Research On Part-of-Speech Tagging Algorithms Of Mathematical Corpus Based On Deep Learning
7	Research On Mongolian Lexical Analysis Based On Combination Of Statistical And Rule Approaches
8	Chinese Lexical Analysis Method Based On Morpheme Studies
9	Research On Identification Of Kazakh Basic Noun Phrase Based On Maximum Entropy
10	Research On Chinese Word Segmentation And Part-of-speech Tagging Based On Deep Learning Methods