The Method Of The Vietnamese Lexical Analysis Research

Posted on:2017-04-22

Degree:Master

Type:Thesis

Country:China

Candidate:M M Xiong

Full Text:PDF

GTID:2308330488464851

Subject:Software engineering

Abstract/Summary:

With the growing popularity of computers and the rapid development of the Internet, the volume of Vietnamese text keeps increasing, so intelligent processing of Vietnamese is urgently needed to extract the necessary information. Against this background, Vietnamese lexical analysis plays an increasingly important role in natural language processing. Its performance directly affects downstream tasks such as syntactic parsing and machine translation. Because lexical analysis is a basic processing step, its errors propagate along the processing chain and ultimately degrade the quality of the applications delivered to end users. To address these problems, and building on a study of existing work, this thesis investigates methods for Vietnamese lexical analysis. The main contributions are as follows.

(1) A Vietnamese lexical analysis corpus of a certain scale was annotated. The corpus is the foundation for research on lexical analysis methods. The structure of Vietnamese news web pages was analyzed, a corresponding crawler was written to collect news text, and the collected text was annotated with word segmentation and part-of-speech tags.

(2) A maximum entropy based method for resolving Vietnamese cross ambiguity was proposed. Vietnamese contains a great deal of ambiguity, which directly affects subsequent tasks such as word segmentation and part-of-speech tagging. For Vietnamese cross-ambiguity fragments, statistical features, contextual features, and features internal to the ambiguous fragments were selected and integrated into a maximum entropy model for classification, yielding a cross-ambiguity resolution model.

(3) A Vietnamese word segmentation model that incorporates cross-ambiguity resolution was proposed. First, based on the characteristics of Vietnamese word formation, syllable N-gram, syllable repetition, and syllable type features were selected and integrated into a conditional random field (CRF) model to build the segmentation model. Then, to resolve segmentation ambiguity, the disambiguation model was added to the word segmentation system to resolve cross-ambiguity fragments, which improved segmentation accuracy.

(4) A part-of-speech (POS) tagging method based on the characteristics of the Vietnamese language was proposed. The tagging model is built in several steps: first, define the available common features and linguistic features; second, preprocess the Vietnamese text and segment the sentences; third, define a tag set of 19 Vietnamese POS tags; fourth, annotate the training corpus; fifth, extract the common and linguistic features defined in the first step from the training corpus and feed them to a support vector machine (SVM) to train the POS tagging model. Finally, segmented Vietnamese sentences are passed to the trained model, which analyzes them and outputs the final tagging results.
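
The abstract names three kinds of syllable-level features for the CRF segmentation model: syllable N-gram, syllable repetition, and syllable type. The Python sketch below illustrates one way such features might be computed per syllable. It is not the thesis's implementation: the function names, the feature keys, the window of one syllable on each side, and the five syllable-type categories are all assumptions made for illustration.

```python
import string


def syllable_type(syl):
    """Coarse syllable class. The five categories are an assumption."""
    if syl.isdigit():
        return "NUM"
    if all(ch in string.punctuation for ch in syl):
        return "PUNCT"
    if syl[0].isupper():
        return "CAP"
    if syl.islower():
        return "LOWER"
    return "OTHER"


def syllable_features(syls, i):
    """Feature dictionary for the syllable at position i (hypothetical layout)."""

    def at(k):
        # Lowercased syllable at position k, or a boundary marker.
        if k < 0:
            return "<BOS>"
        if k >= len(syls):
            return "<EOS>"
        return syls[k].lower()

    prev, cur, nxt = at(i - 1), at(i), at(i + 1)
    return {
        # Syllable N-gram features: unigrams and bigrams in a +/-1 window.
        "uni[0]": cur,
        "uni[-1]": prev,
        "uni[+1]": nxt,
        "bi[-1,0]": prev + "_" + cur,
        "bi[0,+1]": cur + "_" + nxt,
        # Syllable type feature.
        "type": syllable_type(syls[i]),
        # Syllable repetition features: is the syllable repeated next to itself?
        "repeat_prev": cur == prev,
        "repeat_next": cur == nxt,
    }


def sentence_features(sentence):
    """Split on whitespace (Vietnamese syllables are space-separated) and featurize."""
    syls = sentence.split()
    return [syllable_features(syls, i) for i in range(len(syls))]
```

Each syllable yields one feature dictionary, so a sentence becomes a list of dictionaries that a sequence labeler could consume alongside per-syllable word-boundary labels. The labeling scheme and the CRF toolkit are left open here because the abstract does not specify them.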

Keywords/Search Tags:

Maximum Entropy, Ambiguity resolution, CRFs, word segmentation, SVM, POS tagging

Related items

1	Research And Implementation Of Chinese Word Segmentation Based On Character Tagging Method
2	Study On Disambiguation Algorithm For Chinese Word Segmentation
3	Word Segmentation And Pos Tagging In Chinese
4	Chinese POS Tagging Employing Maxent And Word Clustering
5	Tibetan Automatic Word Segmentation And Part-of-speech Tagging Research
6	A Study On Cambodian Word Method Based On Conditional Random Field
7	Research On Word-Segmentation Based On Maximum Entropy Model
8	Study Of Chinese POS Tagging Based On Maximum Entropy
9	Research On The Specification Of Chinese Word Segmentation Designed For Special Domain
10	Chinese Word Segmentation Based On Maximum Entropy Method Of Effective Substrings