
The Method Of The Vietnamese Lexical Analysis Research

Posted on: 2017-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: M M Xiong
Full Text: PDF
GTID: 2308330488464851
Subject: Software engineering
Abstract/Summary:
With the growing popularity of computers and the rapid development of the Internet, the volume of Vietnamese text keeps increasing, so intelligent processing of Vietnamese is urgently needed to extract the necessary information. Under these needs, Vietnamese lexical analysis plays an increasingly important role in natural language processing. Its performance directly affects downstream tasks and applications such as syntactic parsing and machine translation. Because lexical analysis is a basic processing step, its errors propagate along the processing chain and ultimately affect the quality of the application system seen by end users. To address these problems, this thesis studies methods for Vietnamese lexical analysis on the basis of existing research. The main work covers the following aspects.

A Vietnamese lexical analysis corpus of a certain scale was annotated. The corpus is the foundation for research on lexical analysis methods. By analyzing the structure of Vietnamese news web pages, a corresponding crawler was written to collect news text, and the collected text was then segmented into words and tagged with parts of speech.

A maximum-entropy method for resolving crossing ambiguity in Vietnamese was proposed. Vietnamese contains a great deal of ambiguity, which directly affects subsequent tasks such as word segmentation and part-of-speech tagging. For Vietnamese crossing-ambiguity fragments, statistical features, contextual features, and features internal to the ambiguous fragment were selected and integrated into a maximum entropy model for classification, yielding a crossing-ambiguity resolution model.

A Vietnamese word segmentation model that handles crossing ambiguity was proposed.
First, according to the characteristics of Vietnamese word formation, syllable N-gram, syllable repetition, and syllable type features were selected and integrated into a conditional random field (CRF) model to build a CRF-based Vietnamese word segmentation model. To resolve the segmentation ambiguity caused by the influence of words, the disambiguation model was then added to the word segmentation system to resolve crossing-ambiguity fragments, which improved the accuracy of the segmentation system.

A part-of-speech (POS) tagging method based on the characteristics of the Vietnamese language was proposed. The tagging model is built in several steps: first, the available common features and linguistic features are defined; second, the Vietnamese text is preprocessed and its sentences are segmented; third, a set of 19 Vietnamese POS tags is defined; fourth, the training corpus is labeled; fifth, the common and linguistic features of the training corpus are annotated and fed into a support vector machine (SVM) to train the Vietnamese POS tagging model. Finally, segmented Vietnamese sentences are passed to the POS tagging model, which analyzes them and produces the final tagging results.
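
The syllable-level features named above (syllable N-gram, syllable repetition, syllable type) can be illustrated with a minimal sketch. This is not the thesis's implementation: the feature names, the window size, the type categories, and the exact-match test for repetition are all assumptions made for illustration. It only shows how one feature dictionary per syllable, the kind of input a CRF sequence labeler typically consumes, might be built.

```python
import unicodedata


def syllable_type(syl):
    """Coarse orthographic class of a non-empty syllable (hypothetical categories)."""
    if syl.isdigit():
        return "NUM"
    if all(unicodedata.category(ch).startswith("P") for ch in syl):
        return "PUNCT"
    if syl.isupper():
        return "UPPER"
    if syl[0].isupper():
        return "CAP"
    if syl.islower():
        return "LOWER"
    return "OTHER"


def syllable_features(syllables, i):
    """Feature dictionary for the syllable at position i.

    Assumed templates: unigrams and bigrams in a +/-1 window, the
    syllable type, and repetition flags (exact match with a neighbor,
    a simplification of Vietnamese reduplication).
    """
    low = [s.lower() for s in syllables]
    cur = low[i]
    prev = low[i - 1] if i > 0 else "<BOS>"
    nxt = low[i + 1] if i < len(low) - 1 else "<EOS>"
    return {
        "uni[0]": cur,
        "uni[-1]": prev,
        "uni[+1]": nxt,
        "bi[-1,0]": prev + "_" + cur,
        "bi[0,+1]": cur + "_" + nxt,
        "type": syllable_type(syllables[i]),
        "repeat_prev": cur == prev,
        "repeat_next": cur == nxt,
    }


def sentence_features(syllables):
    """One feature dictionary per syllable, in sentence order."""
    return [syllable_features(syllables, i) for i in range(len(syllables))]


if __name__ == "__main__":
    for feats in sentence_features(["Hà", "Nội", "xanh", "xanh", ",", "2017"]):
        print(feats)
```

In a full system, each feature dictionary would be paired with a word-boundary label (for example, begin-of-word versus inside-word) before training the CRF; that labeling scheme is likewise an assumption here, since the abstract does not specify it.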
Keywords/Search Tags: Maximum Entropy, Ambiguity resolution, CRFs, word segmentation, SVM, POS tagging