
Parsing The Internal Structures Of Words

Posted on: 2015-02-13  Degree: Master  Type: Thesis
Country: China  Candidate: Y Fang  Full Text: PDF
GTID: 2268330428998523  Subject: Computer application technology
Abstract/Summary:
Lexical analysis is the most basic and critical step in natural language processing. In Chinese information processing, lexical analysis is usually performed by word segmentation, which identifies the boundaries between words and phrases so that subsequent processing of Chinese can proceed in the same way as for English and other Western languages. However, the boundaries between Chinese words and phrases are vague, and in many cases it is difficult even for linguists to determine whether a language unit is a morpheme, a word, or a phrase. This has led to serious inconsistency in human-annotated corpora, which inevitably restricts subsequent Chinese processing work.

The inconsistency of word segmentation standards appears not only across different corpora but also within the same corpus. In addition, different natural language processing applications have different requirements for word granularity, so it is difficult for a single word segmentation standard to meet them all. Therefore, in view of the shortcomings of current Chinese word segmentation and the requirements of practical applications, we propose an alternative approach to lexical analysis that analyzes the internal structures of words. Compared with traditional word segmentation, analyzing word structures lets us recognize word boundaries and internal structures simultaneously. This approach is more consistent with the fuzzy boundaries between Chinese lexical and syntactic units, addresses the problem of inconsistent corpus standards, and meets the requirements of different applications. In this paper, we focus on the analysis of the internal structures of words and carry out the following research:

First, we present a detailed task definition for word structure analysis and annotate the PKU "People’s Daily" corpus of January 1998 strictly according to that definition. In the experiments, we use the first 80% of the labeled corpus as training data and the remaining 20% as testing data. In addition, no analysis of the internal structures of words has previously been conducted on the PKU corpus, and no ready-made evaluation tools are available. Therefore, we follow the example of the evaluation methods used in syntactic parsing and design an evaluation method suited to the results of word internal structure analysis.

Second, we propose a method based on cascaded CRF models to automatically parse the internal structures of words. The model comprises two parts: a bottom model and a top model. Before word structures are analyzed, the bottom model segments the sequence of Chinese characters at fine granularity. The top model then uses CRF models to identify the structures of the fine-grained word sequence produced by the bottom model. Experimental results show that this method achieves excellent precision in identifying word structures, and that its overall performance reaches a practical level.

Finally, we propose another word structure analysis method that extends the word tag set. The main idea is to treat the prefixes and suffixes within word structures as special words, and then to identify the internal structures of words by identifying those prefixes and suffixes. Compared with the method based on cascaded CRF models, this method overcomes the error propagation caused by fine-grained segmentation, and the final experimental results demonstrate that the performance of structural analysis is improved.
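
The abstract says the evaluation method is modeled on syntactic parsing evaluation but does not give its details. As a rough illustration only, the sketch below scores a predicted word structure against a gold one by matching unlabeled character-span brackets, in the style of bracket precision and recall used for parse trees. The nested-list tree representation, the function names `node_spans`, `sentence_spans`, and `bracket_prf`, and the example word are all assumptions made for this sketch; they are not taken from the thesis or its annotation scheme.

```python
from typing import List, Set, Tuple, Union

# Hypothetical representation: a leaf is a string of characters,
# an internal node is a list of sub-trees.
Tree = Union[str, List["Tree"]]
Span = Tuple[int, int]


def node_spans(tree: Tree, start: int) -> Tuple[Set[Span], int]:
    """Return the (start, end) character spans of every node in `tree`,
    together with the character offset at which the tree ends."""
    if isinstance(tree, str):
        end = start + len(tree)
        return {(start, end)}, end
    collected: Set[Span] = set()
    pos = start
    for child in tree:
        child_spans, pos = node_spans(child, pos)
        collected |= child_spans
    collected.add((start, pos))  # bracket covering the whole node
    return collected, pos


def sentence_spans(words: List[Tree]) -> Set[Span]:
    """Collect brackets for a sentence given as a list of word trees.
    No bracket is added for the sentence itself."""
    collected: Set[Span] = set()
    pos = 0
    for word in words:
        word_spans, pos = node_spans(word, pos)
        collected |= word_spans
    return collected


def bracket_prf(gold: List[Tree], pred: List[Tree]) -> Tuple[float, float, float]:
    """Unlabeled bracket precision, recall, and F1 for one sentence."""
    gold_set = sentence_spans(gold)
    pred_set = sentence_spans(pred)
    matched = len(gold_set & pred_set)
    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative structures for one six-character word (not from the thesis).
gold_example: List[Tree] = [["中国", ["人民", "银行"]]]
pred_example: List[Tree] = [[["中国", "人民"], "银行"]]
print(bracket_prf(gold_example, pred_example))
```

In this example the two analyses share four of their five brackets (the three two-character units and the whole word) and differ only in the intermediate grouping, so precision and recall are both 0.8. A labeled variant would compare (start, end, label) triples instead of bare spans.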
Keywords/Search Tags: Lexical Analysis, Chinese Word Segmentation, Annotation Standard, Internal Structure, Cascaded CRF