
Research On Morpheme Analysis Based On Conditional Random Fields In Chinese Natural Language Understanding

Posted on: 2010-06-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Xiong
Full Text: PDF
GTID: 1118360302966604
Subject: Circuits and Systems
Abstract/Summary:
With the continuous development of computer technology and the widespread popularity of the Internet, people urgently need a natural and convenient way to communicate with computers. Speech recognition is a key technology for realizing a human-computer interface that lets computers "understand" human language. The statistical language model, one of the cornerstones of current continuous speech recognition technology, requires the support of natural language processing. For Chinese, morpheme analysis is the basic and key technology of Chinese information processing: it directly affects the subsequent steps of sentence analysis and semantic understanding, and ultimately the performance of practical application systems. Chinese morpheme analysis has therefore long been both a hotspot and a difficulty in Chinese information processing research.

In this dissertation, we first study the model of conditional random fields (CRFs) and its application to Chinese morpheme analysis. The dominant training criteria and parameter optimization methods are analyzed. Against the background of Chinese morpheme analysis, new training criteria based on discriminative principles are studied, a method for overlapping ambiguity resolution based on CRFs is proposed, and an algorithm for new-word extraction and lexicon optimization in specific domains is discussed, all of which provide a new approach to Chinese morpheme analysis. Finally, we briefly describe the application of Chinese morpheme analysis to speech recognition.

Discriminative training criteria for CRFs are investigated first. Current CRF training methods are mainly based on maximum likelihood (ML) or maximum a posteriori (MAP) estimation, which aim to maximize the probability of the correct labeling sequence on the training data.
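As a hedged toy illustration of this objective (all scores below are invented numbers, not parameters from the dissertation), the conditional probability p(y|x) that ML training maximizes can be computed by brute-force enumeration for a tiny linear-chain CRF:

```python
import itertools
import math

# Toy linear-chain CRF over tags {B, I} for a 3-character "sentence".
# Emission and transition scores are arbitrary illustrative numbers.
TAGS = ["B", "I"]
emit = {  # per-position score for each tag
    0: {"B": 1.0, "I": 0.2},
    1: {"B": 0.3, "I": 0.9},
    2: {"B": 0.8, "I": 0.4},
}
trans = {("B", "B"): 0.1, ("B", "I"): 0.7,
         ("I", "B"): 0.6, ("I", "I"): 0.2}

def score(seq):
    """Unnormalized log score of a tag sequence."""
    s = sum(emit[i][t] for i, t in enumerate(seq))
    s += sum(trans[(a, b)] for a, b in zip(seq, seq[1:]))
    return s

# Partition function Z(x): sum over all candidate tag sequences.
Z = sum(math.exp(score(seq)) for seq in itertools.product(TAGS, repeat=3))

def prob(seq):
    """Conditional probability p(y|x) that ML/MAP training maximizes."""
    return math.exp(score(seq)) / Z
```

In practice Z(x) is computed with the forward algorithm rather than by enumeration; the brute-force sum is used here only to keep the sketch self-contained.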
The best sequence selected by such models, however, is not guaranteed to be highly accurate in a real test environment, so there is a mismatch between the training criterion and the performance evaluation metric in sequence labeling tasks. This dissertation proposes a new discriminative training criterion, minimum tag error (MTE), which incorporates sentence tagging accuracy: its objective function maximizes the expected tagging accuracy on the training corpus. To compute this expected accuracy efficiently, a new forward-backward algorithm is presented from which the accuracy expectation is derived. Experiments show that the MTE criterion not only improves the F-score but also significantly increases the out-of-vocabulary recall R_OOV; that is, MTE has a clear advantage in recognizing out-of-vocabulary (OOV) words. MTE training also improves performance in named entity recognition.

Secondly, since probabilistic graphical models such as CRFs do not enjoy the good generalization of the support vector machine (SVM), a new discriminative training method named boosted conditional random fields (BCRF) is proposed, motivated by large-margin theory. The new method inherits the convexity of CRFs, which guarantees a globally optimal solution, and combines it with the generalization ability of large-margin models. BCRF can be understood as enforcing a soft margin between the reference sentence and the hypothesized one that is proportional to their Hamming distance (the number of errors in the hypothesized sentence). Experiments show that the presented method achieves significant improvement over traditional MAP training: it not only improves segmentation accuracy but also increases the performance of OOV identification and named entity recognition.
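The expected tagging accuracy that the MTE criterion maximizes can be made concrete with a hedged toy sketch (invented scores, brute-force enumeration standing in for the dissertation's forward-backward algorithm):

```python
import itertools
import math

TAGS = ["B", "I"]
reference = ("B", "I", "I")  # gold tags for a 3-character toy sentence

# Arbitrary per-position scores standing in for a trained CRF.
emit = {0: {"B": 1.2, "I": 0.1},
        1: {"B": 0.2, "I": 1.0},
        2: {"B": 0.3, "I": 0.9}}

def score(seq):
    """Unnormalized log score of a tag sequence (emissions only, for brevity)."""
    return sum(emit[i][t] for i, t in enumerate(seq))

seqs = list(itertools.product(TAGS, repeat=3))
Z = sum(math.exp(score(s)) for s in seqs)

def accuracy(seq):
    """Fraction of positions tagged correctly: the 'tag accuracy'."""
    return sum(a == b for a, b in zip(seq, reference)) / len(reference)

# MTE objective: expected tagging accuracy under the model distribution,
# summed over every candidate sequence weighted by its probability.
expected_acc = sum(math.exp(score(s)) / Z * accuracy(s) for s in seqs)
```

Unlike the ML/MAP objective, which rewards only the single correct sequence, this quantity also credits near-miss sequences in proportion to how many tags they get right, which is exactly the metric used at evaluation time.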
Compared with the MTE criterion, the segmentation accuracy and recognition performance of BCRF are slightly lower, but its parameter optimization is comparatively simple and requires no additional forward-backward algorithm.

Thirdly, Chinese overlapping ambiguity resolution is discussed. Since the SVM has remarkable advantages in classification tasks and can handle high-dimensional vectors, feature selection and representation based on the SVM are studied for resolving Chinese overlapping ambiguous strings. Based on the two possible segmentation forms of an ambiguous string, four statistical parameters (mutual information, accessor variety, two-character word frequency, and single-character word frequency) are adopted to construct feature vectors of different dimensions, and classification performance is compared across these representations. The experiments show that feature selection and representation are vitally important to classification performance: high-dimensional features built from complementary statistics greatly improve the ambiguity-resolution ability of SVM classifiers. However, SVM classifiers handle ambiguous strings longer than three characters only awkwardly, because such strings must first be converted into multiple three-character ambiguous strings. To solve this problem, a new method based on CRFs is proposed. Instead of treating overlapping ambiguity as a binary classification problem, as traditional methods do, the new method regards it as a sequence labeling problem. The proposed method can handle overlapping ambiguous strings of any length, whether the ambiguity is pseudo or true, while simultaneously considering context information and the dependencies among the predicted labels.
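The reformulation as sequence labeling can be illustrated with a hedged sketch using the common B/M/E/S character tagging convention (assumed here for illustration, not taken from the dissertation): each reading of an ambiguous string is just a different tag sequence, so strings of any length are handled uniformly.

```python
def words_to_tags(words):
    """Encode a segmentation as one tag per character:
    S = single-character word, B/M/E = begin/middle/end of a longer word."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Decode a tag sequence back into words. Any string length works,
    which is the advantage over per-string binary classification."""
    words, cur = [], ""
    for c, t in zip(chars, tags):
        cur += c
        if t in ("S", "E"):
            words.append(cur)
            cur = ""
    if cur:  # tolerate a truncated tag sequence
        words.append(cur)
    return words

# Two readings of the classic overlapping ambiguous string 研究生命:
# 研究生|命 ("graduate student" + "life") vs. 研究|生命 ("study" + "life").
assert words_to_tags(["研究生", "命"]) == ["B", "M", "E", "S"]
assert words_to_tags(["研究", "生命"]) == ["B", "E", "B", "E"]
assert tags_to_words("研究生命", ["B", "E", "B", "E"]) == ["研究", "生命"]
```

A CRF trained over these tags then scores whole tag sequences, so the choice between the two readings is made jointly with the surrounding context rather than by an isolated binary classifier.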
The experimental results show that this method achieves state-of-the-art performance.

New-word extraction and lexicon optimization in specific domains are then studied. Since domain-specific training data are extremely scarce, supervised machine learning methods cannot exploit their advantages. Although dictionary-based maximum matching is simple and efficient, its segmentation accuracy suffers seriously from the lack of a domain-specific lexicon and from constantly appearing new words. In this dissertation, heuristic rules are used to obtain an initial segmentation based on a general lexicon, which serves as the original lexicon. From this initial segmentation, we present an improved method for new-word extraction and lexicon optimization: it selects new words from the candidate word list according to a perplexity minimization criterion and adds them to the original lexicon. The augmented lexicon, which contains the new words, can be regarded as a domain-specific lexicon. To efficiently evaluate the language model perplexity before and after a candidate word is added to the lexicon, a simple substitution method is proposed to approximate the perplexity change. Experiments show that this method not only extracts many domain-specific new words but also reduces the model perplexity and improves segmentation accuracy.

Finally, the application of the language model to speech recognition systems is briefly introduced, and the effects of Chinese morpheme research on statistical language modeling and on speech recognition systems are analyzed.
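The perplexity-minimization test used for candidate selection can be sketched as follows. This is a hedged simplification: it uses a plain unigram model and recomputes perplexity exactly, rather than the dissertation's substitution-based approximation, and comparing per-token perplexity across tokenizations of different lengths is itself a simplification.

```python
import math
from collections import Counter

def unigram_perplexity(tokens):
    """Perplexity of a maximum-likelihood unigram model on its own data."""
    counts = Counter(tokens)
    n = len(tokens)
    log_prob = sum(c * math.log(c / n) for c in counts.values())
    return math.exp(-log_prob / n)

def merge_candidate(tokens, candidate):
    """Re-tokenize: merge each adjacent token pair whose concatenation
    equals the candidate word."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] == candidate:
            out.append(candidate)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def accept_new_word(tokens, candidate):
    """Keep the candidate only if adding it to the lexicon (and thus
    merging its occurrences) lowers the corpus perplexity."""
    before = unigram_perplexity(tokens)
    after = unigram_perplexity(merge_candidate(tokens, candidate))
    return after < before
```

A candidate that occurs frequently as a unit (so that merging it shortens the corpus and concentrates probability mass) lowers the perplexity and is accepted; a candidate that never occurs leaves the perplexity unchanged and is rejected.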
Keywords/Search Tags: Speech recognition, language model, Chinese morpheme analysis, conditional random fields, discriminative training, large margin, support vector machine