
Researches Into New Chinese Words Identification Based On Large-Scale Corpus

Posted on: 2012-07-07    Degree: Doctor    Type: Dissertation
Country: China    Candidate: H J Zhang    Full Text: PDF
GTID: 1118330335962379    Subject: Computer application technology
Abstract/Summary:
New Word Identification (NWI) for Chinese, the process of extracting new words from untagged text corpora and identifying their properties, is an essential task in Chinese information processing. The identification result directly affects the performance of downstream tasks such as Chinese Word Segmentation (CWS) and syntactic analysis, and NWI is also widely applied in areas such as information extraction and machine translation. NWI therefore has both theoretical significance and practical value. Chinese has a strong word-formation ability and no explicit delimiters between words, so any sequence of two or more adjacent characters may form a word, which makes automatic identification of new words difficult. At the same time, the dramatic growth in data volumes poses a further challenge for NWI. This thesis studies NWI and its related technologies on large-scale corpora, combining rule-based and statistical strategies to improve the performance and usability of NWI. The main work and contributions of this thesis are as follows.

First, this thesis designs and implements a domain-independent Framework of New Word Identification (FNWI). FNWI provides a unified plan for the flexibility, scalability, and maintainability of an NWI system; it serves both as the overall design of this thesis and as a well-defined basis for the subsequent work.

Second, this thesis presents a hierarchical-pruning algorithm for extracting repeats efficiently from large-scale corpora. By applying low-frequency character pruning and cascade pruning, the algorithm reduces both the generation of useless strings and the number of I/O passes over the corpus. It can process corpora much larger than main memory, its number of read/write passes grows nearly linearly with corpus size, and it can be configured to extract repeats of a specified frequency or length. The thesis also gives an improved string sorting algorithm with O(dn) time complexity to speed up the merging of candidate repeats.

Third, in the new word detection phase, this thesis presents an efficient method for computing left and right (branching) entropy, which improves detection speed and reduces the influence of unrelated characters. To analyze how different repeat-extraction strategies (character-based and CWS-based) affect new word detection, the thesis gives an evaluation method that combines experimental comparison with quantitative model analysis; based on this method, a pragmatic quantitative model of candidate word omission is proposed and used to guide the implementation of new word detection.

Finally, for the Part-of-Speech (POS) classification of new words, this thesis presents a formal model for POS guessing and solves it with Conditional Random Fields (CRF). Analysis of the model yields the rules and principles of feature selection for POS guessing. The most important characteristic of this POS guessing method is that it relies mainly on the internal features of words, without using contextual POS features, which makes it more practical.
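To make the low-frequency character pruning idea concrete, the following Python fragment is a minimal, in-memory sketch (not the thesis algorithm, which additionally cascades pruning across string lengths and handles corpora larger than memory). The names `extract_repeats`, `min_char_freq`, and `min_freq` are illustrative.

```python
from collections import Counter

def extract_repeats(corpus, max_len=4, min_freq=5, min_char_freq=5):
    """Toy sketch of low-frequency character pruning for repeat extraction.

    A substring can only reach the frequency threshold if every character
    in it does, so rare characters are pruned before longer n-grams are
    counted. This sketch keeps everything in memory for clarity.
    """
    char_freq = Counter(corpus)                         # pass 1: character counts
    frequent = {c for c, f in char_freq.items() if f >= min_char_freq}

    counts = Counter()
    for n in range(2, max_len + 1):                     # pass 2: n-gram counts
        for i in range(len(corpus) - n + 1):
            s = corpus[i:i + n]
            if all(c in frequent for c in s):           # pruning step
                counts[s] += 1
    return {s: f for s, f in counts.items() if f >= min_freq}
```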
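The O(dn) string sort mentioned above is, in spirit, a radix-style sort over the d character positions of the candidate strings. The sketch below is a generic least-significant-digit-first (LSD) illustration under that assumption, not the improved algorithm of the thesis; `lsd_string_sort` and `width` are made-up names.

```python
def lsd_string_sort(strings, width):
    """Radix-style LSD sort: one stable bucket pass per character position,
    roughly O(d * n) for n strings of length at most `width` (sketch)."""
    # Pad shorter strings with a sentinel that sorts before any real character.
    padded = [s.ljust(width, "\0") for s in strings]
    for pos in range(width - 1, -1, -1):        # least significant position first
        buckets = {}
        for s in padded:                        # stable: keeps scan order per bucket
            buckets.setdefault(s[pos], []).append(s)
        padded = [s for key in sorted(buckets) for s in buckets[key]]
    return [s.rstrip("\0") for s in padded]

# e.g. lsd_string_sort(["互联网", "互联", "博客"], width=3)
```

Note that sorting the bucket keys in each pass depends on the alphabet size, which is large for Chinese; this is one reason a tailored variant, as developed in the thesis, pays off.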
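Left/right (branching) entropy is the standard statistic behind the new word detection step: a candidate that can be preceded and followed by many different characters is more likely to be an independent word. The following is a minimal, unoptimised sketch; an efficient implementation, as pursued in the thesis, would accumulate neighbour counts during repeat extraction rather than rescanning the corpus as done here. `branching_entropy` is an illustrative name.

```python
import math
from collections import Counter

def branching_entropy(corpus, candidate):
    """Return (left_entropy, right_entropy) of `candidate` in `corpus` (sketch)."""
    left, right = Counter(), Counter()
    k, start = len(candidate), 0
    while True:
        i = corpus.find(candidate, start)
        if i == -1:
            break
        if i > 0:
            left[corpus[i - 1]] += 1                  # character to the left
        if i + k < len(corpus):
            right[corpus[i + k]] += 1                 # character to the right
        start = i + 1

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(f / total * math.log2(f / total) for f in counter.values())

    return entropy(left), entropy(right)
```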
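For the CRF-based POS guessing, the key point is that features are drawn from inside the word itself (its characters, affixes, length) rather than from surrounding POS tags. The fragment below only sketches such word-internal feature extraction; the formal model, feature templates, and CRF training setup are those defined in the thesis, and the feature names here are invented for illustration.

```python
def word_internal_features(word):
    """Word-internal features for POS guessing of a new word (illustrative).

    No contextual POS information is used -- only properties of the
    characters composing the word itself. Feature dictionaries of this kind
    could be mapped to the template format of any CRF toolkit.
    """
    feats = {
        "length": len(word),
        "first_char": word[0],
        "last_char": word[-1],
        "prefix_2": word[:2],
        "suffix_2": word[-2:],
    }
    for i, ch in enumerate(word):     # per-character identity features
        feats["char_%d" % i] = ch
    return feats

# e.g. word_internal_features("微博") -> {'length': 2, 'first_char': '微', ...}
```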
Keywords/Search Tags: new word identification, repeats, hierarchical pruning, string sort, new word detection, CRF, context features, POS guessing