
The Research And Implementation Of Method For Domain Chinese Word Segmentation

Posted on: 2014-03-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Xiu
Full Text: PDF
GTID: 1268330392973340
Subject: Computer application technology
Abstract/Summary:
In recent years, as natural language processing applications have developed, demand for Chinese word segmentation in specific domains has grown. Because corpus resources in specific domains are scarce, many experiments cannot be carried out. Most Chinese word segmentation methods do not achieve satisfactory results in specific domains and do not meet the needs of practical applications. Research on domain-specific Chinese word segmentation is also limited, and the task remains difficult.
To address domain Chinese word segmentation as its core issue, this dissertation reviews the current state of Chinese word segmentation, surveys existing methods, and analyzes their advantages and the problems they show when applied to specific domains. An instance-based method is used to eliminate segmentation ambiguities. To study fundamental issues of Chinese word segmentation in terms of both corpus quality and corpus quantity, the concepts of feature conformity and potential word segmentation ambiguity are proposed, together with methods for evaluating them, as a basis for domain Chinese word segmentation. Building on this work, the dissertation proposes methods for word segmentation and ambiguity resolution in specific domains that do not rely on a domain training corpus. The main contributions are as follows:
1) An instance-based machine learning method built on stable strings can resolve Chinese word segmentation ambiguity in the general domain. Chinese word segmentation based on conditional random fields eliminates most of the original segmentation ambiguities but introduces new segmentation errors. To solve this problem, the dissertation proposes a simple machine learning method based on "consolidation" to resolve word segmentation ambiguity.
The experimental results show that this method resolves the original word segmentation ambiguities simply and effectively, and does not produce additional ambiguous segmentations.
2) A static theoretical model is proposed for predicting the effect of machine learning, and the influence of the conformity between test corpus and training corpus on machine learning is studied. Experiments show that relative conformity is positively correlated with the machine learning result. The model can quantitatively estimate machine learning performance without labeled corpus information, guide the choice of training set, and reflect the quality of OOV (out-of-vocabulary) words in the Chinese word segmentation task.
3) The statistical methods for OOV words and word segmentation ambiguity are unified, and the effects of both on segmentation results are evaluated objectively. In current analysis methods, OOV statistics are independent of the word segmentation method, whereas segmentation ambiguity statistics depend heavily on it, so it is difficult to evaluate quantitatively how OOV words and segmentation ambiguity affect the results. To solve this problem, the dissertation puts forward the concept of potential word segmentation ambiguity and uses it to measure the effect of segmentation ambiguity on segmentation results. At the same time, the statistical unit for OOV words and segmentation ambiguity is unified. This reveals the underlying reasons why the scale of the test and training corpora affects segmentation results, and points to a direction for improving segmentation performance.
4) Chinese word segmentation methods for specific domains that do not rely on a domain-specific training corpus.
A domain segmentation method combining a vocabulary with machine learning is proposed: machine learning is used to learn word-formation information from a domain glossary, which is then used to identify OOV words in the test corpus. A second domain segmentation method, combining a glossary with unsupervised learning, is also proposed. Using glossary information to overcome the problems of unsupervised learning retains its ability to identify OOV words while correctly segmenting words that already exist in the word table. Experimental data show that both methods improve domain word segmentation.
5) Ambiguity resolution for domain word segmentation based on unsupervised learning. This method uses string frequency, mutual information, and boundary entropy computed from the corpus to resolve word segmentation ambiguity, instead of relying on domain-specific knowledge and training corpora. Experiments show that all three criteria resolve domain segmentation ambiguity to differing degrees and thereby improve domain word segmentation. Segmentation using mutual information performs best and is the most stable.
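To make the three statistics named in item 5 concrete, the following is a minimal sketch, not the dissertation's actual algorithm. It assumes a tiny hypothetical unsegmented corpus, measures mutual information as pointwise mutual information between adjacent characters, measures boundary entropy as the entropy of the symbol following a string, and resolves an overlapping ambiguity by cutting where the adjacent characters are least strongly associated. All names (CORPUS, pmi, right_entropy, choose_segmentation) are illustrative assumptions.

```python
import math
from collections import Counter

# Toy unsegmented corpus (hypothetical data, for illustration only).
CORPUS = ["提供服务", "服务器", "和服务", "客户服务", "和平", "我和你"]


def count_ngrams(corpus, n):
    """Count character n-grams over all sentences (string frequency)."""
    counts = Counter()
    for sent in corpus:
        for i in range(len(sent) - n + 1):
            counts[sent[i:i + n]] += 1
    return counts


UNI = count_ngrams(CORPUS, 1)
BI = count_ngrams(CORPUS, 2)


def pmi(a, b):
    """Pointwise mutual information of adjacent characters a and b."""
    pair = BI[a + b]
    if pair == 0 or UNI[a] == 0 or UNI[b] == 0:
        # Never observed together: treat as a very strong boundary.
        return float("-inf")
    p_ab = pair / sum(BI.values())
    p_a = UNI[a] / sum(UNI.values())
    p_b = UNI[b] / sum(UNI.values())
    return math.log(p_ab / (p_a * p_b))


def right_entropy(s):
    """Entropy of the symbol that follows s ("$" marks the sentence end)."""
    followers = Counter()
    for sent in CORPUS:
        padded = sent + "$"
        start = padded.find(s)
        while start != -1:
            followers[padded[start + len(s)]] += 1
            start = padded.find(s, start + 1)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in followers.values())


def cut_cost(words):
    """Sum of PMI across every cut point; a lower sum means weaker ties are cut."""
    return sum(pmi(left[-1], right[0]) for left, right in zip(words, words[1:]))


def choose_segmentation(candidates):
    """Pick the candidate whose cuts fall on the weakest character pairs.

    Assumes all candidates have the same number of cuts.
    """
    return min(candidates, key=cut_cost)


best = choose_segmentation([["和", "服务"], ["和服", "务"]])
```

On this toy corpus the string 服务 occurs more often than 和服, the pair (服, 务) has higher mutual information than (和, 服), and 服务 has the more varied right context, so all three statistics favor the segmentation 和 / 服务. A real system would need smoothing and a far larger domain corpus.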
Keywords/Search Tags: Domain Chinese word segmentation, Relative conformity, Segmentation ambiguity, Unsupervised, Stable string