Research On Chinese Word Segmentation Method Based On Statistical Learning

Posted on: 2016-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: W Wang
Full Text: PDF
GTID: 2348330512470905
Subject: Software engineering
Abstract/Summary:
As a foundation of Chinese natural language processing, Chinese word segmentation (CWS) has attracted increasing attention, and accurate, fast CWS is important for understanding the content of Chinese sentences and for subsequent processing tasks. CWS methods based on statistical learning train a segmentation model on an annotated corpus and predict the lexeme label of each character in an unseen sentence in order to recognize word boundaries. Owing to the characteristics of Chinese, existing CWS methods have difficulty recognizing named entities, and improved methods remain unsatisfactory in training and prediction efficiency. How to perform CWS correctly and accurately is one of the main problems that Chinese natural language processing needs to solve.

In this paper, the conditional random field is adopted as the CWS model, and an improved CWS method based on statistical learning is presented by analyzing the CWS preprocessing method and the CWS algorithm. First, the characteristics of Chinese words are studied and a compound lexeme label set (CLLS) is presented so that the CWS model can recognize named entities better while introducing few additional parameters; a calculation method for the CWS model adopting CLLS is also proposed. Then, to address the drawback that features acquired by existing model feature extraction (MFE) algorithms cannot express their influence on the labeling result, an improved MFE algorithm is presented. It analyzes the co-occurrence frequency and mutual influence of features to calculate real-valued feature functions and to set a reasonable initial iteration point for model training, which improves training efficiency. Furthermore, to address the low efficiency of existing model training algorithms based on L-BFGS, an improved model training algorithm is proposed that accelerates training and weakens the negative effect of noisy data by setting a reasonable learning step. Moreover, an improved Viterbi-based model prediction algorithm for the CWS model adopting CLLS is proposed, and a traversal pruning strategy is introduced to improve prediction efficiency. Finally, an improved CWS postprocessing algorithm based on error-driven transformation is presented to further improve the accuracy of the CWS method.

Real annotated corpora are adopted as the training and test data sets to validate the rationality and validity of the proposed CWS method. Experimental results show that the proposed method can effectively identify the word boundaries in a given Chinese sentence, and that it achieves better accuracy and relatively good segmentation efficiency compared with other related methods.
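To make the character-labeling formulation concrete, the following is a minimal sketch of how per-character labels are decoded with Viterbi and turned into word boundaries. It assumes the common four-label set B/M/E/S (begin, middle, end, single) rather than the compound lexeme label set proposed in the thesis, whose definition is not given in the abstract. The function names `viterbi` and `labels_to_words` and the hand-written scores are hypothetical; in a real system the scores would come from a trained conditional random field.

```python
from typing import Dict, List

LABELS = ["B", "M", "E", "S"]

# Label transitions permitted by the B/M/E/S scheme (assumed here, not the thesis's CLLS).
ALLOWED = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S"},
    "S": {"B", "S"},
}

NEG_INF = float("-inf")


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring valid label sequence.

    `emissions[i][label]` is an additive (log-like) score for giving
    character i that label. Transition scores are omitted for brevity;
    only the hard validity constraints in ALLOWED are enforced.
    """
    if not emissions:
        return []
    # A sentence can only start with B or S.
    score = {
        lab: emissions[0].get(lab, NEG_INF) if lab in ("B", "S") else NEG_INF
        for lab in LABELS
    }
    back: List[Dict[str, str]] = []
    for em in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in LABELS:
            best_prev, best = None, NEG_INF
            for prev in LABELS:
                if cur in ALLOWED[prev] and score[prev] > best:
                    best_prev, best = prev, score[prev]
            pointers[cur] = best_prev
            new_score[cur] = best + em.get(cur, NEG_INF) if best_prev else NEG_INF
        score = new_score
        back.append(pointers)
    # A sentence can only end with E or S.
    last = max(("E", "S"), key=lambda lab: score[lab])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def labels_to_words(chars: str, labels: List[str]) -> List[str]:
    """Cut the character string after every E or S label."""
    words, current = [], ""
    for ch, lab in zip(chars, labels):
        current += ch
        if lab in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a dangling B/M at the end
        words.append(current)
    return words


# Hand-written illustrative scores for the four characters of "我爱北京".
scores = [
    {"B": 0.1, "M": 0.0, "E": 0.0, "S": 0.9},
    {"B": 0.2, "M": 0.1, "E": 0.1, "S": 0.6},
    {"B": 0.8, "M": 0.1, "E": 0.0, "S": 0.1},
    {"B": 0.1, "M": 0.1, "E": 0.7, "S": 0.1},
]
labels = viterbi(scores)
words = labels_to_words("我爱北京", labels)
```

The hard transition constraints are what distinguish this from picking the best label per character independently: a locally preferred B followed by a locally preferred S is rejected because it would leave a word unfinished.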
Keywords/Search Tags:Chinese word segmentation, statistical learning, conditional random field, model feature extraction