Research On Chinese Word Segmentation Method Based On Statistical Learning

Posted on:2016-01-31

Degree:Master

Type:Thesis

Country:China

Candidate:W Wang

Full Text:PDF

GTID:2348330512470905

Subject:Software engineering

Abstract/Summary:

As a foundation of Chinese natural language processing, Chinese word segmentation (CWS) is attracting growing attention, and accurate, fast CWS is important for understanding the content of Chinese sentences and for subsequent processing tasks. CWS methods based on statistical learning train a segmentation model on an annotated corpus and predict the lexeme label of each character in an unseen sentence to recognize word boundaries. Owing to the characteristics of Chinese, existing CWS methods have difficulty recognizing named entities, and improved methods remain unsatisfactory in training and prediction efficiency. How to segment Chinese text accurately is one of the main problems that Chinese natural language processing needs to solve. In this paper, a conditional random field is adopted as the CWS model, and an improved CWS method based on statistical learning is presented by analyzing the CWS preprocessing method and the CWS algorithm. First, the characteristics of Chinese words are studied and a compound lexeme label set (CLLS) is presented so that the CWS model recognizes named entities better while introducing only a few parameters; a calculation method for the CWS model adopting CLLS is also proposed. Second, to address the drawback that features obtained by existing model feature extraction (MFE) algorithms cannot express their influence on the labeling result, an improved MFE algorithm is presented. It analyzes the co-occurrence frequency and mutual influence of features to compute real-valued feature functions and to set a reasonable initial iteration point for model training, which improves training efficiency. Third, to address the low efficiency of existing model training algorithms based on L-BFGS, an improved model training algorithm is proposed that speeds up training and weakens the negative effect of noisy data by setting a reasonable learning step. Fourth, an improved Viterbi-based model prediction algorithm for the CWS model adopting CLLS is proposed, and a traversal pruning strategy is introduced to improve prediction efficiency. Finally, an improved CWS postprocessing algorithm based on error-driven transformation is presented to further improve the accuracy of the CWS method. Real annotated corpora are used as training and test data sets to validate the soundness and effectiveness of the proposed CWS method. Experimental results show that the proposed method effectively identifies word boundaries in a given Chinese sentence, and achieves better accuracy and relatively good efficiency compared with other related methods.
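
To illustrate how per-character labels lead to word boundaries, the following is a minimal Python sketch of Viterbi decoding over a plain four-tag scheme (B = begin, M = middle, E = end, S = single-character word). It is an assumption-laden illustration, not the method of this thesis: the abstract does not specify the CLLS labels, so the common BMES set stands in for them; the per-character scores are hand-made toy values rather than the output of a trained conditional random field; tag transitions are only allowed or forbidden, not weighted; and no pruning strategy is applied. The names `viterbi` and `labels_to_words` are hypothetical.

```python
from typing import Dict, List

TAGS = ("B", "M", "E", "S")

# Assumed BMES constraints: which tag may follow which.
ALLOWED_NEXT = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S"},
    "S": {"B", "S"},
}
START_TAGS = ("B", "S")  # a sentence must start a word
END_TAGS = ("E", "S")    # a sentence must finish a word

NEG_INF = float("-inf")


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the best-scoring valid tag sequence.

    emissions[i][tag] is a (log-)score for giving character i that tag;
    every dict must contain all four tags.
    """
    if not emissions:
        return []
    # Scores of the best path ending in each tag at position 0.
    score = {t: (emissions[0][t] if t in START_TAGS else NEG_INF) for t in TAGS}
    backpointers = []
    for em in emissions[1:]:
        new_score, pointers = {}, {}
        for t in TAGS:
            best_prev, best = None, NEG_INF
            for p in TAGS:
                if t in ALLOWED_NEXT[p] and score[p] > best:
                    best_prev, best = p, score[p]
            new_score[t] = best + em[t]
            pointers[t] = best_prev
        backpointers.append(pointers)
        score = new_score
    # Pick the best admissible final tag, then follow the backpointers.
    tags = [max(END_TAGS, key=lambda t: score[t])]
    for pointers in reversed(backpointers):
        tags.append(pointers[tags[-1]])
    tags.reverse()
    return tags


def labels_to_words(chars: str, tags: List[str]) -> List[str]:
    """Cut the character sequence after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a dangling B/M at the end
        words.append(current)
    return words


# Toy scores for the four characters of "我爱北京" (illustrative values only).
toy_emissions = [
    {"B": -2.0, "M": -5.0, "E": -5.0, "S": -0.1},
    {"B": -2.0, "M": -5.0, "E": -2.0, "S": -0.1},
    {"B": -0.1, "M": -3.0, "E": -3.0, "S": -2.0},
    {"B": -3.0, "M": -3.0, "E": -0.1, "S": -2.0},
]
toy_tags = viterbi(toy_emissions)
toy_words = labels_to_words("我爱北京", toy_tags)
```

In a real system the scores would come from the trained model's feature functions and transition weights, and a richer label set would change `TAGS` and `ALLOWED_NEXT` accordingly.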

Keywords/Search Tags:

Chinese word segmentation, statistical learning, conditional random field, model feature extraction

Related items

1	Research And Application Of Chinese Word Segmentation Method Based On Conditional Random Field
2	Research And System Implementation Of Chinese Word Segmentation In Specialized Fields Based On Conditional Random Fields
3	Research On Chinese Word Segmentation Based On Deep Learning
4	Research And Implement Of Chinese Word Segment Techniques Based On The Conditional Random Field
5	Research Of Chinese Word Segmentation With Conditional Random Fields
6	Research And Application Of Chinese Word Segmentation Based On Conditional Random Fields
7	Research On Road Scene Segmentation Model Based On FCN And Conditional Random Field
8	Research On Chinese Lexical Analysis Model Algorithm Based On Deep Learning
9	Research On Key Techniques For Chinese Word Segmentation With The Combination Of Deep Learning Features And Shallow Machine Learning Features
10	Research And Implement Of Chinese Word Segment Techniques Based On The Conditional Random Field