
Research And System Implementation Of Chinese Word Segmentation In Specialized Fields Based On Conditional Random Fields

Posted on: 2021-03-27
Degree: Master
Type: Thesis
Country: China
Candidate: H L Zhang
Full Text: PDF
GTID: 2428330611456476
Subject: Software engineering
Abstract/Summary:
With the arrival of the information age and the continuous development of computer technology, natural language processing is now applied in many aspects of daily life, and processing language information efficiently is central to research in the field. Chinese word segmentation has received considerable attention because it is the basic technology underlying natural language processing for Chinese. In recent years, work on Chinese word segmentation has been dominated by general-purpose segmenters, which do not produce satisfactory results on text from specialized fields. Improving segmentation quality in specialized fields is therefore an urgent problem. To address it, this thesis proposes a segmentation method for specialized fields based on the conditional random field (CRF) model and develops a system that lets users call the segmentation function directly.

In the preprocessing stage, the thesis builds a general training corpus from an existing general corpus and constructs a specialized training corpus for the field of maternal and child health care according to a set of rules. Because the specialized corpus contains a large number of time expressions, the thesis proposes a new tagging strategy for them. Because the corpus also contains many long technical terms and many abbreviated synonyms, it further proposes a tagging strategy for the specialized corpus described as "long word rounding, synonymous parallelism". A customized feature template is then defined so that the model can generate feature functions and extract features from the word-position tags (lexemes) and the template. For feature extraction and the storage of related information, a Trie tree, which shares common prefixes, is chosen to organize node information in a way that suits the specialized corpus, and the Trie is stored as a double array. The pointers of the linked structure are discarded during construction, which saves space, and a query needs only additions, which greatly reduces lookup time.

In the weight training phase, maximum likelihood estimation is used to learn the parameters of the CRF model, and the L-BFGS algorithm is used to optimize the estimation and make the iterative computation more efficient.

In the prediction phase, to increase prediction speed and reduce the tag interference caused by loosely related meaning groups, the thesis proposes a preprocessing step that makes each comma-separated meaning group the unit of tag prediction. This improves efficiency and accuracy and avoids unnecessary computation. Feature indices are looked up in the double-array Trie, so the originally expensive repeated table lookups are replaced by additions, which shortens query time. The Viterbi algorithm is used for prediction. To shorten prediction time, the thesis improves it with a tag restriction strategy based on rules and thresholds. The rules define which tag types may appear before and after the current tag, and which tag types may appear at the first and last positions of the current meaning group. The threshold decides whether a tag may be kept as a candidate when computing the best path to the next tag. To make the method more widely applicable to specialized corpora, the segmentation output is post-processed by reverse maximum matching against a specialized dictionary, which improves the system's recognition of specialized vocabulary.

Experiments are designed to verify that the improved tagging strategy and the dictionary-matching post-processing increase segmentation accuracy, and that the improved Viterbi algorithm increases segmentation speed. Finally, the thesis designs and implements a specialized word segmentation system for end users.
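The double-array Trie mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the class name `DoubleArrayTrie`, the end-of-word marker `END`, the character-code assignment, and the sample maternal-health terms are all assumptions made for illustration. The sketch shows the two properties the abstract relies on: the nested pointer structure is thrown away after construction, and each lookup step is one addition plus one comparison.

```python
END = "\0"  # hypothetical end-of-word marker, assumed absent from the input


class DoubleArrayTrie:
    """Illustrative double-array Trie: child = base[state] + code(char)."""

    def __init__(self, words):
        words = list(words)
        # Step 1: build an ordinary nested-dict Trie (the "linked" form).
        root = {}
        for word in words:
            node = root
            for ch in word + END:
                node = node.setdefault(ch, {})
        # Step 2: give every character a positive integer code.
        alphabet = sorted({ch for word in words for ch in word} | {END})
        self.codes = {ch: i + 1 for i, ch in enumerate(alphabet)}
        # Step 3: flatten into two arrays; the nested dicts are not kept.
        self.base = [0]
        self.check = [-1]  # -1 marks a free slot
        self._place(root, 0)

    def _grow(self, size):
        extra = size - len(self.check)
        if extra > 0:
            self.base.extend([0] * extra)
            self.check.extend([-1] * extra)

    def _place(self, node, state):
        if not node:
            return
        codes = [self.codes[ch] for ch in node]
        base = 1
        while True:  # find an offset where every child slot is free
            self._grow(base + max(codes) + 1)
            if all(self.check[base + c] == -1 for c in codes):
                break
            base += 1
        self.base[state] = base
        for c in codes:  # reserve all child slots before recursing
            self.check[base + c] = state
        for ch, child in node.items():
            self._place(child, base + self.codes[ch])

    def __contains__(self, word):
        state = 0
        for ch in word + END:
            code = self.codes.get(ch)
            if code is None:
                return False
            nxt = self.base[state] + code  # one addition per character
            if nxt >= len(self.check) or self.check[nxt] != state:
                return False
            state = nxt
        return True


# Hypothetical sample terms, not taken from the thesis corpus.
trie = DoubleArrayTrie(["产妇", "产后", "产后出血", "新生儿"])
print("产后出血" in trie, "产后出" in trie)
```

The first-fit search for `base` keeps the sketch short; production implementations usually track free slots to build faster and pack the arrays more tightly.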
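The rule-and-threshold idea for Viterbi decoding can be sketched as follows. The abstract does not give the tag set, the exact rules, or the form of the threshold, so everything concrete here is an assumption: a four-tag B/M/E/S scheme, the usual legal transitions between those tags, a requirement that a meaning group starts with B or S and ends with E or S, and a threshold implemented as a score margin (`prune_margin`) below the best partial path. The scores are plain numbers standing in for CRF feature sums.

```python
TAGS = ("B", "M", "E", "S")
# Assumed rules: which tags may follow each tag, and which may start/end a group.
ALLOWED = {"B": {"M", "E"}, "M": {"M", "E"}, "E": {"B", "S"}, "S": {"B", "S"}}
START_TAGS = {"B", "S"}
END_TAGS = {"E", "S"}


def constrained_viterbi(emissions, transitions, prune_margin=float("inf")):
    """Best tag path for one meaning group under the rules above.

    emissions: one dict per character mapping tag -> score.
    transitions: dict mapping (previous_tag, tag) -> score, missing means 0.
    prune_margin: previous tags scoring more than this below the best
        partial path are dropped (hypothetical form of the threshold).
    """
    if not emissions:
        return []
    best = {t: emissions[0][t] for t in TAGS if t in START_TAGS}
    backpointers = []
    for scores in emissions[1:]:
        top = max(best.values())
        survivors = {t: s for t, s in best.items() if s >= top - prune_margin}
        new_best, pointers = {}, {}
        for prev, prev_score in survivors.items():
            for cur in ALLOWED[prev]:  # only rule-compliant successors
                score = prev_score + transitions.get((prev, cur), 0.0) + scores[cur]
                if cur not in new_best or score > new_best[cur]:
                    new_best[cur] = score
                    pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    last = max((t for t in best if t in END_TAGS), key=best.get)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def tags_to_words(text, tags):
    """Cut text after every E or S tag."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in END_TAGS:
            words.append(text[start:i + 1])
            start = i + 1
    return words
```

Restricting the successors of each tag shrinks the inner loop from sixteen tag pairs to eight, and the margin removes weak candidates before they are expanded. A tight margin trades exactness for speed, since a pruned tag might have led to the best full path.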
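Reverse maximum matching, the dictionary technique named for post-processing, can be sketched in a few lines. The abstract does not say how the matcher is combined with the CRF output, so this sketch only shows the plain algorithm applied to a raw string. The function name, the `max_len` parameter, and the sample dictionary are assumptions.

```python
def reverse_maximum_match(text, dictionary, max_len=None):
    """Segment text by scanning from the end and taking the longest
    dictionary word that ends at the current position.
    Characters not covered by any dictionary word become single-character words.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    words = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()
    return words


# Hypothetical dictionary for illustration only.
sample_dictionary = {"研究", "研究生", "生命", "起源"}
print(reverse_maximum_match("研究生命起源", sample_dictionary))
```

Scanning from the right yields 研究 / 生命 / 起源 here, whereas a left-to-right longest match would first consume 研究生 and leave 命 stranded. That tendency is a common reason to prefer the reverse direction for Chinese.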
Keywords/Search Tags: Conditional Random Field, Chinese word segmentation, Professional field word segmentation