
Research On The Specification Of Chinese Word Segmentation Designed For Special Domain

Posted on: 2018-05-10  Degree: Master  Type: Thesis
Country: China  Candidate: L Bai  Full Text: PDF
GTID: 2348330512980208  Subject: Computer Science and Technology
Abstract/Summary:
Chinese word segmentation is a fundamental task in Chinese natural language processing and plays an important role in Chinese information processing. As Chinese information processing has developed in recent years, the demand for word segmentation in specific domains has grown. However, because segmentation-annotated corpora are scarce in specific domains, accuracy drops significantly when a segmentation system built for the general domain is applied to domain-specific text. There are two reasons: (1) there is no word segmentation specification for specific domains to standardize how terminology is segmented, which hurts segmentation precision; (2) mixing domain-specific vocabulary with general-domain vocabulary causes the "cross-border" issue, that is, the crossing ambiguity segmentation problem. Existing word segmentation systems do not handle this problem well, so segmentation accuracy decreases.

To address these two problems, this thesis studies how to establish a word segmentation specification for a specific domain in order to standardize the segmentation of terminology, and annotates a domain-specific corpus to improve segmentation accuracy in that domain. It then proposes a statistical method that incorporates a small amount of annotated data to resolve crossing ambiguities and thereby improve segmentation accuracy. The main work covers the following two aspects.

(1) For the problem of establishing a word segmentation specification in a specific domain, this thesis proposes a decision tree classification method based on statistical features. It takes existing statistical features of vocabulary in the news domain, namely the AV value, boundary entropy, and string frequency, and combines them with vocabulary features from the specific domain to train a classification model that decides how a term should be segmented and thus establishes the segmentation specification. Guided by the resulting specification, the thesis automatically annotates a corpus in the science and technology domain and obtains a large-scale annotated corpus. Experimental results show that the boundary entropy, AV value, and string frequency features give the best results in the decision tree classification model, and that the automatic annotation system built under the guidance of the specification improves word segmentation precision.

(2) Text in a specific domain contains a large number of terms. When a term is adjacent to a general word, the characters at their shared boundary are easily combined into a spurious word, which makes word boundaries more uncertain and lowers segmentation accuracy. This is the crossing ambiguity segmentation problem. To address it, this thesis proposes a local data annotation method based on active learning to obtain a domain-adapted model. The main idea is to segment the domain-specific text with the original model, select the sentences in which general words are segmented incorrectly, and annotate only the local strings around those general-word errors. The annotated corpus is then combined with the original training data to retrain the model so that it adapts to the specific domain. The classification model used is based on CRFs. The experimental results show that the proposed method can resolve crossing ambiguities using only a small amount of annotated data.

In summary, to improve word segmentation accuracy in specific domains, this thesis investigates in depth how to establish a word segmentation specification and proposes a decision tree classification model based on statistical features, which fills a gap in domain-specific segmentation specifications. For the problem of crossing ambiguity in specific domains, it proposes a local annotation method based on active learning. The experimental results verify the effectiveness of these methods.
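
The three statistical features named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract gives no formulas, so the sketch assumes that "AV value" means accessor variety (the smaller of the number of distinct left and right neighbouring characters), that boundary entropy is the smaller of the left and right neighbour entropies, and that a sentence boundary counts as a single pseudo-neighbour. The names `string_stats` and `entropy` are hypothetical.

```python
import math
from collections import Counter


def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by a Counter."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def string_stats(corpus, candidate):
    """Return (string frequency, AV value, boundary entropy) for a candidate string.

    Assumptions (not taken from the thesis):
      - AV = min(#distinct left neighbours, #distinct right neighbours)
      - boundary entropy = min(left neighbour entropy, right neighbour entropy)
      - sentence start/end are each treated as one pseudo-neighbour symbol
    """
    left, right = Counter(), Counter()
    freq = 0
    n = len(candidate)
    for sentence in corpus:
        start = sentence.find(candidate)
        while start != -1:
            freq += 1
            end = start + n
            left[sentence[start - 1] if start > 0 else "<S>"] += 1
            right[sentence[end] if end < len(sentence) else "</S>"] += 1
            # Continue one character later so overlapping matches are counted
            start = sentence.find(candidate, start + 1)
    if freq == 0:
        return 0, 0, 0.0
    av = min(len(left), len(right))
    be = min(entropy(left), entropy(right))
    return freq, av, be
```

A string that occurs often and has varied neighbours on both sides (high AV and high boundary entropy) behaves like an independent word, so feature vectors of this kind could be passed to a decision tree classifier to decide whether a term is kept whole or split.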
Keywords/Search Tags: Chinese Word Segmentation, Word Segmentation Specification, Boundary Entropy, Decision Tree, CRFs